Closed: p30arena closed this issue 6 years ago
Hi, sorry for the late reply,
Which filesystem do you use? Do you see any other log output (e.g. in dmesg)? These are just the regular ArangoDB startup messages.
@dothebart The filesystem is XFS:
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/cl_ip42-root xfs 50G 5.8G 45G 12% /
devtmpfs devtmpfs 1.9G 0 1.9G 0% /dev
tmpfs tmpfs 1.9G 0 1.9G 0% /dev/shm
tmpfs tmpfs 1.9G 25M 1.9G 2% /run
tmpfs tmpfs 1.9G 0 1.9G 0% /sys/fs/cgroup
/dev/sda1 xfs 1014M 184M 831M 19% /boot
/dev/mapper/cl_ip42-home xfs 146G 12G 134G 9% /home
tmpfs tmpfs 380M 0 380M 0% /run/user/0
tmpfs tmpfs 380M 0 380M 0% /run/user/1002
tmpfs tmpfs 380M 0 380M 0% /run/user/1001
tmpfs tmpfs 380M 0 380M 0% /run/user/1004
@dothebart arangodb restarted itself. Why? Because of the file descriptors hard limit?
We advise using ext4, since that's where all of our development, testing, and CI takes place. Thus we can recommend neither Btrfs nor XFS.
Usually you should see the reason for a shutdown either in syslog (e.g. the OOM killer) or in the arangodb supervisor log.
This is the supervisor's log:
2017-06-28T13:17:16Z [10963] INFO {startup} supervisor has forked a child process with pid 10964
2017-08-06T07:23:44Z [10963] INFO {startup} received SIGINT for supervisor; commanding client [10964] to shut down.
2017-08-06T07:24:05Z [10963] INFO {startup} waitpid woke up with return value 10964 and status 15 and DONE = true
2017-08-06T07:42:45Z [2955] INFO {startup} supervisor has forked a child process with pid 2956
2017-08-06T07:45:02Z [2955] INFO {startup} received SIGINT for supervisor; commanding client [2956] to shut down.
2017-08-06T07:45:02Z [2955] INFO {startup} waitpid woke up with return value 2956 and status 256 and DONE = true
2017-08-06T07:45:28Z [3149] INFO {startup} supervisor has forked a child process with pid 3150
2017-08-14T11:00:52Z [3149] INFO {startup} received SIGINT for supervisor; commanding client [3150] to shut down.
2017-08-14T11:00:53Z [3149] INFO {startup} waitpid woke up with return value 3150 and status 15 and DONE = true
2017-08-14T11:05:18Z [30520] INFO {startup} supervisor has forked a child process with pid 30521
2017-08-15T09:46:43Z [30520] INFO {startup} received SIGINT for supervisor; commanding client [30521] to shut down.
2017-08-15T09:46:44Z [30520] INFO {startup} waitpid woke up with return value 30521 and status 256 and DONE = true
2017-08-15T09:54:43Z [4915] INFO {startup} supervisor has forked a child process with pid 4916
2017-08-15T09:54:43Z [4915] INFO {startup} waitpid woke up with return value 4916 and status 11 and DONE = false
2017-08-15T09:54:43Z [4915] ERROR {startup} child 4916 died a horrible death, signal 11
2017-08-15T09:54:43Z [4915] ERROR {startup} child only survived for 0 seconds, this will not work - please fix the error first
2017-08-15T09:56:40Z [4971] INFO {startup} supervisor has forked a child process with pid 4972
2017-08-15T09:56:41Z [4971] INFO {startup} waitpid woke up with return value 4972 and status 11 and DONE = false
2017-08-15T09:56:41Z [4971] ERROR {startup} child 4972 died a horrible death, signal 11
2017-08-15T09:56:41Z [4971] ERROR {startup} child only survived for 1 seconds, this will not work - please fix the error first
2017-08-15T09:57:42Z [5312] INFO {startup} supervisor has forked a child process with pid 5313
2017-08-15T09:57:43Z [5312] INFO {startup} waitpid woke up with return value 5313 and status 11 and DONE = false
2017-08-15T09:57:43Z [5312] ERROR {startup} child 5313 died a horrible death, signal 11
2017-08-15T09:57:43Z [5312] ERROR {startup} child only survived for 1 seconds, this will not work - please fix the error first
2017-08-15T10:10:09Z [5953] INFO {startup} supervisor has forked a child process with pid 5954
2017-08-23T09:51:52Z [5953] INFO {startup} waitpid woke up with return value 5954 and status 11 and DONE = false
2017-08-23T09:51:52Z [5953] ERROR {startup} child process 5954 terminated unexpectedly, signal 11 (SIGSEGV). will now start a new child process
2017-08-23T09:51:52Z [5953] INFO {startup} supervisor has forked a child process with pid 19248
2017-09-10T10:35:15Z [5953] INFO {startup} waitpid woke up with return value 19248 and status 11 and DONE = false
2017-09-10T10:35:15Z [5953] ERROR {startup} child process 19248 terminated unexpectedly, signal 11 (SIGSEGV). will now start a new child process
2017-09-10T10:35:15Z [5953] INFO {startup} supervisor has forked a child process with pid 14579
There was no log for: 2017-08-15T09:54:43Z [4915] ERROR {startup} child 4916 died a horrible death, signal 11
Quoted from @jsteemann
Signal 11 in the supervisor logs indicates a SIGSEGV (segmentation violation), which is not supposed to happen.
But this error is from a month ago. I remember that I ignored it and deleted the WAL recovery file to get the daemon running. I lost some data, but it was on development. I wonder why I didn't report it...
But the issue I'm talking about happened on September 11th: some of my collections reverted to a state from somewhere between the 1st and 5th of September.
You should try to provide coredumps of the incidents - are they enabled on your system?
And I can only advise you to put /var/lib/arangodb onto an ext4 partition.
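A quick way to check whether coredumps are enabled is sketched below, assuming typical Linux defaults (your distribution may route core files elsewhere, e.g. via systemd-coredump or abrt on CentOS):

```shell
# Check whether core dumps are enabled for the current shell:
ulimit -c                           # "0" means core dumps are disabled
# Where the kernel writes core files:
cat /proc/sys/kernel/core_pattern
# Enable unlimited core dumps for this shell before starting arangod:
ulimit -c unlimited
```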
@dothebart
I switched to ext4.
I restored some data using arangorestore, after that I restarted the server using:
shutdown -r now
After rebooting, I checked the database, all data was lost.
I had to restore them again.
Another interesting error is that I can't grant a user access to a database.
users.grantDatabase('yabco-dev', 'yabco-dev');
succeeds, but AQL queries fail with "Forbidden".
The default access was "No Access".
I changed it from the admin panel; I thought grantDatabase should work.
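For what it's worth, the @arangodb/users module also accepts an explicit access level as a third argument, which may be worth trying; a sketch reusing the user and database names from above (to be run in arangosh against the server):

```javascript
// arangosh sketch: grant the user 'yabco-dev' access to the
// database 'yabco-dev' with an explicit read/write level.
const users = require('@arangodb/users');
users.grantDatabase('yabco-dev', 'yabco-dev', 'rw');
```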
Please note that the data loss was not related to the permission issue, because I was using the root account, and after restoring I could see the collections.
Ok, do you still see any crashes? If yes, can you try to produce a coredump?
Do I get that correctly that the restore data loss was related to permissions / mounting? Did you correctly mount the data partition via /etc/fstab?
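For reference, mounting a dedicated ext4 data partition via /etc/fstab might look like the sketch below (the device name /dev/sdb1 is a placeholder, not taken from this thread):

```
# /etc/fstab sketch: persistent mount for the ArangoDB data directory.
# /dev/sdb1 is a hypothetical device; use your actual partition or its UUID.
/dev/sdb1  /var/lib/arangodb3  ext4  defaults  0  2
```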
@dothebart Fortunately we haven't experienced crashes on the new server yet. I can't remember enough evidence for the restoration data loss. All I can remember is that I restored data, then restarted the server, then browsed the DB through the admin panel as root. I saw no collections, so I did another restoration, reloaded the page, and then I saw the collections. I'm going to run a test on a VM.
@dothebart I could reproduce it:
1. Install arangodb.
2. Restore two databases (yabco and yabco-dev).
3. Restart the server.
4. The collection lists are available in both yabco and yabco-dev, BUT the collections of yabco-dev are empty; yabco is ok. This is the dump of yabco-dev: https://ufile.io/pqub6
5. I restored yabco-dev again and restarted the server; now the data is persisted.
Why is it happening only for the first restoration? And only for yabco-dev?
the cmd for restoration:
arangorestore --server.username root --server.password "xxx" --server.database yabco-dev --input-directory "dump-dev"
This is the arangod.log: https://ufile.io/ex5xt The WAL recovery at the end is expected and is related to the unclean shutdown I forced on the VM.
The filesystem of the VM is XFS; the filesystem of the server is ext4. It's not related to the filesystem.
@dothebart @jsteemann Guys, any news?
Since we usually don't spend our weekends at work - not yet. Will have a look at this today.
Having a look at these lines:
2017-09-22T17:28:40Z [1463] INFO using endpoint 'http+tcp://127.0.0.1:8529' for non-encrypted requests
2017-09-22T17:28:43Z [1463] INFO ArangoDB (version 3.2.3 [linux]) is ready for business. Have fun!
2017-09-22T17:32:35Z [1639] INFO ArangoDB 3.2.3 [linux] 64bit, using jemalloc, VPack 0.1.30, RocksDB 5.6.0, ICU 58.1, V8 5.7.0.0, OpenSSL 1.0.1e-fips 11 Feb 2013
2017-09-22T17:32:36Z [1639] WARNING {memory} /sys/kernel/mm/transparent_hugepage/enabled is set to 'always'. It is recommended to set it to a value of 'never' or 'madvise'
the shutdown message (INFO ArangoDB has been shut down) is missing.
So we have an unclean shutdown inbetween the two. I've tried to minimize the examples using arangosh, but couldn't reproduce your incident:
./bin/arangod --server.storage-engine rocksdb /tmp/1234 --console
db = require("internal").db
db._createDatabase("db1")
db._createDatabase("db2")
db._useDatabase("db1")
db._create("testcol")
db.testcol.save({"abc": "abc"})
db._useDatabase("db2")
db._create("testcol")
db.testcol.save({"abc": "cde"})
And then restart and try to access the DBs and their collections' contents:
db = require("internal").db
db._useDatabase("db1")
db.testcol.toArray()
db._useDatabase("db2")
db.testcol.toArray()
and the saved DBs, collections, and documents were all accessible. What size do your restores have? Number of collections, size and document count of the collection data?
Ok - I remember something. This is https://github.com/arangodb/arangodb/issues/2437 - you mustn't restore your data twice to the same arangodb instance.
If you want the same dataset available twice, you need to run another instance of arangodb.
@dothebart I can confirm that some of the data was duplicated in both databases. So what do you suggest now? When will this BUG be resolved?
I'm on CentOS 7. Can you help me with making separate instances of arangod?
I start it using this command:
service arangodb3 start
I created this file: /usr/lib/systemd/system/arangodb3-dev.service
# This file is part of package arangodb.
#
# Description:
#
# Used to start ArangoDB
#
# A universal open-source database with a flexible data model
# for documents, graphs, and key-values. Build high
# performance applications using a convenient sql-like query
# language or JavaScript extensions.
#
# http://www.arangodb.com
#
[Unit]
Description=ArangoDB database server
After=sysinit.target sockets.target timers.target paths.target slices.target network.target syslog.target
[Service]
Type=simple
LimitNOFILE=131072
PIDFile=/var/run/arangodb3-dev/arangod.pid
Environment=GLIBCXX_FORCE_NEW=1
# Protect users from making their installation unusable by
# starting arangod with wrong permissions (e.g. as root).
# This will reset the permissions to the working default.
ExecStartPre=/usr/bin/install -g arangodb -o arangodb -d /var/tmp/arangodb3-dev
ExecStartPre=/usr/bin/install -g arangodb -o arangodb -d /var/run/arangodb3-dev
ExecStartPre=/bin/chown -R arangodb:arangodb /var/log/arangodb3-dev
ExecStartPre=/bin/chmod 700 /var/log/arangodb3-dev
ExecStartPre=/bin/chown -R arangodb:arangodb /var/lib/arangodb3-dev
ExecStartPre=/bin/chmod 700 /var/lib/arangodb3-dev
ExecStartPre=/bin/chown -R arangodb:arangodb /var/lib/arangodb3-dev-apps
ExecStartPre=/bin/chmod 700 /var/lib/arangodb3-dev-apps
ExecStart=/usr/sbin/arangod --uid arangodb --gid arangodb --pid-file /var/run/arangodb3-dev/arangod.pid --temp.path /var/tmp/arangodb3-dev --log.foreground-tty true
TimeoutStopSec=3600
TimeoutSec=3600
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Should I duplicate the contents of this dir: /etc/arangodb3/ into a new one: /etc/arangodb3-dev/? Then how would I introduce the new config files to the service file?
Yes, you would copy the /etc/arangodb3 directory. You would pass it along to arangod with -c /etc/arangodb3-dev/arangod.conf. Inside of that config you specify the other database and app paths.
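Putting that together with the unit file above, the ExecStart line of arangodb3-dev.service would then pass the copied config along; a sketch (the other flag values mirror the unit file shown earlier):

```
# Sketch: point the -dev instance at its own config directory.
ExecStart=/usr/sbin/arangod -c /etc/arangodb3-dev/arangod.conf \
    --uid arangodb --gid arangodb \
    --pid-file /var/run/arangodb3-dev/arangod.pid \
    --temp.path /var/tmp/arangodb3-dev \
    --log.foreground-tty true
```

Inside /etc/arangodb3-dev/arangod.conf you would then set a distinct server endpoint, database directory, and app path so the two instances do not collide.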
@dothebart
1. Isn't it better to rename the collections on dev and give them a prefix like dev_users, dev_sites, ...?
2. How can I get the service name in Foxx? Using context.mount or context.collectionPrefix?
3. Should I monkey-patch db._collection?
This is how I access collections:
const usersCollection = db._collection('users');
const is_dev = module.context.collectionPrefix.indexOf('dev') >= 0;
if (is_dev) {
  patch_collection();
}

function patch_collection() {
  const db = require('@arangodb').db;
  const c = db._collection;
  db._collection = (cname) => c('DEV_' + cname);
}
I get these errors:
Failed to load Foxx service mounted at "/yabco-service-dev" via ArangoError: duplicate name
Failed to load Foxx service mounted at "/yabco-service-dev" via ArangoError: service files outdated
Don't tell me that it's similar to the global bug I encountered before... I used to use globals in both my services and messed up everything.
UPDATE: I didn't have enough time to play with the above issues, so I just wiped out my dev DB and set its collection keys to start from 1, incrementing by 1, so we lose the chance of another collision in the short term.
Yes, as you suspected the Foxx code you have is indeed similar to the previous "global bug". You're monkeypatching ArangoDB methods. It's pretty much the same class of problem. Please don't do that.
In general, it's a very bad idea to have implicit global state (as with your earlier problem) or patch external code (read: code that isn't part of your service and that you didn't write). The Ruby community had to learn this the hard way with Rails, the JavaScript community learned this after it had become en vogue to add utility methods to built-in prototypes. It's also true in ArangoDB and Foxx.
If you want the collection method to behave differently, don't use it directly but write a helper function you call instead and have it behave differently depending on the environment.
It looks like you're monkeypatching your own version of module.context.collectionPrefix into db._collection, with the difference that instead of always appending a prefix based on the mount point, you only append a specific prefix if the mount point matches a specific pattern.
Just refactor that function and save it as a utility module you can import in your service:
'use strict';
var db = require('@arangodb').db;
module.exports = function (collectionName) {
  if (module.context.mount.indexOf('dev') !== -1) {
    collectionName = 'DEV_' + collectionName;
  }
  return db._collection(collectionName);
};
Put that file in your service and require the function wherever you need to access a collection by name.
Alternatively, just follow the documentation's example and use module.context.collection(localName) instead of db._collection(actualName), and use module.context.collectionName(localName) to create the collections in your setup script as needed. This means you'll end up with longer collection names because of the slightly more verbose prefixes, but it means you don't have to worry about whether this is the dev copy of the service, the production one, or a staging one, meaning you won't have to think about this again when you inevitably find out you need more than just dev+prod.
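A sketch of what that looks like in a Foxx setup script, using a collection with the local name 'users' as an illustration:

```javascript
// setup.js (Foxx sketch): create the service-prefixed collection if missing.
'use strict';
const db = require('@arangodb').db;
const qualifiedName = module.context.collectionName('users');
if (!db._collection(qualifiedName)) {
  db._createDocumentCollection(qualifiedName);
}
```

Elsewhere in the service, module.context.collection('users') then returns that same collection without the rest of your code ever needing to know the prefixed name.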
In ArangoDB all Foxx services on the same instance live in the same universe together, blissfully unaware of each other. Consider the global object and ArangoDB built-ins or ArangoDB-provided third-party modules like a public shared space in the real world: if someone starts littering or vandalising it, it affects everyone else too. If you redecorate the town hall, everyone can see it.
Thanks @pluma. But I took another naive approach that worked out well: I separated the key spaces of dev and production.
Dev collection keys start from 1; prod keys start from 10,000,000.
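The workaround above can be sketched as a tiny helper (a plain-JavaScript illustration of the key-space split, not ArangoDB API; makeKey is a hypothetical name):

```javascript
// Sketch: dev keys start at 1, production keys at 10,000,000,
// so document keys from the two environments can never collide.
function makeKey(env, counter) {
  const base = env === 'dev' ? 1 : 10000000;
  return String(base + counter);
}

console.log(makeKey('dev', 0));  // "1"
console.log(makeKey('prod', 5)); // "10000005"
```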
Yep, definitely #2437. It drove me nuts.
ok, closing this as a duplicate of https://github.com/arangodb/arangodb/issues/2437 then.