Closed: martincox closed this issue 5 years ago
@matthewvon - long shot after a few years, but does this sound at all familiar?
A possibly related instance here: https://github.com/basho/riak/issues/785
@martincox Is there any way to map the raw addresses back to source code? Maybe a core file? The code used to be built with debug data still attached, solely for isolating issues like this.
And/or there was a technique to get the source-code function addresses from the .so. I believe we used the Linux command-line tool `nm`.
Cheers for replying @matthewvon - did a bit of digging and discovered that in the absence of a dump (not configured - doh!), we can apparently determine the borked point in the code by subtracting the shared object's base load address from the faulting instruction address. So we did this, and addr2line gives us:
```shell
$ printf "0x%X\n" $((0x7ff744c7e58f - 0x7ff744c25000))
0x5958F
$ addr2line -e /usr/lib64/riak/lib/eleveldb-2.0.35-0-gc565e46/priv/eleveldb.so -i 0x5958F
iterator_wrapper.h:41
two_level_iterator.cc:152
```
Now I have a point at which to start looking around, although C++ is not my bag; leveldb is pretty much a black box of magic.
https://github.com/basho/eleveldb/pull/221
In November of 2016, the Basho team discovered that AAE (active anti-entropy) code was using a single leveldb iterator in at least two Erlang processes. The above pull request was one of several changes made to better defend against the bad Erlang code. The Erlang code was never corrected, to the best of my knowledge. Disabling AAE could lead to the product stabilizing. I would guess that AAE might be transitioning to/from a tree rebuild 2 hours after startup.
I would further expect that the memory fencing code / atomics within the eleveldb layer should get another review in all things iterator. It could be that the reference counting needs better atomic objects.
Let me know if you get a core. Would also be useful to know AAE activity on the crashed server. Tree rebuilds happen at regular intervals, so knowledge of prior activity could help determine potential event correlation.
I have found notes from 2014 discussing that two AAE processes use the same iterator. So the November 2016 "rediscovery" was not actually when the problem first became known.
I am working on attempting to re-create the general protection fault by using the same iterator in two processes (outside of riak); if I manage to do so, I will provide the core dump.
Additional information that may be useful:
There is a bug in riak whereby, on shutdown, we attempt to re-build (re-hash) any AAE trees that have their state as 'next_build=incremental'. With large enough AAE trees this takes longer than 60 seconds and causes the riak_core_vnode to crash on shutdown. This would therefore also crash the eleveldb process.
This causes some trees to be re-built from scratch on startup. There were 5 AAE exchanges (different partitions) on the affected node in the 1 hour 30 minutes it was up before the general protection fault.
eleveldb version: 2.0.35
To be clear, leveldb was never designed to have two threads using the same iterator. Fixing Riak's AAE is the real goal.
But I am willing to work on hardening the iterators more. My skill with atomics has increased since I last worked in eleveldb.
There was a goal of using as few mutex objects as possible due to performance loss when swapping between Riak scheduler threads and eleveldb threads. It may be time to abandon that goal.
That makes sense. Yeah, the goal I wish to achieve is to re-create the issue in a test, just to prove that this is the cause of the general protection fault.
If that's the case, then I will be happy to proceed with work in riak_core on hashtree.erl, and in riak_kv on riak_kv_index_hashtree.erl, to remove the possibility of the same iterator being used in two processes and ensure this does not happen again.
Okay so there is actually code to provide a locking mechanism for AAE.
We have managed to pinpoint the race condition: on node shutdown we can cause that locking mechanism to be violated, allowing the same iterator to be used in two different processes (on a different node), which then causes a general protection fault or a segfault.
Really appreciate your help; we wouldn't have had the first idea where to look. We will implement a fix for this and patch AAE.
Thanks again
Observed a crashed node which looks to be the result of a general protection error in eleveldb:
This occurred around 2 hours after the node had been restarted. I'd seen similar issues reported which looked to be occurring on startup, which had been resolved by @matthewvon in 2.0.11 - http://lists.basho.com/pipermail/riak-users_lists.basho.com/2016-January/037662.html
We don't really have much more detail than this at the moment, but will continue to investigate and update as we progress.