Closed: martincox closed this issue 5 years ago
@matthewvon - long shot after a few years, but does this sound at all familiar?
A possibly related instance here: https://github.com/basho/riak/issues/785
@martincox Is there any way to map the raw addresses back to source code? Maybe a core file? The code used to be built with debug data still attached, solely for isolating issues like this.
And/or there was a technique to get the source-code function addresses from the .so. I believe we used the Linux command-line tool `nm`.
Cheers for replying @matthewvon - did a bit of digging and discovered that in the absence of a dump (not configured - doh!), we can apparently determine the borked point in the code by subtracting the shared object's base load address from the faulting instruction address. So we did this, and addr2line gives us:
```shell
$ printf "0x%X\n" $((0x7ff744c7e58f - 0x7ff744c25000))
0x5958F
$ addr2line -e /usr/lib64/riak/lib/eleveldb-2.0.35-0-gc565e46/priv/eleveldb.so -i 0x5958F
iterator_wrapper.h:41
two_level_iterator.cc:152
```
Now I have a point at which to start looking around, although C++ is not my bag; leveldb is pretty much a black box of magic.
https://github.com/basho/eleveldb/pull/221
In November of 2016, the Basho team discovered that AAE (active anti-entropy) code was using a single leveldb iterator in at least two Erlang processes. The above pull request was one of several changes made to better defend against the bad Erlang code. The Erlang code was never corrected, to the best of my knowledge. Disabling AAE could lead to the product stabilizing. I would guess that AAE might be transitioning to/from a tree rebuild 2 hours after startup.
I would further expect that the memory fencing code / atomics within the eleveldb layer should get another review in all things iterator. It could be that the reference counting needs better atomic objects.
Let me know if you get a core. Would also be useful to know AAE activity on the crashed server. Tree rebuilds happen at regular intervals, so knowledge of prior activity could help determine potential event correlation.
I have found notes from 2014 discussing that two AAE processes use the same iterator. So the November 2016 "rediscovery" was not actually when the problem first became known.
I am working on attempting to re-create the general protection fault by using the same iterator in two processes (outside of riak); if I manage to do so, I will provide the core dump.
Additional information that may be useful:
There is a bug in riak whereby, on shutdown, we attempt to re-build (re-hash) any AAE trees that have their state as 'next_build=incremental'. With large enough AAE trees this takes longer than 60 seconds and causes the riak_core_vnode to crash on shutdown. This would therefore also crash the eleveldb process.
This causes some trees to be re-built from scratch on startup. There were 5 AAE exchanges (different partitions) on the affected node in the 1 hour 30 minutes it was up before the general protection fault.
eleveldb version: 2.0.35
To be clear, leveldb was never designed to have two threads using the same iterator. Fixing Riak's AAE is the real goal.
But I am willing to work on hardening the iterators more. My skill with atomics has increased since I last worked in eleveldb.
There was a goal of using as few mutex objects as possible due to performance loss when swapping between Riak scheduler threads and eleveldb threads. It may be time to abandon that goal.
That makes sense. Yeah, the goal I wish to achieve is to re-create the issue in a test, just to prove that this is the cause of the general protection fault.
If that's the case, then I will be happy to proceed with work in riak_core on hashtree.erl, and in riak_kv on riak_kv_index_hashtree.erl, to remove the possibility of the same iterator being used in two processes and ensure this does not happen again.
Okay so there is actually code to provide a locking mechanism for AAE.
We have managed to pinpoint the race condition: on node shutdown we can cause that locking mechanism to be violated, allowing the same iterator to be used in two different processes (on a different node), which then causes a general protection fault or a segfault.
Really appreciate your help; we wouldn't have had the first idea where to look. We will implement a fix for this and patch AAE.
Thanks again
Observed a crashed node which looks to be the result of a general protection error in eleveldb:
This occurred around 2 hours after the node had been restarted. I'd seen similar issues reported which looked to be occurring on startup, which had been resolved by @matthewvon in 2.0.11 - http://lists.basho.com/pipermail/riak-users_lists.basho.com/2016-January/037662.html
We don't really have much more detail than this at the moment, but will continue to investigate and update as we progress.