basho / eleveldb

Erlang LevelDB API
261 stars 178 forks source link

Deadlock between MakeRoomForWrite() and AwaitCloseAndDestructor() #108

Closed slfritchie closed 10 years ago

slfritchie commented 10 years ago

eleveldb revision: 639f69c96757cc5ddaf2bbacc0ef10a6e405c7bf leveldb revision: d553b6c4bf525c0301b3d2f128be0a63df930ee4

I see a deadlock that immobilizes the entire scheduler thread. The first suspicious stacktrace is:

Thread 17 (process 20406):
#0  0x0000000017a5f0fa in __psynch_cvwait ()
#1  0x00000000178f3fb9 in _pthread_cond_wait ()
#2  0x0000000023604bbb in eleveldb::ErlRefObject::AwaitCloseAndDestructor () at /Users/fritchie/b/src/eleveldb/c_src/refobjects.h:155
#3  0x0000000023604bbb in eleveldb::DbObject::DbObjectResourceCleanup (Env=<value temporarily unavailable, due to optimizations>, Arg=0x20a40090) at c_src/refobjects.cc:274
#4  0x000000001714e40c in nif_resource_dtor (bin=0x20a40068) at beam/erl_nif.c:1303
#5  0x000000001706de05 in erts_bin_free (bp=0x20a40068) at erl_binary.h:303
#6  0x000000001706dd2a in erts_cleanup_offheap (offheap=0x211cf110) at beam/erl_message.c:174
#7  0x0000000017081c3b in delete_process (p=0x211ceeb0) at beam/erl_process.c:8001
#8  0x0000000017080fc7 in erts_continue_exit_process (p=0x211ceeb0) at beam/erl_process.c:9034
#9  0x000000001708084d in erts_do_exit_process (p=0x211ceeb0, reason=17035) at beam/erl_process.c:8839
#10 0x00000000171ff749 in terminate_proc (c_p=0x211ceeb0, Value=17035) at beam/beam_emu.c:5281
#11 0x00000000171fc29b in handle_error (c_p=0x211ceeb0, pc=0x0, reg=0x208081c0, bf=0x0) at beam/beam_emu.c:5156
#12 0x00000000171f299d in process_main () at beam/beam_emu.c:3276
#13 0x00000000170764d5 in sched_thread_func (vesdp=0x21414600) at beam/erl_process.c:5801
#14 0x00000000172d29c4 in thr_wrapper (vtwd=0x7fff58c33dd8) at pthread/ethread.c:106
#15 0x00000000178ef772 in _pthread_start ()
#16 0x00000000178dc1a1 in thread_start ()

And the second:

Thread 73 (process 20406):
#0  0x0000000017a5f0fa in __psynch_cvwait ()
#1  0x00000000178f3fb9 in _pthread_cond_wait ()
#2  0x000000002363aa6f in leveldb::port::CondVar::Wait (this=0x7fa493200a30) at port_posix.h:48
#3  0x000000002360fed3 in leveldb::DBImpl::MakeRoomForWrite (this=0x7fa493200880, force=<value temporarily unavailable, due to optimizations>) at db/db_impl.cc:1836
#4  0x000000002360f7a8 in leveldb::DBImpl::Write (this=0x7fa493200880, options=@0x7fa493406140, my_batch=0x7fa493400d60) at db/db_impl.cc:1635
#5  0x00000000236044d9 in leveldb::Status::ok () at /Users/fritchie/b/src/eleveldb/c_src/leveldb/include/leveldb/status.h:159
#6  0x00000000236044d9 in eleveldb::WriteTask::operator() (this=0x7fa493404e10) at workitems.h:161
#7  0x0000000023606893 in eleveldb::eleveldb_thread_pool::notify_caller (work_item=@0x7fa493404e10) at c_src/threading.cc:309
#8  0x0000000023606838 in eleveldb::WorkTask::resubmit () at /Users/fritchie/b/src/eleveldb/c_src/workitems.h:373
#9  0x0000000023606838 in eleveldb::eleveldb_write_thread_worker (args=0x7fa4934028e0) at c_src/threading.cc:374
#10 0x00000000171355b5 in erl_drv_thread_wrapper (vdtid=0x20f845c0) at beam/erl_drv_thread.c:117
#11 0x00000000172d29c4 in thr_wrapper (vtwd=0xb08bb988) at pthread/ethread.c:106
#12 0x00000000178ef772 in _pthread_start ()
#13 0x00000000178dc1a1 in thread_start ()
slfritchie commented 10 years ago

"LOG" file contents:

2014/03/28-17:34:55.454901 b1873000             Options.comparator: leveldb.InternalKeyComparator
2014/03/28-17:34:55.454930 b1873000      Options.create_if_missing: 1
2014/03/28-17:34:55.454935 b1873000        Options.error_if_exists: 0
2014/03/28-17:34:55.454939 b1873000        Options.paranoid_checks: 0
2014/03/28-17:34:55.454942 b1873000     Options.verify_compactions: 1
2014/03/28-17:34:55.454946 b1873000                    Options.env: 0x7fe23b8009c0
2014/03/28-17:34:55.454949 b1873000               Options.info_log: 0x7fe23b804da0
2014/03/28-17:34:55.454953 b1873000      Options.write_buffer_size: 16777216
2014/03/28-17:34:55.454956 b1873000         Options.max_open_files: 1000
2014/03/28-17:34:55.454959 b1873000            Options.block_cache: 0x7fe23c006e00
2014/03/28-17:34:55.454962 b1873000             Options.block_size: 4096
2014/03/28-17:34:55.454966 b1873000       Options.block_size_steps: 16
2014/03/28-17:34:55.454969 b1873000 Options.block_restart_interval: 16
2014/03/28-17:34:55.454972 b1873000            Options.compression: 0
2014/03/28-17:34:55.454975 b1873000          Options.filter_policy: NULL
2014/03/28-17:34:55.454979 b1873000              Options.is_repair: false
2014/03/28-17:34:55.454982 b1873000         Options.is_internal_db: false
2014/03/28-17:34:55.454985 b1873000      Options.total_leveldb_mem: 0
2014/03/28-17:34:55.454988 b1873000  Options.block_cache_threshold: 33554432
2014/03/28-17:34:55.454992 b1873000  Options.limited_developer_mem: false
2014/03/28-17:34:55.454995 b1873000       Options.delete_threshold: 1000
2014/03/28-17:34:55.454998 b1873000       Options.fadvise_willneed: false
2014/03/28-17:34:55.455001 b1873000                         crc32c: hardware
2014/03/28-17:34:55.455005 b1873000                File cache size: 278921216
2014/03/28-17:34:55.455008 b1873000               Block cache size: 281018368
2014/03/28-17:34:55.465354 b1873000 Delete type=3 #1
2014/03/28-17:34:55.465457 b1873000 UpdatePenalty: 0
2014/03/28-17:34:55.493983 b1873000 UpdatePenalty: 0
2014/03/28-17:34:55.494453 b3abc000 Level-0 table #5: started
2014/03/28-17:34:55.530660 b1873000 waiting 2...
2014/03/28-17:34:55.530681 b1873000 UpdatePenalty: 0
2014/03/28-17:34:55.591396 b3abc000 Level-0 table #5: 16847612 bytes, 170 keys IO error: /tmp/generic.qc/sst_0/000005.sst: No space left on device
2014/03/28-17:34:55.591427 b3abc000 Waiting after background imm compaction error: IO error: /tmp/generic.qc/sst_0/000005.sst: No space left on device
2014/03/28-17:34:55.591462 b1873000 running 2...
2014/03/28-17:34:55.591477 b1873000 waiting 2...
2014/03/28-17:34:55.591483 b1873000 UpdatePenalty: 0
2014/03/28-17:34:56.592507 b3abc000 UpdatePenalty: 0
2014/03/28-17:34:56.592592 b1873000 running 2...
2014/03/28-17:34:56.592617 b1873000 waiting 2...
2014/03/28-17:34:56.592635 b1873000 UpdatePenalty: 0
slfritchie commented 10 years ago

If I remove an exit() call on the Erlang side, I see a deadlock involving only 1 thread: at the same place as above, the leveldb::DBImpl::MakeRoomForWrite (this=0x7fd9a3100a80, force=<value temporarily unavailable, due to optimizations>) at db/db_impl.cc:1836 thread.

jonmeredith commented 10 years ago

Does @matthewvon How do you reproduce the issue @slfritchie ?

slfritchie commented 10 years ago

I'm fighting QuickCheck to get a somewhat-deterministic case.

matthewvon commented 10 years ago

Don't waste your time making a deterministic case. I see what is happening … shutdown with pending writes and full disk. Compaction knows to stop attempting compactions, but write thread does not know to quit waiting upon compactions.

matthewvon commented 10 years ago

Add:

if (!shuttingdown.Acquire_Load())

before each of the bgcv.Wait() calls in MakeRoomForWrite() … and all will be fine.

slfritchie commented 10 years ago

Matthew, I suspect that there's a larger problem, and the 2-thread case above got caught in it?

The larger problem appears to be that a eleveldb_thread_pool thread appears to get stuck in leveldb::DBImpl::Write and leveldb::DBImpl::MakeRoomForWrite while all other threads appear sitting in their main condvar wait loops? {shrug} It's quite easy and quick to reproduce a case of this, where close() is not a factor.

I've added a new command to the QuickCheck model that will write 10s/100s/1000s of keys sequentially (without other model operations interleaved). The new command is called 'put_filler'. An example test case looks like:

[{set,{var,1},
           {call,generic_qc_fsm,set_keys,
                 [[<<249,57,0,0,0,0>>],{var,parameter_test_dir}]}},
      {set,{var,2},
           {call,generic_qc_fsm,open,
                 [{var,parameter_test_dir},
                  [read_write,
                   {open_timeout,0},
                   {max_file_size,100},
                   {sync_strategy,none},
                   {create_if_missing,true},
                   {limited_developer_mem,true},
                   {compression,false},
                   {write_buffer_size,16000000}]]}},
      {set,{var,3},
           {call,generic_qc_fsm,put_filler,[{var,2},{1273,<<"]+">>},34084]}},
      {set,{var,4},
           {call,generic_qc_fsm,put,[{var,2},<<249,57,0,0,0,0>>,<<144,2>>]}},
      {set,{var,5},
           {call,generic_qc_fsm,put_filler,
                 [{var,2},{2462,<<17,106>>},113163]}},
      {set,{var,6},
           {call,generic_qc_fsm,put,[{var,2},<<249,57,0,0,0,0>>,<<"@">>]}},
      {set,{var,7},{call,generic_qc_fsm,get,[{var,2},<<249,57,0,0,0,0>>]}},
      {set,{var,8},
           {call,generic_qc_fsm,put_filler,[{var,2},{2853,<<23>>},33185]}},
      {set,{var,9},{call,generic_qc_fsm,delete,[{var,2},<<249,57,0,0,0,0>>]}}]

Without fault injection, this kind of test case has no problem: I can run for hours without a single problem. However, it's impossible to go for 1 minute without a problem caused by the put_filler command getting stuck. In the above case, the first filler command writes 1273 keys of with values of 34084 bytes. That command worked just fine. The 2nd filler tries to write 2462 keys with values of 113163 bytes and gets stuck.

Stack traces from all threads is available at https://gist.github.com/slfritchie/fb238f5c48ff57788507

The "LOG" file looks like this after getting stuck:

2014/04/01-12:03:13.225374 b1873000             Options.comparator: leveldb.InternalKeyComparator
2014/04/01-12:03:13.225404 b1873000      Options.create_if_missing: 1
2014/04/01-12:03:13.225409 b1873000        Options.error_if_exists: 0
2014/04/01-12:03:13.225413 b1873000        Options.paranoid_checks: 0
2014/04/01-12:03:13.225416 b1873000     Options.verify_compactions: 1
2014/04/01-12:03:13.225420 b1873000                    Options.env: 0x7ff70bb00460
2014/04/01-12:03:13.225423 b1873000               Options.info_log: 0x7ff70ba00740
2014/04/01-12:03:13.225427 b1873000      Options.write_buffer_size: 16000000
2014/04/01-12:03:13.225430 b1873000         Options.max_open_files: 1000
2014/04/01-12:03:13.225434 b1873000            Options.block_cache: 0x7ff711800e00
2014/04/01-12:03:13.225437 b1873000             Options.block_size: 4096
2014/04/01-12:03:13.225440 b1873000       Options.block_size_steps: 16
2014/04/01-12:03:13.225444 b1873000 Options.block_restart_interval: 16
2014/04/01-12:03:13.225447 b1873000            Options.compression: 0
2014/04/01-12:03:13.225450 b1873000          Options.filter_policy: NULL
2014/04/01-12:03:13.225454 b1873000              Options.is_repair: false
2014/04/01-12:03:13.225457 b1873000         Options.is_internal_db: false
2014/04/01-12:03:13.225460 b1873000      Options.total_leveldb_mem: 0
2014/04/01-12:03:13.225463 b1873000  Options.block_cache_threshold: 33554432
2014/04/01-12:03:13.225467 b1873000  Options.limited_developer_mem: false
2014/04/01-12:03:13.225470 b1873000       Options.delete_threshold: 1000
2014/04/01-12:03:13.225473 b1873000       Options.fadvise_willneed: false
2014/04/01-12:03:13.225477 b1873000                         crc32c: hardware
2014/04/01-12:03:13.225480 b1873000                File cache size: 280475648
2014/04/01-12:03:13.225484 b1873000               Block cache size: 282572800
2014/04/01-12:03:13.228286 b1873000 Delete type=3 #1
2014/04/01-12:03:13.228381 b1873000 UpdatePenalty: 0
2014/04/01-12:03:13.256861 b1873000 UpdatePenalty: 0
2014/04/01-12:03:13.256888 b3abc000 Level-0 table #5: started
2014/04/01-12:03:13.293151 b1873000 waiting 2...
2014/04/01-12:03:13.293172 b1873000 UpdatePenalty: 0
2014/04/01-12:03:13.357334 b3abc000 Level-0 table #5: 16013598 bytes, 469 keys OK
2014/04/01-12:03:13.385372 b3abc000 Delete type=0 #3
2014/04/01-12:03:13.385515 b3abc000 UpdatePenalty: 0
2014/04/01-12:03:13.385562 b1873000 running 2...
2014/04/01-12:03:13.385825 b1873000 UpdatePenalty: 0
2014/04/01-12:03:13.385888 b3abc000 Level-0 table #7: started
2014/04/01-12:03:13.421676 b1873000 waiting 2...
2014/04/01-12:03:13.421701 b1873000 UpdatePenalty: 0
2014/04/01-12:03:13.524520 b3abc000 Level-0 table #7: 16013598 bytes, 469 keys OK
2014/04/01-12:03:13.526535 b3abc000 Delete type=0 #4
2014/04/01-12:03:13.526658 b3abc000 UpdatePenalty: 0
2014/04/01-12:03:13.526731 b1873000 running 2...
2014/04/01-12:03:13.527142 b1873000 UpdatePenalty: 0
2014/04/01-12:03:13.527162 b3abc000 Level-0 table #11: started
2014/04/01-12:03:13.556371 b1873000 waiting 2...
2014/04/01-12:03:13.556392 b1873000 UpdatePenalty: 0
2014/04/01-12:03:13.612630 b3abc000 Level-0 table #11: 16080534 bytes, 377 keys OK
2014/04/01-12:03:13.612948 b3abc000 Delete type=0 #6
2014/04/01-12:03:13.613117 b3abc000 UpdatePenalty: 0
2014/04/01-12:03:13.613141 b1873000 running 2...
2014/04/01-12:03:13.613274 b1873000 UpdatePenalty: 0
2014/04/01-12:03:13.613290 b3abc000 Level-0 table #13: started
2014/04/01-12:03:13.642876 b1873000 waiting 2...
2014/04/01-12:03:13.642898 b1873000 UpdatePenalty: 0
2014/04/01-12:03:13.710699 b3abc000 Level-0 table #13: 16077771 bytes, 142 keys IO error: /tmp/generic.qc.1396.321393.224161/sst_0/000013.sst: Input/output error
2014/04/01-12:03:13.710725 b3abc000 Waiting after background imm compaction error: IO error: /tmp/generic.qc.1396.321393.224161/sst_0/000013.sst: Input/output error
2014/04/01-12:03:13.710759 b1873000 running 2...
2014/04/01-12:03:13.710774 b1873000 waiting 2...
2014/04/01-12:03:13.710781 b1873000 UpdatePenalty: 0
2014/04/01-12:03:14.711505 b3abc000 UpdatePenalty: 0
2014/04/01-12:03:14.711628 b1873000 running 2...
2014/04/01-12:03:14.711643 b1873000 waiting 2...
2014/04/01-12:03:14.711650 b1873000 UpdatePenalty: 0
slfritchie commented 10 years ago

Drat, I forgot to mention that the comment immediately above also has your if patch suggestion.

diff --git a/db/db_impl.cc b/db/db_impl.cc
index a67b1b3..23c9ffc 100644
--- a/db/db_impl.cc
+++ b/db/db_impl.cc
@@ -1833,14 +1833,14 @@ Status DBImpl::MakeRoomForWrite(bool force) {
       Log(options_.info_log, "waiting 2...\n");
       gPerfCounters->Inc(ePerfWriteWaitImm);
       MaybeScheduleCompaction();
-      bg_cv_.Wait();
+if (!shutting_down_.Acquire_Load())      bg_cv_.Wait();
       Log(options_.info_log, "running 2...\n");
     } else if (versions_->NumLevelFiles(0) >= config::kL0_StopWritesTrigger) {
       // There are too many level-0 files.
       Log(options_.info_log, "waiting...\n");
       gPerfCounters->Inc(ePerfWriteWaitLevel0);
       MaybeScheduleCompaction();
-      bg_cv_.Wait();
+if (!shutting_down_.Acquire_Load())      bg_cv_.Wait();
       Log(options_.info_log, "running...\n");
     } else {
       // Attempt to switch to a new memtable and trigger compaction of old
slfritchie commented 10 years ago

If I inject faults only into pread(2) at a 7%, I can get the same hang inside leveldb::DBImpl::MakeRoomForWrite. Ditto for only ftruncate(2) at 25% random rate. Using 50% rate for rename(2) and unlink(2) and unlinkat(2) seem to have no effect (or waiting 5 minutes isn't long enough).

slfritchie commented 10 years ago

Ditto for only open(2) at 10% random rate with errno=ENOSPC.

Failing access(2) and stat(2) and flock(2) and fcntl(2) at 50% appears to have no effect.

slfritchie commented 10 years ago

To recreate on OS X:

mkdir /tmp/e
cd /tmp/e
git clone git@github.com:basho/eleveldb.git
cd eleveldb
git checkout -f 850668ffce63e96d3ee1e4350c1a994711611fae
make ; rebar skip_deps=true eunit suites=NONE ; erlc -o deps/faulterl/ebin -I deps/faulterl/include priv/scenario/*erl ; deps/faulterl/ebin/make_intercept_c.escript trigger_commonpaths yo ; env `deps/faulterl/ebin/example_environment.sh $PWD/yo` erl -pz .eunit deps/*/ebin

Then:

eqc:quickcheck(eqc:testing_time(1, generic_qc_fsm:prop(false, false))). 
eqc:quickcheck(eqc:testing_time(3*60, generic_qc_fsm:prop(true, false))).

The string <f marks the start of a filler operation and > marks its completion. If you don't see a completion, and disk I/O rates have dropped to zero, you're stuck.

matthewvon commented 10 years ago

Did you wait a full 60 seconds to see if errors unblocked? Level db holds the failed compaction thread for60 seconds. That holds writes on conditions too.

slfritchie commented 10 years ago

Wow, that's a cool feature. Yes, I'm pretty certain that I've gone to lunch and hand the hang remain when I returned. I haven't yet learned fully to eat at Tokyo Speed(™), so I believe that I've exceeded the 60 second wow-that's-cool-latency-waiting-for-the-referee-to-ring-the-bell-for-the-next-round period.

slfritchie commented 10 years ago

Yup, more than 60 seconds.

2> time().
{20,12,21}

8:12:21 pm

3> eqc:quickcheck(eqc:testing_time(3*60, generic_qc_fsm:prop(true, false))).
.{.{<f (hang here......)
(press control-g)

User switch command
 --> s
 --> c

Eshell V5.10.3  (abort with ^G)
1> time().
{20,21,26}

8:21:26pm

matthewvon commented 10 years ago

leveldb branch mv-imm-retry

matthewvon commented 10 years ago

https://github.com/basho/leveldb/pull/129

slfritchie commented 10 years ago

+1, many thanks.