Closed: christianbundy closed this issue 2 years ago.
I spent a few hours and compiled Node.js with debug symbols, and now I have a backtrace!
Program terminated with signal SIGSEGV, Segmentation fault.
#0 Database::Del (this=0x55e4a353f410, options=..., key=...) at ../binding.cc:399
399 return db_->Delete(options, key);
[Current thread is 1 (Thread 0x7ff265717700 (LWP 57861))]
(gdb) backtrace
#0 Database::Del (this=0x55e4a353f410, options=..., key=...) at ../binding.cc:399
#1 0x00007ff266f87512 in DelWorker::DoExecute (this=0x55e4a3550dc0) at ../binding.cc:1012
#2 0x00007ff266f85134 in BaseWorker::Execute (env=0x55e4a34b4990, data=0x55e4a3550dc0) at ../binding.cc:291
#3 0x000055e49e77fbeb in (anonymous namespace)::uvimpl::Work::DoThreadPoolWork (this=0x55e4a3544bd0) at ../src/node_api.cc:851
#4 0x000055e49e782d41 in node::ThreadPoolWork::ScheduleWork()::{lambda(uv_work_s*)#1}::operator()(uv_work_s*) const (__closure=0x0, req=0x55e4a3544c08) at ../src/threadpoolwork-inl.h:39
#5 0x000055e49e782d75 in node::ThreadPoolWork::ScheduleWork()::{lambda(uv_work_s*)#1}::_FUN(uv_work_s*) () at ../src/threadpoolwork-inl.h:40
#6 0x000055e49f77273b in uv__queue_work (w=0x55e4a3544c60) at ../deps/uv/src/threadpool.c:321
#7 0x000055e49f771e8a in worker (arg=0x0) at ../deps/uv/src/threadpool.c:122
#8 0x00007ff26e367422 in start_thread () from /usr/lib/libpthread.so.0
#9 0x00007ff26e296bf3 in clone () from /usr/lib/libc.so.6
My gdb knowledge is weak, so this is just the output of backtrace; please let me know if there's anything else I should try!
Weird, we were apparently calling db.close.bind(db)() instead of db.close()?! I'm having trouble reproducing the segfault with a minimal test case, but I can verify that removing the .bind() fixes it. :thinking:
Nope, still broken.
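That matches expectations: Function.prototype.bind shouldn't change behavior here, since binding a method to its own object gives the same `this` as a direct call. A minimal sketch (using a hypothetical Db class, not leveldown itself) illustrating why removing .bind() alone wouldn't be expected to fix anything:

```javascript
// Hypothetical Db class, only to show that fn.bind(obj)() invokes fn with
// the same `this` as obj.fn(), so the .bind() form is behaviorally identical.
class Db {
  constructor () { this.status = 'open' }
  close () { this.status = 'closed'; return this.status }
}

const a = new Db()
a.close()           // direct call

const b = new Db()
b.close.bind(b)()   // bound call, same receiver

console.log(a.status, b.status) // closed closed
```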
One of my problems is that I'm running a bunch of different instances of Leveldown. Is there some way I can add the location of the database to each printf? Here's what I have for debugging now:
diff --git a/binding.cc b/binding.cc
index e938a40..2de67d0 100644
--- a/binding.cc
+++ b/binding.cc
@@ -370,6 +370,8 @@ struct Database {
leveldb::Status Open (const leveldb::Options& options,
const char* location) {
+ printf("Level(%s)->Open\n", location);
+
return leveldb::DB::Open(options, location, &db_);
}
@@ -385,17 +387,20 @@ struct Database {
leveldb::Status Put (const leveldb::WriteOptions& options,
leveldb::Slice key,
leveldb::Slice value) {
+ printf("Level->Put\n");
return db_->Put(options, key, value);
}
leveldb::Status Get (const leveldb::ReadOptions& options,
leveldb::Slice key,
std::string& value) {
+ printf("Level->Get\n");
return db_->Get(options, key, &value);
}
leveldb::Status Del (const leveldb::WriteOptions& options,
leveldb::Slice key) {
+ printf("Level->Delete\n");
return db_->Delete(options, key);
}
@@ -840,6 +845,7 @@ struct CloseWorker final : public BaseWorker {
~CloseWorker () {}
void DoExecute () override {
+ printf("Level->Close\n");
database_->CloseDatabase();
}
};
diff --git a/deps/leveldb/leveldb-1.20/Makefile b/deps/leveldb/leveldb-1.20/Makefile
index f7cc7d7..a76bb75 100755
--- a/deps/leveldb/leveldb-1.20/Makefile
+++ b/deps/leveldb/leveldb-1.20/Makefile
@@ -7,9 +7,9 @@
# to switch between compilation modes.
# (A) Production use (optimized mode)
-OPT ?= -O2 -DNDEBUG
+# OPT ?= -O2 -DNDEBUG
# (B) Debug mode, w/ full line-level debugging symbols
-# OPT ?= -g2
+OPT ?= -g2
# (C) Profiling mode: opt, but w/debugging symbols
# OPT ?= -O2 -g2 -DNDEBUG
#-----------------------------------------------
The output:
Level->Get
Level->Close
Level->Delete
Level->Close
Level->Close
Level->Close
Level->Close
Level->Close
Level->Close
Level->Delete
Level->Close
Level->Delete
Unfortunately since I have a bunch of instances of Leveldown, I can't tell which one is misbehaving.
Maybe do it on the JS side, here and/or in _batch(). Along the lines of:
if (this.status !== 'open') throw new Error(`Delete after close in ${this.location}`)
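That guard can be sketched as follows, using a stand-in object rather than the real LevelDOWN prototype; `status` and `location` mirror the public properties discussed in this thread, and `_del` here is only a mock:

```javascript
// Stand-in for a LevelDOWN-like instance (not the real implementation).
const db = {
  status: 'open',
  location: '/tmp/example-db',
  _del (key) {
    // The suggested guard: refuse operations once the db is no longer open.
    if (this.status !== 'open') {
      throw new Error(`Delete after close in ${this.location}`)
    }
    return true // a real implementation would delete the key here
  }
}

console.log(db._del('a')) // true while open
db.status = 'closed'
try {
  db._del('b')
} catch (err) {
  console.log(err.message) // Delete after close in /tmp/example-db
}
```

This turns a native-side crash into a catchable JS error that names the offending database.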
Good idea! I've added some console.log() statements in leveldown.js too; here's the list of operations and then the error:
open
get
del
del
del
del
del
del
del
del
del
del
close
del
/home/christianbundy/src/leveldown/leveldown.js:62
throw new Error(`Delete after close in ${this.location}`)
^
Error: Delete after close in /tmp/ssb-test-1594655235568-919/flume/query
at LevelDOWN._del (/home/christianbundy/src/leveldown/leveldown.js:62:11)
at /home/christianbundy/src/leveldown/node_modules/abstract-leveldown/abstract-leveldown.js:233:12
at /home/christianbundy/src/leveldown/node_modules/abstract-leveldown/abstract-iterator.js:33:14
at processTicksAndRejections (internal/process/task_queues.js:81:21)
.status is a public property, so as a temporary workaround you could add that check to the responsible flume module.
Thanks! Would that work since we're using clear() rather than del()? My understanding is that we're calling clear() while the database is open and then closing it before the iterator finishes.
That will work for now, because clear() is not optimized yet: it all happens in JS land and uses _del() under the hood.
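The suspected race can be sketched with a toy model (this is not leveldown's actual clear() code): a JS-land clear() built on an async iterator schedules each _del() on a later microtask, so a close() issued mid-iteration lands before the remaining deletes run.

```javascript
// Toy model of the race: clear() iterates asynchronously, close() runs
// synchronously in between, and the next _del() fires after close.
const db = {
  status: 'open',
  keys: ['a', 'b', 'c'],
  _del (key) {
    if (this.status !== 'open') throw new Error('Delete after close')
  },
  async clear () {
    for (const key of this.keys) {
      await Promise.resolve() // simulates waiting on the iterator's next()
      this._del(key)
    }
  },
  close () { this.status = 'closed' }
}

db.clear().catch(err => console.log(err.message)) // Delete after close
db.close() // closes before the first awaited _del runs
```

In the real binding the late _del() reaches native code on a freed handle instead of throwing, which is consistent with the segfault in Database::Del.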
Disregard that, it's an incomplete answer that misses the point. I have no time to give a good answer, unfortunately.
No problem -- I've got a workaround that's 100% fine for my use-case, I just wanted to make sure I was being a responsible citizen and reporting the segfault upstream. Sorry I haven't been able to figure out how to do a minimal repro, I've thrown a few hours into it but my repro code never segfaults. :upside_down_face:
I'll probably take a few more stabs at a minimal repro, but otherwise I'll just leave it in the back of my head. Thanks for your help so far.
I'm also experiencing segmentation faults. I'm using the library in a Node server app (tried v14, v15, and v16 with no difference), in an Alpine (3.12) based Docker container running on 64-bit Ubuntu on a Raspberry Pi 4. The files it accesses are located in a mounted local directory. The segmentation fault seems to occur when the database is opened; my logs indicate that the fault occurs even before the first read! I tried creating a new empty folder: the folder structure gets created (LOG, LOCK, etc.), but then it fails again. No problems whatsoever on my Intel-based development machine...
Closing because this is fixed in classic-level: on any operation it checks whether the db is open and throws a JS error if not. In addition (though this was also already true for leveldown since 6.0.3), it delays closing the db while a clear() operation is in flight.
@reuzel I'm not sure about your case; it sounds like a different issue (platform-related). If you can reproduce it on classic-level, feel free to open a new issue there.
Hi all! Is this segfault coming from leveldown? Sorry for the quick message, just walking out the door. :runner: :dash:
Repro
Env
(Using the latest leveldown)
Our CI is showing this happening on Linux and macOS (not Windows), on Node.js 10, 12, and 14.
Core dump: