Crowndev / crown

Crown (formerly Crowncoin) reference implementation
MIT License
7 stars 6 forks source link

v0.0.0.69 MN hung in NodeMinter #195

Open walkjivefly opened 1 year ago

walkjivefly commented 1 year ago

MN hung in NodeMinter. It had been stuck there for 7 minutes when this log was collected.

crown@crown-testnet-01:~/.crown/testnet3$ tail -f debug.log
2022-11-10T09:13:20Z CBudgetManager::CheckAndRemove - pbudgetDraft->IsValid - strError: Cannot find masternode : CTxIn(COutPoint(80215524545a2a0b3f7465a7dabed37db5fc9a57a1909e87a5041d1a6e5931ea, 1), scriptSig=)
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_DUP OP_HASH160 fe5190b4bb9ecff0103f1ef35be560405f361034 OP_EQUALVERIFY OP_CHECKSIG vs OP_RETURN 13eec834696dbe23568aad4aa11007a5d06940e2b62bb48bcd04bc057f90040e
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_RETURN 13eec834696dbe23568aad4aa11007a5d06940e2b62bb48bcd04bc057f90040e vs OP_RETURN 13eec834696dbe23568aad4aa11007a5d06940e2b62bb48bcd04bc057f90040e
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_DUP OP_HASH160 cf654d65da5b488cae635dcac0832060669b0dd7 OP_EQUALVERIFY OP_CHECKSIG vs OP_RETURN 6feba6c65c56db0a8e1484353585e65082fed0cf51d602c8f9495ca5fa9036fc
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_RETURN 6feba6c65c56db0a8e1484353585e65082fed0cf51d602c8f9495ca5fa9036fc vs OP_RETURN 6feba6c65c56db0a8e1484353585e65082fed0cf51d602c8f9495ca5fa9036fc
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_DUP OP_HASH160 cf654d65da5b488cae635dcac0832060669b0dd7 OP_EQUALVERIFY OP_CHECKSIG vs OP_RETURN 8058bf13702fee4605918a809bfd8d9698217cbcd63f9bacf885aa98681166be
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_RETURN 6feba6c65c56db0a8e1484353585e65082fed0cf51d602c8f9495ca5fa9036fc vs OP_RETURN 8058bf13702fee4605918a809bfd8d9698217cbcd63f9bacf885aa98681166be
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_DUP OP_HASH160 d0acb48682ec3966a40f637594ac35c360f0a93f OP_EQUALVERIFY OP_CHECKSIG vs OP_RETURN 8a20957c1b6b342ea1e783cb71806f5f17f0f33c2192e7e1b531398f1729eae7
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_RETURN 8a20957c1b6b342ea1e783cb71806f5f17f0f33c2192e7e1b531398f1729eae7 vs OP_RETURN 8a20957c1b6b342ea1e783cb71806f5f17f0f33c2192e7e1b531398f1729eae7
2022-11-10T09:14:14Z NodeMinter: Attempting to stake..
^C
crown@crown-testnet-01:~/.crown/testnet3$ pidof crownd
140589
crown@crown-testnet-01:~/.crown/testnet3$ sudo gdb /usr/local/bin/crownd 140589
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/bin/crownd...
(No debugging symbols found in /usr/local/bin/crownd)
Attaching to program: /usr/local/bin/crownd, process 140589
[New LWP 140590]
[New LWP 140591]
[New LWP 140592]
[New LWP 140593]
[New LWP 140594]
[New LWP 140595]
[New LWP 140596]
[New LWP 140597]
[New LWP 140600]
[New LWP 140601]
[New LWP 140603]
[New LWP 140604]
[New LWP 140605]
[New LWP 140606]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f94332f5868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7ffd66fc3d50, rem=0x7ffd66fc3d50) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78  ../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) info threads
  Id   Target Id                                           Frame 
* 1    Thread 0x7f943320b780 (LWP 140589) "crownd"         0x00007f94332f5868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, 
    req=0x7ffd66fc3d50, rem=0x7ffd66fc3d50)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  2    Thread 0x7f9430dd7640 (LWP 140590) "b-scriptch.0"   __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x5578b80501b4 <scriptcheckqueue+84>)
    at ./nptl/futex-internal.c:57
  3    Thread 0x7f94305d6640 (LWP 140591) "b-crown-minter" futex_wait (
    private=0, expected=2, futex_word=0x5578b80514f0 <budget+48>)
    at ../sysdeps/nptl/futex-internal.h:146
  4    Thread 0x7f942fdd5640 (LWP 140592) "b-http"         0x00007f9433335fde in epoll_wait (epfd=5, events=0x5578b8a066a0, maxevents=32, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  5    Thread 0x7f942f5d4640 (LWP 140593) "b-httpworker.0" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x5578b89816c0) at ./nptl/futex-internal.c:57
  6    Thread 0x7f942edd3640 (LWP 140594) "b-httpworker.1" futex_wait (
    private=0, expected=2, futex_word=0x5578b8044300 <cs_main>)
    at ../sysdeps/nptl/futex-internal.h:146
  7    Thread 0x7f942e5d2640 (LWP 140595) "b-httpworker.2" futex_wait (
    private=0, expected=2, futex_word=0x5578b8044300 <cs_main>)
    at ../sysdeps/nptl/futex-internal.h:146
  8    Thread 0x7f942ddd1640 (LWP 140596) "b-httpworker.3" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x5578b89816c4) at ./nptl/futex-internal.c:57
  9    Thread 0x7f942d459640 (LWP 140597) "crownd"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x5578b8057238 <leveldb::Env::Default()::env_container+88>)
    at ./nptl/futex-internal.c:57
  10   Thread 0x7f9408f9b640 (LWP 140600) "b-torcontrol"   0x00007f9433335fde in epoll_wait (epfd=35, events=0x5578bb47b040, maxevents=32, timeout=656840)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  11   Thread 0x7f9403fff640 (LWP 140601) "b-net"          futex_wait (
    private=0, expected=2, futex_word=0x5578b8044300 <cs_main>)
    at ../sysdeps/nptl/futex-internal.h:146
  12   Thread 0x7f9402ffd640 (LWP 140603) "b-addcon"       futex_wait (
    private=0, expected=2, futex_word=0x5578b8044300 <cs_main>)
    at ../sysdeps/nptl/futex-internal.h:146
  13   Thread 0x7f94027fc640 (LWP 140604) "b-mncon"        __futex_abstimed_wait_common64 (private=-1207605512, cancel=true, abstime=0x7f94027fbb80, op=137, 
    expected=0, futex_word=0x5578b8adcd80) at ./nptl/futex-internal.c:57
  14   Thread 0x7f9401ffb640 (LWP 140605) "b-opencon"      __futex_abstimed_wait_common64 (private=1797285236, cancel=true, abstime=0x7f9401ffa570, op=137, 
    expected=0, futex_word=0x5578b8adcd80) at ./nptl/futex-internal.c:57
  15   Thread 0x7f94017fa640 (LWP 140606) "b-msghand"      0x00007f94332a80d1 in lll_mutex_lock_optimized (mutex=0x5578b8048c80 <mnodeman>)
    at ./nptl/pthread_mutex_lock.c:48
(gdb) 

This appears to be same symptoms, different root cause as #139 because there is no sign in the log for that issue (https://github.com/Crowndev/crown/files/9946603/issue168or139.log) of the node attempting to mint before hitting the timeouts. 1201.zip

walkjivefly commented 1 year ago

After a little more digging, one of the instances reported in https://github.com/Crowndev/crown/issues/139#issuecomment-1290178062 also followed a NodeMinter occurrence but I didn't save/upload that full debug log.

The node in this report is still stuck in NodeMinter after 30 minutes and has not said anything else which is different to the one mentioned just above.

walkjivefly commented 1 year ago

After the node had been hung for over an hour I collected a coredump and killed it. The dump may be missing vital information since

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f94332f5868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7ffd66fc3d50, rem=0x7ffd66fc3d50) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78  ../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) gcore
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.
Saved corefile core.140589
(gdb) quit
A debugging session is active.

    Inferior 1 [process 140589] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/local/bin/crownd, process 140589
[Inferior 1 (process 140589) detached]

but here it is issue195.core.zip

walkjivefly commented 1 year ago

The other node I restarted this morning has done the same thing. debug log there ends with

2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_DUP OP_HASH160 d0acb48682ec3966a40f637594ac35c360f0a93f OP_EQUALVERIFY OP_CHECKSIG vs OP_RETURN 8a20957c1b6b342ea1e783cb71806f5f17f0f33c2192e7e1b531398f1729eae7
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_RETURN 8a20957c1b6b342ea1e783cb71806f5f17f0f33c2192e7e1b531398f1729eae7 vs OP_RETURN 8a20957c1b6b342ea1e783cb71806f5f17f0f33c2192e7e1b531398f1729eae7
2022-11-10T09:14:12Z NodeMinter: Attempting to stake..

and gdb says

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f8e8d307868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7ffee85d4d20, rem=0x7ffee85d4d20) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78  ../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) info threads
  Id   Target Id                                           Frame 
* 1    Thread 0x7f8e8d21d780 (LWP 150153) "crownd"         0x00007f8e8d307868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, 
    req=0x7ffee85d4d20, rem=0x7ffee85d4d20)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  2    Thread 0x7f8e8ade9640 (LWP 150154) "b-scriptch.0"   __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
    futex_word=0x55fe07b6f1b0 <scriptcheckqueue+80>)
    at ./nptl/futex-internal.c:57
  3    Thread 0x7f8e8a5e8640 (LWP 150155) "b-crown-minter" futex_wait (
    private=0, expected=2, futex_word=0x55fe07b704f0 <budget+48>)
    at ../sysdeps/nptl/futex-internal.h:146
  4    Thread 0x7f8e89de7640 (LWP 150156) "b-http"         0x00007f8e8d347fde in epoll_wait (epfd=5, events=0x55fe08b66420, maxevents=32, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  5    Thread 0x7f8e895e6640 (LWP 150157) "b-httpworker.0" futex_wait (
    private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
    at ../sysdeps/nptl/futex-internal.h:146
  6    Thread 0x7f8e88de5640 (LWP 150158) "b-httpworker.1" futex_wait (
    private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
    at ../sysdeps/nptl/futex-internal.h:146
  7    Thread 0x7f8e7bfff640 (LWP 150159) "b-httpworker.2" futex_wait (
    private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
--Type <RET> for more, q to quit, c to continue without paging--
    at ../sysdeps/nptl/futex-internal.h:146
  8    Thread 0x7f8e7b7fe640 (LWP 150160) "b-httpworker.3" futex_wait (
    private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
    at ../sysdeps/nptl/futex-internal.h:146
  9    Thread 0x7f8e6ade9640 (LWP 150167) "b-torcontrol"   0x00007f8e8d347fde in epoll_wait (epfd=31, events=0x55fe0b3390f0, maxevents=32, timeout=2100000)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  10   Thread 0x7f8e6a5e8640 (LWP 150168) "b-net"          futex_wait (
    private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
    at ../sysdeps/nptl/futex-internal.h:146
  11   Thread 0x7f8e695e6640 (LWP 150170) "b-addcon"       futex_wait (
    private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
    at ../sysdeps/nptl/futex-internal.h:146
  12   Thread 0x7f8e68de5640 (LWP 150171) "b-mncon"        __futex_abstimed_wait_common64 (private=129455864, cancel=true, abstime=0x7f8e68de4b80, op=137, 
    expected=0, futex_word=0x7f8e88560540) at ./nptl/futex-internal.c:57
  13   Thread 0x7f8e5ffff640 (LWP 150172) "b-opencon"      __futex_abstimed_wait_common64 (private=1797285236, cancel=true, abstime=0x7f8e5fffe570, op=137, 
    expected=0, futex_word=0x7f8e88560540) at ./nptl/futex-internal.c:57
  14   Thread 0x7f8e5f7fe640 (LWP 150173) "b-msghand"      ___pthread_mutex_lock (mutex=<optimized out>) at ./nptl/pthread_mutex_lock.c:131
  15   Thread 0x7f8e5effd640 (LWP 150174) "b-loadblk"      __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, 
--Type <RET> for more, q to quit, c to continue without paging--
    futex_word=0x55fe07b7623c <leveldb::Env::Default()::env_container+92>)
    at ./nptl/futex-internal.c:57
(gdb) 
(gdb) gcore
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.
Saved corefile core.150153
(gdb) 

Here's the debug.log and core issue195node2.zip

walkjivefly commented 1 year ago

Another instance debug.log and core here issue195-20221110.zip

walkjivefly commented 1 year ago

Had another one but the core is too large to upload. Will get it into nextcloud asap. here

walkjivefly commented 1 year ago

Had 2 more occurrences since this morning on 2 different servers (same host).

walkjivefly commented 1 year ago

And another one.

walkjivefly commented 1 year ago

The situation seems to have improved a lot with v0.0.0.70 and v0.0.0.72; I've had 3 nodes stay up for almost 70 hours. Unfortunately, the 4th hung in NodeMinter and 1201s timed out its peers about 4 hours ago.

gdb process state looks the same as before. debug log and coredump here: issue195-20221114.zip

walkjivefly commented 1 year ago

Had 2 more instances yesterday on different nodes.

walkjivefly commented 1 year ago

2 more instances overnight on different nodes.

walkjivefly commented 1 year ago

Had 1 more instance yesterday (in 0.0.0.77).

BlockMechanic commented 1 year ago

This one has always been a priority, will take apart the entire process and try to fix.