Open walkjivefly opened 1 year ago
After a little more digging: one of the instances reported in https://github.com/Crowndev/crown/issues/139#issuecomment-1290178062 also followed a NodeMinter occurrence, but I didn't save/upload that full debug log.
The node in this report is still stuck in NodeMinter after 30 minutes and has logged nothing further, unlike the one mentioned just above.
After the node had been hung for over an hour I collected a coredump and killed it. The dump may be missing vital information, given this gdb warning:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f94332f5868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7ffd66fc3d50, rem=0x7ffd66fc3d50) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78 ../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) gcore
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.
Saved corefile core.140589
(gdb) quit
A debugging session is active.
Inferior 1 [process 140589] will be detached.
Quit anyway? (y or n) y
Detaching from program: /usr/local/bin/crownd, process 140589
[Inferior 1 (process 140589) detached]
Here it is anyway: issue195.core.zip
The other node I restarted this morning has done the same thing. The debug log there ends with
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_DUP OP_HASH160 d0acb48682ec3966a40f637594ac35c360f0a93f OP_EQUALVERIFY OP_CHECKSIG vs OP_RETURN 8a20957c1b6b342ea1e783cb71806f5f17f0f33c2192e7e1b531398f1729eae7
2022-11-10T09:13:20Z IsBudgetCollateralValid : OP_RETURN 8a20957c1b6b342ea1e783cb71806f5f17f0f33c2192e7e1b531398f1729eae7 vs OP_RETURN 8a20957c1b6b342ea1e783cb71806f5f17f0f33c2192e7e1b531398f1729eae7
2022-11-10T09:14:12Z NodeMinter: Attempting to stake..
and gdb says
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f8e8d307868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7ffee85d4d20, rem=0x7ffee85d4d20) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
78 ../sysdeps/unix/sysv/linux/clock_nanosleep.c: No such file or directory.
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f8e8d21d780 (LWP 150153) "crownd" 0x00007f8e8d307868 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0,
req=0x7ffee85d4d20, rem=0x7ffee85d4d20)
at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
2 Thread 0x7f8e8ade9640 (LWP 150154) "b-scriptch.0" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x55fe07b6f1b0 <scriptcheckqueue+80>)
at ./nptl/futex-internal.c:57
3 Thread 0x7f8e8a5e8640 (LWP 150155) "b-crown-minter" futex_wait (
private=0, expected=2, futex_word=0x55fe07b704f0 <budget+48>)
at ../sysdeps/nptl/futex-internal.h:146
4 Thread 0x7f8e89de7640 (LWP 150156) "b-http" 0x00007f8e8d347fde in epoll_wait (epfd=5, events=0x55fe08b66420, maxevents=32, timeout=-1)
at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
5 Thread 0x7f8e895e6640 (LWP 150157) "b-httpworker.0" futex_wait (
private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
at ../sysdeps/nptl/futex-internal.h:146
6 Thread 0x7f8e88de5640 (LWP 150158) "b-httpworker.1" futex_wait (
private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
at ../sysdeps/nptl/futex-internal.h:146
7 Thread 0x7f8e7bfff640 (LWP 150159) "b-httpworker.2" futex_wait (
private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
at ../sysdeps/nptl/futex-internal.h:146
8 Thread 0x7f8e7b7fe640 (LWP 150160) "b-httpworker.3" futex_wait (
private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
at ../sysdeps/nptl/futex-internal.h:146
9 Thread 0x7f8e6ade9640 (LWP 150167) "b-torcontrol" 0x00007f8e8d347fde in epoll_wait (epfd=31, events=0x55fe0b3390f0, maxevents=32, timeout=2100000)
at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
10 Thread 0x7f8e6a5e8640 (LWP 150168) "b-net" futex_wait (
private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
at ../sysdeps/nptl/futex-internal.h:146
11 Thread 0x7f8e695e6640 (LWP 150170) "b-addcon" futex_wait (
private=0, expected=2, futex_word=0x55fe07b63300 <cs_main>)
at ../sysdeps/nptl/futex-internal.h:146
12 Thread 0x7f8e68de5640 (LWP 150171) "b-mncon" __futex_abstimed_wait_common64 (private=129455864, cancel=true, abstime=0x7f8e68de4b80, op=137,
expected=0, futex_word=0x7f8e88560540) at ./nptl/futex-internal.c:57
13 Thread 0x7f8e5ffff640 (LWP 150172) "b-opencon" __futex_abstimed_wait_common64 (private=1797285236, cancel=true, abstime=0x7f8e5fffe570, op=137,
expected=0, futex_word=0x7f8e88560540) at ./nptl/futex-internal.c:57
14 Thread 0x7f8e5f7fe640 (LWP 150173) "b-msghand" ___pthread_mutex_lock (mutex=<optimized out>) at ./nptl/pthread_mutex_lock.c:131
15 Thread 0x7f8e5effd640 (LWP 150174) "b-loadblk" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
futex_word=0x55fe07b7623c <leveldb::Env::Default()::env_container+92>)
at ./nptl/futex-internal.c:57
(gdb)
(gdb) gcore
warning: Memory read failed for corefile section, 4096 bytes at 0xffffffffff600000.
Saved corefile core.150153
(gdb)
Here are the debug.log and core: issue195node2.zip
Another instance's debug.log and core: issue195-20221110.zip
Had another one, but the core is too large to upload. Will get it into Nextcloud ASAP: here
Had 2 more occurrences since this morning on 2 different servers (same host).
And another one.
The situation seems to have improved a lot with v0.0.0.70 and v0.0.0.72; I've had 3 nodes stay up for almost 70 hours. Unfortunately, the 4th hung in NodeMinter and hit the 1201s peer timeouts about 4 hours ago.
gdb process state looks the same as before. debug log and coredump here: issue195-20221114.zip
Had 2 more instances yesterday on different nodes.
2 more instances overnight on different nodes.
Had 1 more instance yesterday (in 0.0.0.77).
This one has always been a priority; I will take the entire process apart and try to fix it.
MN hung in NodeMinter. It had been stuck there for 7 minutes when this log was collected.
This appears to have the same symptoms as #139 but a different root cause, because the log for that issue (https://github.com/Crowndev/crown/files/9946603/issue168or139.log) shows no sign of the node attempting to mint before hitting the timeouts. 1201.zip