Grinnode-live / 2020-grin-bug-bash-challenge

Finding bugs in Grin-Wallet & Grin-nodes for a bounty prior to Grin fork v5.
3 stars 1 forks source link

rewind_single_block causes node to become unresponsive #21

Open bladedoyle opened 3 years ago

bladedoyle commented 3 years ago

Some causes of rewind_single_block were fixed before 5.0.beta2 but the issue remains.

Issue: The node enters a loop of "rewind_single_block" and becomes unresponsive to API and fails to sync new blocks.

Testcase: Start with an archive node Call the nodes "get_block" API to fetch each block from 0 to tip. Observe the node logs for message like 20201213 16:47:41.127 WARN grin_chain::txhashset::txhashset - rewind_single_block: 3 output_pos entries missing for: 000251721c7d at 781244

Expected Result: no api timeouts node should not fall behind in sync

Actual Results: Observe that the node gives api timeouts while that rewind event is happening Observe that immediately after the rewind event concludes the node reports being out of sync:

...
...
20201213 16:47:41.125 WARN grin_chain::txhashset::txhashset - rewind_single_block: 1 output_pos entries missing for: 00025c08d11f at 781245
20201213 16:47:41.127 WARN grin_chain::txhashset::txhashset - rewind_single_block: 3 output_pos entries missing for: 000251721c7d at 781244
20201213 16:47:45.636 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 1731290747276436, peer_difficulty 1731293920230257, threshold 1212787626 (last 5 blocks), enabling sync
20201213 16:47:48.554 INFO grin_servers::grin::sync::syncer - synchronized at 1731293920230257 @ 1000652 [000393625bde]

Some additional information

While the node is unresponsive:

1) its doing a lot of disk IO. From iotop:

Current DISK READ:      23.92 M/s | Current DISK WRITE:      42.91 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 288633 be/4 ubuntu     22.27 M/s    0.00 B/s  0.00 % 97.18 % grin [peer_read]

2) it reports dozens or hundreds of "no peers" messages

20201214 22:49:22.576 WARN grin_servers::grin::sync::syncer - sync: no peers available, disabling sync
20201214 22:49:52.942 WARN grin_servers::grin::sync::syncer - sync: no peers available, disabling sync
20201214 22:50:01.959 WARN grin_servers::grin::sync::syncer - sync: no peers available, disabling sync

3) The API accepts connections but does not reply after that:

urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=3413): Read timed out. (read timeout=20)

4) After 5 or 10 minutes there is a flood of "rewind_single_block" messages

20201214 22:59:44.788 WARN grin_chain::txhashset::txhashset - rewind_single_block: 1 output_pos entries missing for: 0002ac2a7bd7 at 986478
20201214 22:59:44.788 WARN grin_chain::txhashset::txhashset - rewind_single_block: 1 output_pos entries missing for: 0005580fd69e at 986477
20201214 22:59:44.809 WARN grin_chain::txhashset::txhashset - rewind_single_block: 1 output_pos entries missing for: 0005048edfd7 at 986474
...

It appears that this is the thread and stack trace for the event:

(gdb) thread 69
[Switching to thread 69 (Thread 0x7e9d675e6700 (LWP 288633))]
#0  0x00005648d2915664 in mdb_page_search_root ()
(gdb) where
#0  0x00005648d2915664 in mdb_page_search_root ()
#1  0x00005648d2915975 in mdb_page_search ()
#2  0x00005648d29160b2 in mdb_cursor_set ()
#3  0x00005648d29166e4 in mdb_get ()
#4  0x00005648d28d970b in grin_store::lmdb::Batch::get_ser ()
#5  0x00005648d28e9f0f in grin_chain::store::Batch::get_block_header ()
#6  0x00005648d28f89b5 in grin_chain::pipe::rewind_and_apply_fork ()
#7  0x00005648d28678c5 in grin_chain::txhashset::txhashset::extending_readonly ()
#8  0x00005648d28bd498 in grin_chain::chain::Chain::convert_block_v2 ()
#9  0x00005648d2462817 in <grin_servers::common::adapters::NetToChainAdapter<B,P,V> as grin_p2p::types::ChainAdapter>::get_block ()
#10 0x00005648d28595ab in <grin_p2p::peers::Peers as grin_p2p::types::ChainAdapter>::get_block ()
#11 0x00005648d280986e in <grin_p2p::peer::TrackingAdapter as grin_p2p::types::ChainAdapter>::get_block ()
#12 0x00005648d2837a15 in <grin_p2p::protocol::Protocol as grin_p2p::conn::MessageHandler>::consume ()
#13 0x00005648d282a726 in grin_p2p::conn::poll::{{closure}} ()
#14 0x00005648d2825026 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#15 0x00005648d280a446 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#16 0x00005648d2ac03aa in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
#17 <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4/library/alloc/src/boxed.rs:1042
#18 std::sys::unix::thread::Thread::new::thread_start () at library/std/src/sys/unix/thread.rs:87
#19 0x00007f1d6d679609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#20 0x00007f1d6d911293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)