AntelopeIO / leap

C++ implementation of the Antelope protocol
Other
116 stars 68 forks source link

Crash before integrity hash on stop #1872

Open ericpassmore opened 1 year ago

ericpassmore commented 1 year ago

Nodoes terminates before integrity hash is reported. The command line reports segmentation fault(core dump) but I'm unable to find the core dump on the host.

Running nodeos 5.0.0-rc2 , the config will sync with peers and terminate at end block. Snapshot starts at block num 323784127 and end block is 323956925. sync-config.ini may be found here. Snapshot is v6 from eos nation.

nodeos   --snapshot "${NODEOS_DIR}"/snapshot/snapshot.bin  \
--data-dir "${NODEOS_DIR}"/data/      \
--config "${CONFIG_DIR}"/sync-config.ini      \
--terminate-at-block ${END_BLOCK}      \
--integrity-hash-on-start      \
--integrity-hash-on-stop

Last lines of the log file

info  2023-11-07T06:02:05.463 net-0     net_plugin.cpp:2253           recv_handshake       ] ["eosn-eos-seed171:9876 - 47a4e43" - 1 209.249.216.152:9876] handshake lib 340293504, head 340293840, head id 63c09107414eba99.. sync 1, head 323956819, lib 323956493
info  2023-11-07T06:02:05.463 net-0     net_plugin.cpp:1572           operator()           ] ["eosn-eos-seed171:9876 - 47a4e43" - 1 209.249.216.152:9876] Sending handshake generation 1430, lib 323956493, head 323956819, id a3dcd867d009105e
info  2023-11-07T06:02:07.667 nodeos    producer_plugin.cpp:1754      start_block          ] Reached configured maximum block 323956925; terminating
error 2023-11-07T06:02:07.667 nodeos    producer_plugin.cpp:2459      schedule_production_ ] Failed to start a pending block, will try again later
info  2023-11-07T06:02:07.667 net-0     net_plugin.cpp:1453           _close               ] ["peer1.eosphere.io:9876 - 48fdafa" - 3 198.50.156.24:9876] closing
info  2023-11-07T06:02:07.676 nodeos    resource_monitor_plugi:140    plugin_shutdown      ] shutdown...
info  2023-11-07T06:02:07.676 nodeos    resource_monitor_plugi:147    plugin_shutdown      ] exit shutdown
info  2023-11-07T06:02:07.676 nodeos    net_plugin.cpp:4390           plugin_shutdown      ] shutdown..
info  2023-11-07T06:02:07.676 nodeos    net_plugin.cpp:4578           close_all            ] close all 4 connections
info  2023-11-07T06:02:07.677 nodeos    net_plugin.cpp:4394           plugin_shutdown      ] exit shutdown
info  2023-11-07T06:02:07.677 nodeos    producer_plugin.cpp:1400      plugin_shutdown      ] exit shutdown
info  2023-11-07T06:02:07.677 nodeos    http_plugin.cpp:515           plugin_shutdown      ] exit shutdown

Next steps, going to run this with a much smaller sync span, and try with nodeos v4.0.4.

ericpassmore commented 1 year ago

No luck with other options. Testing with smaller sync span, and tested on 4.0.4. Still crashing without logging ending integrity hash.

heifner commented 1 year ago

I tried this with release/5.0 using your config.ini and it didn't segfault.

ericpassmore commented 1 year ago

Closing this issue. First I am unable to get a core dump. After changing ulimit its reporting segmentation fault, and no core dump. Second due to lack of root privileges I have unpacked nodoes locally by expanding the deb. Therefore this isn't a standard install.

ericpassmore commented 11 months ago

reopening to continue investigation, will attach a debugger.

ericpassmore commented 11 months ago

Reliably getting errors with 5.0.0-rc2. Attached gdb and included a stack trace from the treads at the time of nodeos crash.

Leap 5.0.0-rc2 Stack-Trce-Stop-Hash-Failure.txt Leap 5.0.0-rc2 Nodes-Log-Stop-Hash-Failure.log

For Comparison here is the nodeos log lines without the --integrity-hash-on-stop option, and a successful exit. Leap 5.0.0-rc2 OK-Log-Without-Stop-Hash-Option.log

ericpassmore commented 11 months ago

Stack trace from Leap 5.0.0-rc3 slightly different Leap 5.0.0-rc3 Stack-Trce-Stop-Hash-Failure.txt

ericpassmore commented 11 months ago

Stack trace from Leap 4.0.5. Nodeos-4.0.5-Stack-Trce-Stop-Hash-Failure.txt

heifner commented 11 months ago

clear_expired_input_transactions is calling a method on controller, but controller is being destroyed at this point. Seems like we should move the calculate_integrity_hash() out of the ~controller_impl() destructor and into chain_plugin shutdown.

greg7mdp commented 11 months ago

clear_expired_input_transactions is calling a method on controller, but controller is being destroyed at this point. Seems like we should move the calculate_integrity_hash() out of the ~controller_impl() destructor and into chain_plugin shutdown.

Or maybe we add a bool parameter to add_to_snapshot(.., at_shutdown = false) so that when called at shutdown we don't call clear_expired_input_transactions?

It is kinda weird that the fact this log statement: ilog( "chain database stopped with hash: ${hash}", ("hash", calculate_integrity_hash()) ); in ~controller_impl() has the side effect of writing to the snapshot and clearing expired transactions.

heifner commented 11 months ago

clear_expired_input_transactions is required.

Yeah, if you enable warn or error level logging then the integrity hash generation on start/stop doesn't work. We should fix that also.

bhazzard commented 11 months ago

Summary: The issue comes into play when outputting the integrity hash on start and stop...

--integrity-hash-on-start   
--integrity-hash-on-stop

It needs to be fixed in affected versions, and the feature was added in 3.2.

Fix in 3.2.6, 4.0.6, and 5.0.x, and main/6

High priority for 5.0 because it would block chicken dance.

It should be fixed in 3.2.x and 4.0.x, but won't on its own necessitate a release.

ericpassmore commented 11 months ago

Chicken dance is not blocked. There is a work around already in place.

spoonincode commented 9 months ago

fwiw so far I have been unable to reproduce the crash nor can I get any tooling to report an invalid access. That's somewhat a problem in creating a test case to show the problem is fixed. Wonder what I'm missing.

clear_expired_input_transactions is calling a method on controller, but controller is being destroyed at this point.

controller or controller_impl? The only actions that have been taken on controller_impl at that time are, https://github.com/AntelopeIO/leap/blob/f7d7d2ff188cde76d13e1af2d02c7027706956cc/libraries/chain/controller.cpp#L767-L773 and pending.reset(); is pretty much a noop since it was just done anyways, https://github.com/AntelopeIO/leap/blob/f7d7d2ff188cde76d13e1af2d02c7027706956cc/libraries/chain/controller.cpp#L2769-L2770 I'm not immediately seeing anything in clear_expired_input_transactions() that accesses something now invalid? All the members of controller_impl are still intact other then pending which is well handled as part of is_building_block().

Yeah, if you enable warn or error level logging then the integrity hash generation on start/stop doesn't work. We should fix that also.

What are you suggesting? There isn't really anything to do other then log it.

spoonincode commented 9 months ago

What are you suggesting? There isn't really anything to do other then log it.

Maybe it gets its own non-default logger?

heifner commented 9 months ago

self in controller_impl::clear_expired_input_transactions is controller& maybe that is not valid at this point.

What are you suggesting? There isn't really anything to do other then log it.

Maybe it gets its own non-default logger?

I mean we need this:

      if(okay_to_print_integrity_hash_on_stop && conf.integrity_hash_on_stop) {
         auto hash = calculate_integrity_hash();
         fc::logger::get(DEFAULT_LOGGER)).log( FC_LOG_MESSAGE( info, "chain database stopped with hash: ${hash}", ("hash", hash) ) );
      }

Otherwise if a user has default logger configured for warn or error they will not get output even if they ask for --integrity-hash-on-stop.

ericpassmore commented 9 months ago

Note:start group: STABILITY category: BUG summary: Fixes bug and prints final hash when specifying the option --integrity-hash-on-stop. Previously a race condition in the shutdown logic prevented the hash from being printed. Note:end