Witness_node randomly stops syncing

Bug Description

Start a new witness_node instance and wait. Sometimes it hangs during syncing and is unable to sync to the latest block. Restarting sometimes helps.

See https://github.com/bitshares/bitshares-core/issues/2798#issuecomment-1855022095 for more info.

~~Unable to stably reproduce so far. Haven't found very interesting info in log files yet. Probably caused by memory corruption.~~

Impacts

Describe which portion(s) of BitShares Core may be impacted by this bug. Please tick at least one box.

[ ] API (the application programming interface)
[ ] Build (the build process or something prior to compiled code)
[ ] CLI (the command line wallet)
[ ] Deployment (the deployment process after building such as Docker, Travis, etc.)
[ ] DEX (the Decentralized EXchange, market engine, etc.)
[x] P2P (the peer-to-peer network for transaction/block propagation)
[x] Performance (system or user efficiency, etc.)
[ ] Protocol (the blockchain logic, consensus, validation, etc.)
[x] Security (the security of system or user data, etc.)
[x] UX (the User Experience)
[ ] Other (please add below)

Host Environment

Please provide details about the host environment. Much of this information can be found running: witness_node --version.

Host OS: Ubuntu (various versions)
Host Physical RAM: Sufficient
BitShares Version: 7.0.1 / test-7.0.3
OpenSSL Version: -
Boost Version: -

CORE TEAM TASK LIST

[ ] Evaluate / Prioritize Bug Report
[ ] Refine User Stories / Requirements
[ ] Define Test Cases
[ ] Design / Develop Solution
[ ] Perform QA/Testing
[ ] Update Documentation

Found the reason.

In PR https://github.com/bitshares/bitshares-core/pull/2791 for the last releases (7.0.1 and test-7.0.3), we updated read_write_handler and read_write_handler_with_buffer to throw canceled_exception when a boost::asio::error::operation_aborted error occurs.

The canceled_exception might then be caught in message_oriented_connection_impl::read_loop(): https://github.com/bitshares/bitshares-core/blob/0aa383154ce191d708f7d143b27662bc171ed239/libraries/net/message_oriented_connection.cpp#L215-L230

As a result, call_on_connection_closed is no longer set to true, so node::on_connection_closed() is no longer called as it was before, which means that certain cleanup steps are not performed.
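To make the control flow concrete, here is a minimal, self-contained sketch of the pattern described above. It does not use the real fc or graphene::net classes: the `sketch_*` exception types, `simulated_read`, and `read_loop_notifies_close` are hypothetical stand-ins, and the handler layout is only an assumption about what the linked read_loop() looks like (the real handler may also rethrow). It also assumes that, before the PR, an aborted read surfaced as a generic exception.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the fc exception hierarchy (not the real classes).
struct sketch_exception : std::runtime_error {
    explicit sketch_exception(const std::string& msg) : std::runtime_error(msg) {}
};
struct sketch_canceled_exception : sketch_exception {
    explicit sketch_canceled_exception(const std::string& msg) : sketch_exception(msg) {}
};
struct sketch_eof_exception : sketch_exception {
    explicit sketch_eof_exception(const std::string& msg) : sketch_exception(msg) {}
};

enum class read_failure { eof, other_error, operation_aborted };

// Simulates a failing socket read. When aborted_throws_canceled is true, an
// aborted operation is reported as a canceled exception (the behavior the
// issue attributes to the PR); otherwise it is reported as a generic error
// (an assumption about the earlier behavior).
void simulated_read(read_failure failure, bool aborted_throws_canceled) {
    switch (failure) {
    case read_failure::eof:
        throw sketch_eof_exception("eof");
    case read_failure::operation_aborted:
        if (aborted_throws_canceled)
            throw sketch_canceled_exception("operation aborted");
        throw sketch_exception("operation aborted");
    case read_failure::other_error:
        throw sketch_exception("io error");
    }
}

// Returns true if the close notification (standing in for
// node::on_connection_closed()) would have been issued.
bool read_loop_notifies_close(read_failure failure, bool aborted_throws_canceled) {
    bool call_on_connection_closed = false;
    try {
        simulated_read(failure, aborted_throws_canceled);
    } catch (const sketch_canceled_exception&) {
        // Treated as "this object is already being torn down":
        // the flag is deliberately left false in this branch.
    } catch (const sketch_eof_exception&) {
        call_on_connection_closed = true;
    } catch (const sketch_exception&) {
        call_on_connection_closed = true;
    }
    // In the real code the delegate would be notified here when the flag is set.
    return call_on_connection_closed;
}
```

With these stand-ins, `read_loop_notifies_close(read_failure::operation_aborted, false)` reports that the close notification fires, while the same call with `true` as the second argument reports that it is skipped, which mirrors the missing cleanup described in this comment.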
