fboucquez / symbol-bootstrap

A tool that allows you to quickly configure and setup Symbol testnets and nodes.
Apache License 2.0
47 stars 27 forks source link

Node crashes when broker can't connect to MondoDB #279

Open alex-core-0x opened 3 years ago

alex-core-0x commented 3 years ago

There's an issue with the Symbol node. If the broker can't connect to MongoDB (for some reason), then it crashes the node.

2021-07-15 07:30:27.362775 0x00007f609d2da400: <info> (plugins::PluginModule.cpp@83) loading plugin from /usr/catapult/lib/libcatapult.mongo.plugins.accountlink.so
2021-07-15 07:30:58.034710 0x00007f609d2da400: <error> (extensions::ProcessBootstrapper.cpp@103) Throwing exception: Throw location unknown (consider using BOOST_THROW_EXCEPTION)
Dynamic exception type: boost::wrapexcept<boost::exception_detail::current_exception_std_exception_wrapper<std::runtime_error> >
std::exception::what: No suitable servers found: `serverSelectionTimeoutMS` expired: [Failed to connect to target host: db:27017]: generic server error
mongocxx::v_noabi::bulk_write_exception
2021-07-15 07:30:58.034785 0x00007f609d2da400: <fatal> (process::ProcessMain.cpp@75)
thread: broker catapult
unhandled exception while running local node!
/catapult-src/src/catapult/extensions/ProcessBootstrapper.cpp(103): Throw in function void catapult::extensions::{anonymous}::LoadExtension(const catapult::plugins::PluginModule&, catapult::extensions::ProcessBootstrapper&)
Dynamic exception type: boost::wrapexcept<catapult::plugins::plugin_load_error>
std::exception::what: Throw location unknown (consider using BOOST_THROW_EXCEPTION)
Dynamic exception type: boost::wrapexcept<boost::exception_detail::current_exception_std_exception_wrapper<std::runtime_error> >
std::exception::what: No suitable servers found: `serverSelectionTimeoutMS` expired: [Failed to connect to target host: db:27017]: generic server error
mongocxx::v_noabi::bulk_write_exception

/symbol-commands/start.sh: line 126:    11 Aborted                 (core dumped) $catapultAppFolder/bin/$processName "$config"

The only way after that - is to make resetData. Broker and node try to recover, but it doesn't work for them.

fboucquez commented 3 years ago

@Jaguar0625 @gimer @Wayonb is this one catapult?

Wayonb commented 3 years ago

@alex-core-0x The broker requires MongoDB to be running first. Do you know why the MongoDB is not running? Do you have the MongoDB logs?

alex-core-0x commented 3 years ago

@alex-core-0x The broker requires MongoDB to be running first. Do you know why the MongoDB is not running? Do you have the MongoDB logs?

@Wayonb

The server lags hugely (for some reason) when nodes are trying to recover. The resources of CPU, RAM, disk are more than enough.

MongoDB lags (see the log below with slow query) -> node crashes while querying to it (seems like timeout) -> node is dead.

{"t":{"$date":"2021-07-20T02:53:08.803+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.finalizedBlocks","command":{"find":"finalizedBlocks","filter":{},"sort":{"block.height":-1},"projection":{"_id":0},"limit":1,"singleBatch":true,"batchSize":1,"returnKey":false,"showRecordId":false,"lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"IXSCAN { block.height: -1 }","keysExamined":1,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"queryHash":"E6B6F0C5","planCacheKey":"E6B6F0C5","reslen":238,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":152}}
{"t":{"$date":"2021-07-20T02:53:58.471+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.finalizedBlocks","command":{"find":"finalizedBlocks","filter":{},"sort":{"block.height":-1},"projection":{"_id":0},"limit":1,"singleBatch":true,"batchSize":1,"returnKey":false,"showRecordId":false,"lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"IXSCAN { block.height: -1 }","keysExamined":1,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"queryHash":"E6B6F0C5","planCacheKey":"E6B6F0C5","reslen":238,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":107}}
{"t":{"$date":"2021-07-20T03:25:04.664+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.blocks","command":{"count":"blocks","lsid":{"id":{"$uuid":"2fcc8a2e-1b89-43bd-ad2d-d9cca9e64cdf"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}
{"t":{"$date":"2021-07-20T03:25:04.664+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn23","msg":"Slow query","attr":{"type":"command","ns":"catapult.transactions","command":{"count":"transactions","lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}
{"t":{"$date":"2021-07-20T03:25:04.841+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn25","msg":"Slow query","attr":{"type":"command","ns":"catapult.accounts","command":{"count":"accounts","lsid":{"id":{"$uuid":"e8ae4459-9cda-45b2-8098-f7003ae1a43a"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}

Here are logs below, that list frequently.

2021-07-20 06:33:52.927090 0x00007f40e3fff700: (utils::StackLogger.h@77) slow operation detected: 'BlockChainSyncConsumer::commitAll' (5390ms)

Wayonb commented 3 years ago

@alex-core-0x Thanks for the information above. Its not enough information to determine whats going on. Is it possible to get full logs? The slow operation detected: should not cause the issue with the mongodb. What is your node resources setup(CPU, RAM)?
Do you know when the mongodb crashes? i.e. what going on the server at the time.