Open alex-core-0x opened 3 years ago
@Jaguar0625 @gimer @Wayonb is this one catapult?
@alex-core-0x The broker requires MongoDB to be running first. Do you know why the MongoDB is not running? Do you have the MongoDB logs?
@alex-core-0x The broker requires MongoDB to be running first. Do you know why the MongoDB is not running? Do you have the MongoDB logs?
@Wayonb
The server lags hugely (for some reason) when nodes are trying to recover. The resources of CPU, RAM, disk are more than enough.
MongoDB lags (see the log below with slow query
) -> node crashes while querying to it (seems like timeout) -> node is dead.
{"t":{"$date":"2021-07-20T02:53:08.803+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.finalizedBlocks","command":{"find":"finalizedBlocks","filter":{},"sort":{"block.height":-1},"projection":{"_id":0},"limit":1,"singleBatch":true,"batchSize":1,"returnKey":false,"showRecordId":false,"lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"IXSCAN { block.height: -1 }","keysExamined":1,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"queryHash":"E6B6F0C5","planCacheKey":"E6B6F0C5","reslen":238,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":152}}
{"t":{"$date":"2021-07-20T02:53:58.471+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.finalizedBlocks","command":{"find":"finalizedBlocks","filter":{},"sort":{"block.height":-1},"projection":{"_id":0},"limit":1,"singleBatch":true,"batchSize":1,"returnKey":false,"showRecordId":false,"lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"IXSCAN { block.height: -1 }","keysExamined":1,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"queryHash":"E6B6F0C5","planCacheKey":"E6B6F0C5","reslen":238,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":107}}
{"t":{"$date":"2021-07-20T03:25:04.664+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn26","msg":"Slow query","attr":{"type":"command","ns":"catapult.blocks","command":{"count":"blocks","lsid":{"id":{"$uuid":"2fcc8a2e-1b89-43bd-ad2d-d9cca9e64cdf"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}
{"t":{"$date":"2021-07-20T03:25:04.664+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn23","msg":"Slow query","attr":{"type":"command","ns":"catapult.transactions","command":{"count":"transactions","lsid":{"id":{"$uuid":"666bae97-66c8-4170-8866-625abffa688a"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}
{"t":{"$date":"2021-07-20T03:25:04.841+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn25","msg":"Slow query","attr":{"type":"command","ns":"catapult.accounts","command":{"count":"accounts","lsid":{"id":{"$uuid":"e8ae4459-9cda-45b2-8098-f7003ae1a43a"}},"$db":"catapult"},"planSummary":"RECORD_STORE_FAST_COUNT","keysExamined":0,"docsExamined":0,"numYields":0,"reslen":45,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":696}}
Here are logs below, that list frequently.
2021-07-20 06:33:52.927090 0x00007f40e3fff700:
@alex-core-0x Thanks for the information above. Its not enough information to determine whats going on. Is it possible to get full logs?
The slow operation detected:
should not cause the issue with the mongodb.
What is your node resources setup(CPU, RAM)?
Do you know when the mongodb crashes? i.e. what going on the server at the time.
There's an issue with the Symbol node. If the broker can't connect to MongoDB (for some reason), then it crashes the node.
The only way after that - is to make resetData. Broker and node try to recover, but it doesn't work for them.