fboucquez / symbol-bootstrap

A tool that allows you to quickly configure and setup Symbol testnets and nodes.
Apache License 2.0

Detect and recover from stopped synchronizing #175

Open realgarp opened 3 years ago

realgarp commented 3 years ago

During extended tests on the Testnet, some issues occurred on servers with restricted resources (RAM/CPU). The symptom was that synchronization with the rest of the network stopped. At the same time, "symbol-bootstrap healthCheck" incorrectly indicated that all was fine. See https://nem2.slack.com/archives/CF1KY4EJJ/p1613646168005400.

Stopping and restarting one of the servers did not result in a recovery. However, probably depending on how synchronization stopped and what state the db was in at the time, "symbol-bootstrap healthCheck" sometimes indicated that the server had restarted even though synchronization did not resume.

In all instances, log files were collected and examined. Feedback was received from Wayon Blair: "The broker was timed out while trying to update the mongo database."

If such cases occur, for whatever reason or cause:

First, it would be better for the Symbol software itself to be able to deal with restricted server resources and continue working, or at least to be able to recover from the situation when restarted.

Second, in case the software cannot deal with such a situation for whatever reason, symbol-bootstrap should be able to tell the node operator that there is a problem and that synchronization has stopped.

Third, it would be best to indicate what could have caused the issue and to suggest what action to take.

Thanks for considering... it would help future node owners.

fboucquez commented 3 years ago

Thanks @realgarp

1) The recovery issues would be related to https://github.com/nemtech/symbol-bootstrap/issues/108, which we are working on.

2) REST's health check validates the connection to MongoDB and the Server API. It doesn't check whether the node is synchronizing. Full synchronization, especially on this stress-tested testnet, takes a while. Let us think about how rest/bootstrap/server can tell us whether synchronization is happening. If you are a supernode owner, the node monitoring service will show you the status of your node.
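One possible direction, as a minimal sketch only: treat the node as stalled when its reported chain height has not increased for a configurable period. Nothing below exists in symbol-bootstrap today; `SyncWatchdog`, `record`, and `isStalled` are hypothetical names, and the sketch assumes the caller reads the height itself (for example from the REST gateway's chain information) and passes it in together with a timestamp.

```typescript
// Hypothetical helper, not part of symbol-bootstrap.
// Tracks the last time the observed chain height increased.
class SyncWatchdog {
  private stallAfterMs: number;
  private lastHeight = -1;
  private lastChangeMs = 0;
  private initialized = false;

  // stallAfterMs: how long the height may stay unchanged before we call it a stall.
  constructor(stallAfterMs: number) {
    this.stallAfterMs = stallAfterMs;
  }

  // Record a height observation taken at nowMs (milliseconds).
  // Only a strictly higher height resets the stall timer.
  record(height: number, nowMs: number): void {
    if (!this.initialized || height > this.lastHeight) {
      this.lastHeight = height;
      this.lastChangeMs = nowMs;
      this.initialized = true;
    }
  }

  // True when at least one sample was recorded and the height
  // has not increased for stallAfterMs or longer.
  isStalled(nowMs: number): boolean {
    return this.initialized && nowMs - this.lastChangeMs >= this.stallAfterMs;
  }
}

// Usage: sample the height periodically and check for a stall.
const watchdog = new SyncWatchdog(60_000); // assumed threshold: 1 minute
watchdog.record(100, 0);
watchdog.record(101, 30_000);
watchdog.record(101, 90_000); // height unchanged since t = 30 s
const stalled = watchdog.isStalled(90_000); // 60 s without progress
```

An unchanged height alone cannot distinguish a stalled node from a network that has stopped producing blocks, so a real check would probably also compare the local height against the height reported by one or more peers.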

3) This is case by case: if error X occurs, do Y. We cannot make generic recommendations without analyzing the logs.

@Wayonb, what's your input?