@tmm360 thanks for the issue. Apart from the Kademlia shutdown, could you please elaborate on which critical issues you specifically encounter for which you think the process should be terminated when they arise?
@mrekucci I don't see any currently, other than the Kademlia shutdown. But this is a specific case of a general behavior, in my opinion. The node should always be killed on critical errors if it can't recover internally. The process should never enter an idle state.
From your comments, I assume that your problem is specifically related to the shutdown of the Kademlia, so let's elaborate more on that. I assume you are running the Bee node as a Docker container; would you mind sharing your Docker file? If the process is hanging, what preceded the Kademlia shutdown? Was it spontaneous or triggered manually by sending a Linux signal, etc.? Also, would you mind sharing the logs?
Yes, I'm trying to improve the system's resiliency to errors.
Sometimes there are issues in the architecture; for example, the swap endpoint may be unreachable for a while but become available again later. (This is an example that happened today.)
I'm currently using Docker Swarm (lol, I like the name overload) as the orchestrator. A Nethermind node is running on my cluster and is connected to the Bee node through an internal network.
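Roughly, the deployment looks like the sketch below (simplified and illustrative only; the service names, image names, and the BEE_SWAP_ENDPOINT variable are assumptions, not copied from my real stack):

# Hypothetical Swarm topology: one overlay network shared by Nethermind and Bee.
docker network create --driver overlay --attachable bee-net

# Blockchain backend (image and configuration heavily simplified).
docker service create --name nethermind-xdai --network bee-net \
  nethermind/nethermind

# Bee node pointing its swap endpoint at the Nethermind service over the internal network
# (assuming the swap endpoint can be set via the BEE_SWAP_ENDPOINT environment variable).
docker service create --name bee --network bee-net \
  -e BEE_SWAP_ENDPOINT=ws://nethermind-xdai:8546 \
  ethersphere/bee start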
If for some reason Nethermind stops replying, I get logs like this from Bee:
level=warning msg="listener: could not get block number: dial tcp 10.0.8.2:8546: connect: connection refused
or it could be like this:
level=warning msg="listener: could not get block number: dial tcp: lookup nethermind-xdai on 127.0.0.11:53: no such host"
Anyway, after a while of "could not get block number" errors, this happens:
time="2022-04-26T00:13:46Z" level=error msg="failed syncing event listener, shutting down node err: postage syncing stalled"
time="2022-04-26T00:13:46Z" level=info msg="api shutting down"
time="2022-04-26T00:13:46Z" level=info msg="pusher shutting down"
time="2022-04-26T00:13:46Z" level=info msg="puller shutting down"
time="2022-04-26T00:13:46Z" level=info msg="pull syncer shutting down"
time="2022-04-26T00:13:46Z" level=info msg="kademlia shutting down"
time="2022-04-26T00:13:46Z" level=info msg="kademlia persisting peer metrics"
and the process is kept alive instead of terminating.
Duplicate of #2902
All right, I was not sure.
It would be great if, the next time this happens, you terminate the process with kill -ABRT <pid>
and copy-paste the output to #2902. Or, when running Bee in Docker, list the container name with docker container ls
and execute docker kill --signal=ABRT <CONTAINER_NAME>
to get the desired output.
Summary
Currently the process is kept alive, but this makes it impossible to handle the error without an active check. Please also see this issue: https://github.com/ethersphere/bee/issues/2556
Motivation
Container orchestrators can try to restart the container automatically, but it has to terminate first. Any blocking issue should terminate the process, so that IF the process is recoverable with a restart, the orchestrator can perform it automatically. Otherwise, a script has to actively check whether the node is reachable and, if it isn't, try to kill and restart it. That is not very user friendly, it adds complexity, and I don't see the rationale behind choosing not to kill the process.
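As an illustration of the workaround this currently forces, something like the following watchdog has to run next to the node (a rough sketch only; the port, the /health path, and the service name are assumptions about my setup):

#!/bin/sh
# Hypothetical watchdog: the container still reports "running" after the internal
# shutdown, so an external script has to probe the node and force a restart itself.
while true; do
  if ! curl -fsS http://localhost:1635/health > /dev/null; then
    echo "bee node unreachable, forcing a service restart"
    docker service update --force bee
  fi
  sleep 60
done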
Implementation
Simply kill the process when a critical issue occurs on an active node.
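For example, if the node simply exited with a non-zero status, a standard Swarm restart policy would already cover the recoverable cases (a sketch; the flag values and the service name are illustrative):

# Let the orchestrator restart the service whenever the process exits with an error.
docker service update \
  --restart-condition on-failure \
  --restart-delay 10s \
  bee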
Drawbacks
Don't see any.