@tmm360 thanks for the issue. Apart from the Kademlia shutdown, could you please elaborate on which critical issues you specifically encounter for which you think the process should be terminated when they arise?
@mrekucci I don't see any currently, other than the Kademlia shutdown. But this is a specific case of a general behavior, in my opinion. The node should always be killed on critical errors if it can't recover internally. The process should never enter an idle state.
From your comments, I assume that your problem is specifically related to the shutdown of the Kademlia, so let's elaborate more on that. I assume you are running the Bee node as a Docker container; would you mind sharing your Docker file? If the process is hanging, what preceded the Kademlia shutdown? Was it spontaneous or triggered manually by sending a Linux signal, etc.? Also, would you mind sharing the logs?
Yes, I'm trying to improve the system's resiliency to errors.
Sometimes there are issues in the architecture; for example, the swap endpoint may be unreachable for a while but become available again later. (This is an example that happened today.)
I'm currently using Docker Swarm (lol, I like the name overload) as the orchestrator. A Nethermind node is running on my cluster and is connected to the Bee node through an internal network.
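Roughly, the deployment looks like the sketch below (simplified and illustrative only; the service names, image names, and the BEE_SWAP_ENDPOINT variable are assumptions, not copied from my real stack):

# Hypothetical Swarm topology: one overlay network shared by Nethermind and Bee.
docker network create --driver overlay --attachable bee-net

# Blockchain backend (image and configuration heavily simplified).
docker service create --name nethermind-xdai --network bee-net \
  nethermind/nethermind

# Bee node pointing its swap endpoint at the Nethermind service over the internal network
# (assuming the swap endpoint can be set via the BEE_SWAP_ENDPOINT environment variable).
docker service create --name bee --network bee-net \
  -e BEE_SWAP_ENDPOINT=ws://nethermind-xdai:8546 \
  ethersphere/bee start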
If for some reason Nethermind stops replying, I get logs like this from Bee:
level=warning msg="listener: could not get block number: dial tcp 10.0.8.2:8546: connect: connection refused
or it could be like this:
level=warning msg="listener: could not get block number: dial tcp: lookup nethermind-xdai on 127.0.0.11:53: no such host"
Anyway, after a while of "could not get block number" errors, this happens:
time="2022-04-26T00:13:46Z" level=error msg="failed syncing event listener, shutting down node err: postage syncing stalled"
time="2022-04-26T00:13:46Z" level=info msg="api shutting down"
time="2022-04-26T00:13:46Z" level=info msg="pusher shutting down"
time="2022-04-26T00:13:46Z" level=info msg="puller shutting down"
time="2022-04-26T00:13:46Z" level=info msg="pull syncer shutting down"
time="2022-04-26T00:13:46Z" level=info msg="kademlia shutting down"
time="2022-04-26T00:13:46Z" level=info msg="kademlia persisting peer metrics"
and the process is kept alive instead of terminating.
Duplicate of #2902
All right, I was not sure.
It would be great if, the next time this happens, you terminate the process with kill -ABRT <pid>
and copy-paste the output to #2902. Or, when running Bee in Docker, list the container name with docker container ls
and execute docker kill --signal=ABRT <CONTAINER_NAME>
to get the desired output.
Summary
Currently the process is kept alive, but this makes it impossible to handle the error without an active check. Please also see this issue: https://github.com/ethersphere/bee/issues/2556
Motivation
Container orchestrators can try to restart the container automatically, but it has to terminate first. Any blocking issue should terminate the process, so that IF the process is recoverable with a restart, the orchestrator can perform it automatically. Otherwise, a script has to actively check whether the node is reachable and, if it isn't, try to kill and restart it. That is not very user friendly, it adds complexity, and I don't see the rationale behind choosing not to kill the process.
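As an illustration of the workaround this currently forces, something like the following watchdog has to run next to the node (a rough sketch only; the port, the /health path, and the service name are assumptions about my setup):

#!/bin/sh
# Hypothetical watchdog: the container still reports "running" after the internal
# shutdown, so an external script has to probe the node and force a restart itself.
while true; do
  if ! curl -fsS http://localhost:1635/health > /dev/null; then
    echo "bee node unreachable, forcing a service restart"
    docker service update --force bee
  fi
  sleep 60
done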
Implementation
Simply kill the process when a critical issue occurs on an active node.
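For example, if the node simply exited with a non-zero status, a standard Swarm restart policy would already cover the recoverable cases (a sketch; the flag values and the service name are illustrative):

# Let the orchestrator restart the service whenever the process exits with an error.
docker service update \
  --restart-condition on-failure \
  --restart-delay 10s \
  bee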
Drawbacks
Don't see any.