Closed by chadsr 2 years ago
Absolutely agree. Bee should terminate or try to recover when any error occurs. I've been hitting this very often while testing lately: whenever there is an issue with a Bee node or with Nethermind, I end up with nodes offline that a simple automatic container restart could have fixed without intervention.
Unable to reproduce the described behavior with the following comcast settings:
comcast --device=en0 --latency=250 --target-bw=1000 --packet-loss=30%
comcast --device=en0 --latency=300 --target-bw=1000 --packet-loss=30%
comcast --device=en0 --latency=300 --target-bw=1000 --packet-loss=40%
comcast --device=en0 --latency=300 --target-bw=1000 --packet-loss=50%
comcast --device=en0 --latency=500 --target-bw=100 --packet-loss=50%
In the last case, the node refused to start because it could not connect to the swap endpoint.
@Chadsr could you please provide the settings (flags) of the tool you used to simulate the "bad network conditions", the Linux distro and its version, and the Docker image and its version (maybe share your docker-compose config)? Did you run the network tool inside the Docker container or on the host?
Also, if this happens again, it would be great if you could terminate the process with kill -ABRT <pid> and copy-paste the output here. Or, when running Bee in Docker, list the container name with docker container ls, execute docker kill --signal=ABRT <CONTAINER_NAME>, and copy-paste the output here.
You'll actually need to use SIGQUIT, not SIGABRT. On Linux at least, SIGABRT doesn't work correctly (it does not yield the stack traces).
@mrekucci After trying to reproduce this with a completely fresh Bee setup (with and without packet loss) but the same xDai node, I noticed I still get the error:
failed syncing event listener, shutting down node err: get: get batch 9da3a7813977b7722e59de8826077a44736116c4aa5c9f31676da812870a9039: storage: not found
The batch ID was identical every time.
Therefore, it seems likely that my xDai/Gnosis node suffered DB corruption during testing. I will report back once I'm done re-syncing it.
In the meantime, I would suggest considering this issue in a more generalised form of "Bee hangs on irrecoverable errors".
@Chadsr we found this while digging into the shutdown sequence. Could it be that the node was trying to shut down while doing the initial sync?
Should be fixed with #2944
This has not been solved @acud @agazso. I'm running Bee 1.6.1 in Docker, and in my case this happens when the Nethermind node can't stay in sync with the network for some time. It's not a common case, but Bee should handle it as specified.
Console logs:
time="2022-06-04T11:20:07Z" level=warning msg="listener: could not get block number: dial tcp 172.30.0.4:8546: connect: connection refused"
time="2022-06-04T11:20:07Z" level=error msg="failed syncing event listener, shutting down node err: postage syncing stalled"
time="2022-06-04T11:20:07Z" level=info msg="api shutting down"
time="2022-06-04T11:20:07Z" level=info msg="pusher shutting down"
time="2022-06-04T11:20:07Z" level=info msg="puller shutting down"
time="2022-06-04T11:20:07Z" level=info msg="pull syncer shutting down"
time="2022-06-04T11:20:07Z" level=warning msg="chainsyncer: failed getting block height for challenge: context canceled"
time="2022-06-04T11:20:07Z" level=error msg="could not get price: dial tcp 172.30.0.4:8546: connect: connection refused"
time="2022-06-04T11:20:07Z" level=info msg="attempting to connect to peer 4a0e074226a3c8b6b6b325d776a7dcd6c9935761fe755f8516beb1f64021b8fe"
time="2022-06-04T11:20:07Z" level=info msg="attempting to connect to peer 3b17f49cf1f1949493592d50c2c066e82bc105f26423423fbe7d6ed18183bfe7"
(...)
time="2022-06-04T11:20:07Z" level=warning msg="peer not reachable when attempting to connect"
time="2022-06-04T11:20:07Z" level=info msg="attempting to connect to peer 4333528f7261e8640969325279bb2d7859741e6f0aff38fab4b671579413e81d"
time="2022-06-04T11:20:07Z" level=warning msg="peer not reachable when attempting to connect"
time="2022-06-04T11:20:07Z" level=warning msg="peer not reachable when attempting to connect"
time="2022-06-04T11:20:09Z" level=warning msg="peer not reachable when attempting to connect"
time="2022-06-04T11:20:14Z" level=info msg="kademlia shutting down"
time="2022-06-04T11:20:14Z" level=info msg="kademlia persisting peer metrics"
Logs from ABRT signal: abrtlogs.txt
Running Bee 1.6.1 in docker...
@tmm360 can you please confirm that the container fails to exit and hangs right there, after the line "kademlia persisting peer metrics"?
@notanatol correct!
Context
Bee 1.5.1-d0a77598 (Full Node) / Linux 5.17.4-200.fc35.x86_64 / Docker (Compose V2)
Summary
Bee hangs when faced with unstable network conditions.
Expected behavior
Bee is able to either cleanly exit (with error), or attempt to start itself up again, when faced with unstable network conditions.
Actual behavior
Bee hangs after shutting down some of its internal services:
This is the final output before things grind to a halt.
I also attempted to remove statestore/localstore and try with a fresh start in case this was due to DB corruption, but the same output occurs.
Steps to reproduce
(...) trickle or comcast, etc.)
Possible solution