Bee hangs under unstable network conditions / timeouts

chadsr commented 2 years ago

Context

Bee 1.5.1-d0a77598 (Full Node) / Linux 5.17.4-200.fc35.x86_64 / Docker (Compose V2)

Summary

Bee hangs when faced with unstable network conditions.

Expected behavior

Bee should either exit cleanly (with an error) or attempt to start itself up again when faced with unstable network conditions.

Actual behavior

Bee hangs after shutting down some of its internal services:

time="2022-04-25T14:13:33Z" level=error msg="failed syncing event listener, shutting down node err: get: get batch 9da3a7813977b7722e59de8826077a44736116c4aa5c9f31676da812870a9039: storage: not found"
time="2022-04-25T14:13:33Z" level=info msg="kademlia shutting down"
time="2022-04-25T14:13:38Z" level=warning msg="kademlia manage loop did not shut down properly"
time="2022-04-25T14:13:38Z" level=info msg="kademlia persisting peer metrics"
time="2022-04-25T14:13:38Z" level=debug msg="kademlia: Finalize(...) took 55.325µs"
time="2022-04-25T14:13:43Z" level=error msg="failed shutting down node: 1 error occurred:\n\t* topology driver: timeout\n\n"

This is the final output before things grind to a halt.

I also removed statestore/localstore and tried a fresh start in case this was due to DB corruption, but the same output occurs.

Steps to reproduce

Simulate bad network conditions (e.g. using trickle or comcast)
Run a local xDai node (fully synced before bad network conditions)
Start Bee under bad network conditions
Observe timeouts/halts, but no exit or retries to continue.

Possible solution

Bee should (fully) exit with an error
Bee should attempt to reconnect the various components that timed out / errored.

tmm360 commented 2 years ago

Absolutely agree. Bee should terminate or try to recover when any error occurs. I've been running into this very often lately: because of an issue on a Bee node or an issue with Nethermind, I end up with nodes offline, where a simple automatic container restart could have solved the problem without intervention.

mrekucci commented 2 years ago

Unable to reproduce the described behavior with the following comcast settings:

comcast --device=en0 --latency=250 --target-bw=1000 --packet-loss=30%
comcast --device=en0 --latency=300 --target-bw=1000 --packet-loss=30%
comcast --device=en0 --latency=300 --target-bw=1000 --packet-loss=40%
comcast --device=en0 --latency=300 --target-bw=1000 --packet-loss=50%
comcast --device=en0 --latency=500 --target-bw=100 --packet-loss=50%

In the last case, the node refused to start because it could not connect to the swap endpoint.

@Chadsr could you please provide the settings (flags) of the tool you used to simulate the "bad network conditions", the Linux distro and its version, and the Docker image and its version (maybe share your docker-compose config)? Did you run the network tool inside the Docker container or on the host?

Also, the next time this happens, it would be great if you could terminate the process with kill -ABRT <pid> and copy-paste the output here. Or, when running Bee in Docker, list the container name with docker container ls, execute docker kill --signal=ABRT <CONTAINER_NAME>, and copy-paste the output here.

acud commented 2 years ago

You'll actually need to use SIGQUIT, not SIGABRT. On Linux at least, SIGABRT doesn't work correctly (it does not yield the stack traces).
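
For reference, the stack traces in question are the per-goroutine dumps that the Go runtime prints by default when a process receives SIGQUIT. The same kind of dump can be produced from inside a program with runtime.Stack. The sketch below is illustrative only; allGoroutineStacks is a hypothetical helper, not a Bee function.

```go
package main

import (
	"fmt"
	"runtime"
)

// allGoroutineStacks returns the stack traces of all goroutines,
// growing the buffer until the whole dump fits.
func allGoroutineStacks() []byte {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true) // true: include every goroutine
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}

func main() {
	fmt.Printf("%s", allGoroutineStacks())
}
```

In a hung node, a dump like this shows which goroutines are still blocked and where, which is what is needed to find the component that never finishes shutting down.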

chadsr commented 2 years ago

@mrekucci

After trying to reproduce this with a completely fresh bee setup (with and without packet loss), but the same xDai node, I noticed I still get the error: failed syncing event listener, shutting down node err: get: get batch 9da3a7813977b7722e59de8826077a44736116c4aa5c9f31676da812870a9039: storage: not found, where the batch ID was always identical.

Therefore, it seems likely that my xDai/Gnosis node suffered DB corruption during testing. I will report back once I'm done re-syncing it.

In the meantime, I would suggest considering this issue in a more generalised form of "Bee hangs on irrecoverable errors".

acud commented 2 years ago

@Chadsr we found this while digging into the shutdown sequence. Could it be that the node was trying to shut down while doing the initial sync?

acud commented 2 years ago

Should be fixed with #2944

tmm360 commented 2 years ago

This has not been solved, @acud @agazso. I'm running Bee 1.6.1 in Docker, and in my case this happens when the Nethermind node can't stay in sync with the network for some time. It's not a common case, but Bee should handle it as specified.

Console logs:

time="2022-06-04T11:20:07Z" level=warning msg="listener: could not get block number: dial tcp 172.30.0.4:8546: connect: connection refused"
time="2022-06-04T11:20:07Z" level=error msg="failed syncing event listener, shutting down node err: postage syncing stalled"
time="2022-06-04T11:20:07Z" level=info msg="api shutting down"
time="2022-06-04T11:20:07Z" level=info msg="pusher shutting down"
time="2022-06-04T11:20:07Z" level=info msg="puller shutting down"
time="2022-06-04T11:20:07Z" level=info msg="pull syncer shutting down"
time="2022-06-04T11:20:07Z" level=warning msg="chainsyncer: failed getting block height for challenge: context canceled"
time="2022-06-04T11:20:07Z" level=error msg="could not get price: dial tcp 172.30.0.4:8546: connect: connection refused"
time="2022-06-04T11:20:07Z" level=info msg="attempting to connect to peer 4a0e074226a3c8b6b6b325d776a7dcd6c9935761fe755f8516beb1f64021b8fe"
time="2022-06-04T11:20:07Z" level=info msg="attempting to connect to peer 3b17f49cf1f1949493592d50c2c066e82bc105f26423423fbe7d6ed18183bfe7"

(...)

time="2022-06-04T11:20:07Z" level=warning msg="peer not reachable when attempting to connect"
time="2022-06-04T11:20:07Z" level=info msg="attempting to connect to peer 4333528f7261e8640969325279bb2d7859741e6f0aff38fab4b671579413e81d"
time="2022-06-04T11:20:07Z" level=warning msg="peer not reachable when attempting to connect"
time="2022-06-04T11:20:07Z" level=warning msg="peer not reachable when attempting to connect"
time="2022-06-04T11:20:09Z" level=warning msg="peer not reachable when attempting to connect"
time="2022-06-04T11:20:14Z" level=info msg="kademlia shutting down"
time="2022-06-04T11:20:14Z" level=info msg="kademlia persisting peer metrics"

Logs from ABRT signal: abrtlogs.txt

notanatol commented 2 years ago

Running Bee 1.6.1 in docker...

@tmm360 can you please confirm that the container fails to exit and hangs right after this line:

kademlia persisting peer metrics

tmm360 commented 2 years ago

@notanatol correct!
