ava-labs / avalanchego

Go implementation of an Avalanche node.
https://avax.network
BSD 3-Clause "New" or "Revised" License

WARN block processing too long #2552

Open tzmmtz opened 10 months ago

tzmmtz commented 10 months ago

Please see issue #2337 for a similar problem, as it has just started happening again. These log issues seem to show up every three weeks. I will have to stop and bootstrap again, otherwise the node will fail if I leave it running.

There were no replies to my previous issue, reported three weeks ago.

Attached is a log file showing errors beginning December 26th: FailingLog.txt

Plenty of SSD space available (450 GB), 48 GB RAM, and a 6-core Mac Pro.

StephenButtolph commented 10 months ago

block processing too long is a top-level warning that the node is taking longer than we would normally expect to finalize a block.
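For intuition, here is a minimal sketch (not the actual avalanchego code) of the pattern behind this kind of warning: time the operation and log when it exceeds an expected threshold. The threshold and message format below are hypothetical.

```go
package main

import (
	"log"
	"time"
)

// maxExpected is a hypothetical threshold; the real node's expectation
// is not reproduced here.
const maxExpected = 100 * time.Millisecond

// processBlock runs the supplied work and warns if it took too long.
func processBlock(work func()) {
	start := time.Now()
	work()
	if elapsed := time.Since(start); elapsed > maxExpected {
		log.Printf("WARN block processing too long {\"processingTime\": \"%s\"}", elapsed)
	}
}

func main() {
	// Simulate a block that takes longer than expected to finalize.
	processBlock(func() { time.Sleep(150 * time.Millisecond) })
}
```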

This can happen for a number of reasons; slow disk IO is the most common culprit we see.

The only really interesting log in your log dump is:

[12-26|20:08:04.120] WARN <C Chain> handler/handler.go:486 handling sync message took longer than expected {"processingTime": "33.503478495s", "msgHandlingTime": "33.503477219s", "nodeID": "NodeID-BSDLKewV1jqsYvNWDBkF6cXDAksVfZuM4", "messageOp": "put"}

This seems to indicate that handling a single Put message took an extremely long time. This is not the first time we've seen this log; generally, the suspicion falls upon the speed of the disk. However, this log is more a symptom than the actual problem. One way to test the disk directly is sketched below.
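If disk speed is the suspicion, a quick way to get a ballpark number is to time a batch of fsync'd small writes against the volume the database lives on. This is only a rough sketch, not a substitute for fio or the node's own metrics; the directory argument is whatever path your database actually uses.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: disk_check <dir on the db volume>")
		os.Exit(1)
	}
	dir := os.Args[1]
	buf := make([]byte, 4096)
	const writes = 256

	start := time.Now()
	for i := 0; i < writes; i++ {
		f, err := os.Create(filepath.Join(dir, fmt.Sprintf("probe-%d.tmp", i)))
		if err != nil {
			panic(err)
		}
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		if err := f.Sync(); err != nil { // force the write down to the disk
			panic(err)
		}
		f.Close()
		os.Remove(f.Name())
	}
	elapsed := time.Since(start)
	fmt.Printf("%d fsync'd 4KiB writes in %s (%.2f ms avg)\n",
		writes, elapsed, elapsed.Seconds()*1000/writes)
}
```

On a healthy NVMe SSD the average should be on the order of a millisecond or less; consistently high fsync latency would support the disk-speed theory.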

Without metrics from the metrics API, I'm not really sure how to debug this further. With the metrics output, we may see that the average disk IO speed is much slower than expected.
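avalanchego exposes Prometheus-format metrics at the /ext/metrics endpoint, so pulling them from a local node is straightforward. A minimal sketch, assuming a default node on port 9650 (the "db_" substring filter is just a guess at the relevant metric names; adjust to taste):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// /ext/metrics is the node's standard metrics API endpoint.
	resp, err := http.Get("http://127.0.0.1:9650/ext/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 1<<20), 1<<20) // the metrics dump can be large
	for scanner.Scan() {
		line := scanner.Text()
		// Print non-comment lines that look database-related.
		if strings.Contains(line, "db_") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
}
```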

If the metrics don't point to a clear culprit, we could try enabling tracing. However, tracing will likely be much more involved.

One question on your comment:

otherwise NODE will fail if I leave it running.

What do you mean by this? Aside from the original panic (which was fixed by v1.10.15), I haven't seen anything in the provided logs that would indicate the node crashing. Is there any other unexpected behavior you have observed, other than the warning logs or the original panic?

I'm going to close your prior issue (#2337) as it was originally created for a bug that was resolved.

hitchhooker commented 9 months ago

[02-01|03:40:23.847] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 375654, "numToExecute": 29544046, "eta": "2214h45m50s"}
[02-01|03:40:54.009] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 375737, "numToExecute": 29544046, "eta": "2214h55m7s"}
root@avalanche05:/opt/avalanche/storage/logs# lscpu | grep Model
Model name:                         AMD Ryzen 9 7950X3D 16-Core Processor

C-Chain block execution seems heavily bottlenecked at ~2.5 blocks per second. Is it really expected behavior that syncing an archive node takes 3+ months on latest-generation machines? A rough check of the numbers above follows.
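A back-of-the-envelope check of the two log lines above, using the timestamps and counters exactly as logged (everything else is plain arithmetic):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps and counters copied from the two log lines.
	t1, _ := time.Parse("15:04:05.000", "03:40:23.847")
	t2, _ := time.Parse("15:04:05.000", "03:40:54.009")
	executed := 375737 - 375654 // 83 blocks between the two lines

	rate := float64(executed) / t2.Sub(t1).Seconds() // ~2.75 blocks/s

	remaining := 29544046 - 375737
	eta := time.Duration(float64(remaining)/rate) * time.Second
	fmt.Printf("rate: %.2f blocks/s, remaining: %s (~%.0f days)\n",
		rate, eta, eta.Hours()/24)
}
```

At ~2.75 blocks/s, the ~29.2M remaining blocks work out to roughly 120 days, the same order as the node's own multi-month ETA.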

github-actions[bot] commented 7 months ago

This issue has become stale because it has been open 60 days with no activity. Adding the lifecycle/frozen label will cause this issue to ignore lifecycle events.

hitchhooker commented 7 months ago

[04-14|05:58:31.791] INFO <C Chain> queue/jobs.go:202 executing operations {"numExecuted": 6206626, "numToExecute": 6451454, "eta": "11h4m28s"}

To give an update here: executing the C-Chain blocks takes approximately three months, if anyone wondered, on the latest Ryzen @ 5.1 GHz with Corsair MP400 8 TB drives.

github-actions[bot] commented 4 months ago

This issue has become stale because it has been open 60 days with no activity. Adding the lifecycle/frozen label will cause this issue to ignore lifecycle events.