Open tzmmtz opened 10 months ago
block processing too long
is a top level warning that the node is taking longer than we would normally expect to finalize a block.
This can happen for a number of reasons (to name a few):
The only really interesting log in your log dump is:
[12-26|20:08:04.120] WARN <C Chain> handler/handler.go:486 handling sync message took longer than expected {"processingTime": "33.503478495s", "msgHandlingTime": "33.503477219s", "nodeID": "NodeID-BSDLKewV1jqsYvNWDBkF6cXDAksVfZuM4", "messageOp": "put"
This seems to indicate that handling a single Put
message took an extremely long amount of time. This is not the first time we've seen this log. Generally, the suspicion falls upon the speed of the disk. However, this log also is more a symptom than the actual problem.
Without metrics from the metrics API I'm not really sure how to debug the issue. Maybe with the metrics output we'll see that the average disk IO speed is much slower than expected.
If the metrics don't point to a clear culprit, we could try enabling tracing. However, tracing will likely be much more involved.
One question on your comment:
otherwise NODE will fail if I leave it running.
What do you mean by this? Aside from the original panic (which was fixed by v1.10.15
) I haven't seen anything in the provided logs that would indicate the node crashing. Is there any other unexpected behavior that you have observed than the warning logs or original panic?
I'm going to close your prior issue (#2337) as it was originally created for a bug that was resolved.
[02-01|03:40:23.847] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 375654, "numToExecute": 29544046, "eta": "2214h45m50s"}
[02-01|03:40:54.009] INFO <C Chain> queue/jobs.go:203 executing operations {"numExecuted": 375737, "numToExecute": 29544046, "eta": "2214h55m7s"}
root@avalanche05:/opt/avalanche/storage/logs# lscpu | grep Model
Model name: AMD Ryzen 9 7950X3D 16-Core Processor
C chain block execution seems really bottlenecked at ~2.5 blocks per second. Is this really expected behavior that syncing archive node takes +3 months with latest gen machines?
This issue has become stale because it has been open 60 days with no activity. Adding the lifecycle/frozen
label will cause this issue to ignore lifecycle events.
[04-14|05:58:31.791] INFO <C Chain> queue/jobs.go:202 executing operations {"numExecuted": 6206626, "numToExecute": 6451454, "eta": "11h4m28s"}
To give an update here executing C-chain blocks seems to be taking approx ~3 months if anyone wondered with latest Ryzen @ 5.1Ghz && Corsair MP400 8TBs
This issue has become stale because it has been open 60 days with no activity. Adding the lifecycle/frozen
label will cause this issue to ignore lifecycle events.
Please reference issue #2337 for similar issue as it just started happening again. Seems every three weeks log issues show up. I will have to stop and bootstrap again otherwise NODE will fail if I leave it running.
No replies to my previous issue reported 3 weeks ago.
Attached log file showing errors beginning December 26th. FailingLog.txt
Plenty of SSD space available (450GB), 48GB Ram and 6 Core Mac Pro.