Closed stakesafebart closed 1 year ago
Graves Validation - Mainnet experienced this during the wee hours of the morning last night when not in session and --apparently -- automatically recovered without restart. GV-Mainnet+2023-04-03+11-10-08+azero.log
Hi, we're currently investigating provided logs. We'll get back to you as soon as we can.
I got the same issue, now a couple of times. My Grafana was alerted me a block finalizing >25. At Sunday I got multiple errors between night. Then I was restart and the issue was fixed. Today I got same issue again. After restart everything worked fine.
Hi, there's an issue that sometimes causes justifications (finality proofs) to not be distributed between nodes, which then causes a node's finalization stuck. The workaround is a node restart. We're working towards a fix, but it is blocked by the major revamp of justification sync code that we'll be part of the next release.
may I ask if we know why sometimes (most of the time?) nodes self-correct this problem? I've tried to look at the logs whenever a node of mine jumps from a large gap to the usual single-digits one, but can't see anything that would explain that. for instance:
Hi, it's most likely a manifestation of the issue and the fact it's only a 1-2 minute spike in gap might come from different reasons like network condition, current chain state of DAG etc.
Last night I noticed a nice pattern in gap "spikes". The same 16 block gap spike repeated neatly every 900s (15m) on my mainnet validator node. It's the first time I noticed this kind of regularity, I thought it was maybe worth mentioning. PDT time.
2023-04-08 02:57:49 44323215 44323199 16
2023-04-08 03:12:49 44324115 44324099 16
2023-04-08 03:27:49 44325015 44324999 16
2023-04-08 03:42:49 44325915 44325899 16
Thanks for the note Luca. The pattern is indeed interesting. I think it's some kind of superposition of how AlephBFT works, and how the nodes request justifications at some regular intervals and for particular targets. I believe after the rewrite of the justification sync system that we are cooking for the next release, things should improve quite a bit.
One more "gap" pattern from my mainnet validator. This one is different though: the same gap = 20 is repeated for about 3 minutes (blocks stuck at best=45362654, finalized=45362634), then it disappears and go back to the usual noise. PDT time.
The latest release likely fixed these issues, so if someone is still experiencing these problems on testnet and after updating to 11.1. In the absence of reports this will be closed sometime next Friday, 2023-06-09.
It seems there are no further complaints, so closing.
Did you read the documentation and guides?
Is there an existing issue?
Description of the problem
Several aleph-node runners have reported becoming stalled on a FinalizedBlock.
Based on the information provided, this seems to occur on the first or second block of a new session.
When it happens on the first block, the node appears to have created one of the last blocks of the previous session.
There seems to be a connection between the numbers. Hopefully, this information helps to track down the bug.
SC0T-1 stuck on #41795100
41795100 (Proposal from AZF/TURING) was the first block of session 46439
SC0T-1 was in the previous committee 46438 and the last block it proposed was #41795096 (4 blocks ahead)
Piconbello stuck on #42563700 42563700 (Proposal from AZF/FEYNMAN) -> first block of session 47293 Piconbello was in the previous committee 47292 and the last block it proposed was #42563696 (4 blocks ahead)
StakeSafe Amsterdam stuck on #42022801
42022801 (Proposal from AZF/LOVELACE) was the second block of session 46692
StakeSafe Amsterdam was not in the previous committee 45726
StakeSafe Rotterdam stuck on #41155201
41155201 (Proposal from Contador) was the second block of session 45728
StakeSafe Rotterdam was not in the previous committee 45727
StakeSafe Almere stuck on #27009901 Aleph Zero Testnet
27009901 (Proposal from AZF/FERMAT) was the second block of session 30011
StakeSafe Almere was not in the previous committee 30010
StakeSafe Almere stuck on #27171901 Aleph Zero Testnet
27171901 (Proposal from AZF/Hilbert) was the second block of session session 30191
StakeSafe Almere was not in the previous committee 30190
Information on your setup.
This happened at Testnet and Mainnet with several configurations.
Steps to reproduce
We are not able to reproduce. But we are monitoring all cases.
Did you attach relevant logs?