Cardinal-Cryptography / aleph-node-issues

Issue tracker for aleph-node related problems.

Analysis of block numbers where nodes reported to be stalled on #3

Closed stakesafebart closed 1 year ago

stakesafebart commented 1 year ago

Did you read the documentation and guides?

Is there an existing issue?

Description of the problem

Several aleph-node runners have reported becoming stalled on a FinalizedBlock.

Based on the information provided, this seems to occur on the first or second block of a new session.

When it happens on the first block, the node appears to have created one of the last blocks of the previous session.

There seems to be a connection between these numbers; hopefully this information helps to track down the bug. (A small cross-check of the session arithmetic is sketched after the list of cases below.)

SC0T-1 stuck on #41795100

41795100 (Proposal from AZF/TURING) was the first block of session 46439

SC0T-1 was in the previous committee 46438 and the last block it proposed was #41795096 (4 blocks before the block it got stuck on)

Piconbello stuck on #42563700

42563700 (Proposal from AZF/FEYNMAN) was the first block of session 47293

Piconbello was in the previous committee 47292 and the last block it proposed was #42563696 (4 blocks before the block it got stuck on)

StakeSafe Amsterdam stuck on #42022801

42022801 (Proposal from AZF/LOVELACE) was the second block of session 46692

StakeSafe Amsterdam was not in the previous committee 46691

StakeSafe Rotterdam stuck on #41155201

41155201 (Proposal from Contador) was the second block of session 45728

StakeSafe Rotterdam was not in the previous committee 45727

StakeSafe Almere stuck on #27009901 (Aleph Zero Testnet)

27009901 (Proposal from AZF/FERMAT) was the second block of session 30011

StakeSafe Almere was not in the previous committee 30010

StakeSafe Almere stuck on #27171901 (Aleph Zero Testnet)

27171901 (Proposal from AZF/Hilbert) was the second block of session 30191

StakeSafe Almere was not in the previous committee 30190
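
For reference, below is a minimal cross-check of the session arithmetic behind the cases above. The 900-block session length is an assumption inferred from the reported numbers themselves (e.g. 41795100 / 900 = 46439), not read from the chain configuration.

```python
# Cross-check of the reported stall points, assuming 900-block sessions
# (an assumption consistent with the numbers above, not read from chain config).
SESSION_LENGTH = 900

# (node, block it got stuck on)
stalls = [
    ("SC0T-1", 41795100),
    ("Piconbello", 42563700),
    ("StakeSafe Amsterdam", 42022801),
    ("StakeSafe Rotterdam", 41155201),
    ("StakeSafe Almere (testnet)", 27009901),
    ("StakeSafe Almere (testnet)", 27171901),
]

for node, block in stalls:
    session = block // SESSION_LENGTH   # session the block belongs to
    offset = block % SESSION_LENGTH     # 0 = first block of the session
    print(f"{node}: block {block} is block {offset + 1} of session {session}")
```

Running this reproduces the pattern described above: every reported stall lands on either the first or the second block of a session.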

Information on your setup.

This happened on both Testnet and Mainnet with several different configurations.

Steps to reproduce

We are not able to reproduce it, but we are monitoring all cases.

Did you attach relevant logs?

sj-graves commented 1 year ago

Graves Validation - Mainnet experienced this during the wee hours of the morning last night while not in session and, apparently, automatically recovered without a restart. GV-Mainnet+2023-04-03+11-10-08+azero.log

Marcin-Radecki commented 1 year ago

Hi, we're currently investigating the provided logs. We'll get back to you as soon as we can.

tspring80 commented 1 year ago

I got the same issue, now a couple of times. My Grafana alerted me to a block finalization gap >25. On Sunday I got multiple alerts during the night. Then I restarted and the issue was fixed. Today I got the same issue again; after a restart everything worked fine.

[screenshot]
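
For anyone who wants to check this without Grafana, here is a minimal sketch that computes the best-to-finalized gap directly over the node's JSON-RPC interface. It assumes the standard Substrate RPC methods chain_getHeader and chain_getFinalizedHead and a local HTTP RPC endpoint at http://127.0.0.1:9933; the URL and the threshold of 25 (matching the alert above) are assumptions to adjust for your own setup.

```python
import json
import urllib.request

RPC_URL = "http://127.0.0.1:9933"  # assumed local HTTP RPC endpoint
GAP_THRESHOLD = 25                 # same threshold as the Grafana alert above


def rpc(method, params=None):
    """Call a JSON-RPC method on the node and return its result."""
    payload = json.dumps({"jsonrpc": "2.0", "id": 1, "method": method,
                          "params": params or []}).encode()
    req = urllib.request.Request(RPC_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]


# Best height from the current best header, finalized height from the
# header at the finalized head hash.
best = int(rpc("chain_getHeader")["number"], 16)
finalized_hash = rpc("chain_getFinalizedHead")
finalized = int(rpc("chain_getHeader", [finalized_hash])["number"], 16)

gap = best - finalized
print(f"best={best} finalized={finalized} gap={gap}")
if gap > GAP_THRESHOLD:
    print("WARNING: finalization appears to be stalled")
```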

Marcin-Radecki commented 1 year ago

Hi, there's an issue that sometimes causes justifications (finality proofs) not to be distributed between nodes, which then causes a node's finalization to get stuck. The workaround is a node restart. We're working towards a fix, but it is blocked by a major revamp of the justification sync code that will be part of the next release.
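
Until the fix is released, a rough sketch of automating the workaround is below. It uses the same RPC calls as the sketch earlier in this thread and assumes the node runs as a systemd service named aleph-node; the service name, endpoint, threshold, and intervals are all illustrative assumptions, and whether automatically restarting a validator is a good idea is a separate question.

```python
import json
import subprocess
import time
import urllib.request

RPC_URL = "http://127.0.0.1:9933"  # assumed local HTTP RPC endpoint
SERVICE = "aleph-node"             # hypothetical systemd unit name
GAP_THRESHOLD = 25                 # gap considered a potential stall
STALL_CHECKS = 5                   # consecutive bad checks before restarting
CHECK_INTERVAL = 60                # seconds between checks


def rpc(method, params=None):
    """Call a JSON-RPC method on the node and return its result."""
    payload = json.dumps({"jsonrpc": "2.0", "id": 1, "method": method,
                          "params": params or []}).encode()
    req = urllib.request.Request(RPC_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]


stalled = 0
while True:
    best = int(rpc("chain_getHeader")["number"], 16)
    finalized_hash = rpc("chain_getFinalizedHead")
    finalized = int(rpc("chain_getHeader", [finalized_hash])["number"], 16)
    stalled = stalled + 1 if best - finalized > GAP_THRESHOLD else 0
    if stalled >= STALL_CHECKS:
        # Apply the workaround described above: restart the node service.
        subprocess.run(["systemctl", "restart", SERVICE], check=True)
        stalled = 0
    time.sleep(CHECK_INTERVAL)
```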

QNODE-CA commented 1 year ago

May I ask if we know why sometimes (most of the time?) nodes self-correct this problem? I've tried to look at the logs whenever a node of mine jumps from a large gap back to the usual single-digit one, but I can't see anything that would explain it. For instance:

[screenshot: block gap self-correct]

Marcin-Radecki commented 1 year ago

Hi, it's most likely a manifestation of the same issue; the fact that it's only a 1-2 minute spike in the gap might come down to other factors like network conditions, the current chain state of the DAG, etc.

QNODE-CA commented 1 year ago

Last night I noticed a nice pattern in the gap "spikes". The same 16-block gap spike repeated neatly every 900 s (15 min) on my mainnet validator node. It's the first time I've noticed this kind of regularity, so I thought it was worth mentioning. Times are PDT.

| time (PDT) | best | finalized | gap |
| --- | --- | --- | --- |
| 2023-04-08 02:57:49 | 44323215 | 44323199 | 16 |
| 2023-04-08 03:12:49 | 44324115 | 44324099 | 16 |
| 2023-04-08 03:27:49 | 44325015 | 44324999 | 16 |
| 2023-04-08 03:42:49 | 44325915 | 44325899 | 16 |

[screenshot: azero mainnet block gap pattern, 15 min, crop]

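A quick cross-check under the same 900-block-session assumption used earlier in this thread: 900 blocks at roughly one block per second is 15 minutes, and in every row of the table above the stuck finalized height is the very last block of a session, so the spikes appear to line up with session boundaries. A minimal sketch:

```python
# Where do the "stuck" finalized heights from the table above fall within a
# session? Assumes 900-block sessions (same assumption as earlier in the thread).
SESSION_LENGTH = 900

finalized_heights = [44323199, 44324099, 44324999, 44325899]

for h in finalized_heights:
    session = h // SESSION_LENGTH
    offset = h % SESSION_LENGTH   # 899 means the last block of the session
    print(f"finalized {h}: block {offset + 1}/{SESSION_LENGTH} of session {session}")
```

Each spike would therefore correspond to finalization pausing briefly at the last block of a session, which looks consistent with the first/second-block-of-a-session pattern reported at the top of this issue.
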
DamianStraszak commented 1 year ago

Thanks for the note, Luca. The pattern is indeed interesting. I think it's some kind of superposition of how AlephBFT works and how the nodes request justifications at regular intervals and for particular targets. I believe that after the rewrite of the justification sync system we are cooking for the next release, things should improve quite a bit.

QNODE-CA commented 1 year ago

One more "gap" pattern from my mainnet validator. This one is different though: the same gap = 20 is repeated for about 3 minutes (blocks stuck at best=45362654, finalized=45362634), then it disappears and go back to the usual noise. PDT time.

[screenshot: azero mainnet block gap pattern 20, Apr 20 2023, cropped]

best-finalized_gap_20_repeating - apr-20-2023.csv

timorl commented 1 year ago

The latest release likely fixed these issues, so please report here if you are still experiencing these problems on testnet after updating to 11.1. In the absence of reports this issue will be closed next Friday, 2023-06-09.

timorl commented 1 year ago

It seems there are no further complaints, so closing.