Open mdelle1 opened 1 month ago
@raychu86 is this PR ready to go (assuming all tests pass)?
I did some light testing w/ a 4 validator devnet with these changes merged into the latest mainnet-staging and every time I restarted a validator or resynced a validator from genesis I encountered various errors:
~Interesting findings. I'm not able to reproduce after 5 tries on an M2 Max. Can you share the machine specs used? I assume a slower machine may induce these conditions.~ Correction, I can reproduce on M2 Max.
- `mainnet-staging` 81ca9cf6f: after 3 tries waiting 10 minutes, there was no issue.
- `mainnet-staging` merged in (resulting commit c4c9fb305), with deleting batch proposal cache: after waiting 4 minutes, I was also able to trigger it...
- `mainnet-staging` merged in (resulting commit c4c9fb305), without deleting batch proposal cache: after 2 tries waiting 5 minutes, I was also able to trigger it...

Since reported errors have not been addressed, holding off on merge for now, and therefore holding off on adding this prior to code freeze. If we want this, it will have to be after launch.
@mdelle1 Have you had a chance to take a look at the issues being observed with the change?
Motivation
This PR couples block sync to DAG state replication. When a node syncs via block responses, it updates its storage and DAG with the certificates contained in each block and then attempts to advance its ledger. Previously, there were scenarios where a node would commit certificates in its DAG without advancing blocks. The commit of certificates and the advancement of blocks during sync should instead happen together. This PR commits certificates in the DAG only when the sync module advances to the corresponding block, and it adds a channel from the sync module to the BFT to ensure that the leader certificate of the block being added was recently committed in the BFT.
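
The coupling described above can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: `SyncBlock`, `CommitRequest`, `spawn_bft`, and `sync_blocks` are hypothetical names, the "DAG" is reduced to a list of committed leader rounds, and the channel is a plain `std::sync::mpsc` channel standing in for whatever the sync module actually uses to reach the BFT. The only point it shows is the ordering: the ledger height moves forward only after the BFT confirms the leader commit for that block.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// NOTE: every type and function here is hypothetical and exists only to
// illustrate the ordering; none of these are snarkOS APIs.

/// Minimal stand-in for a block received in a block response.
#[derive(Clone, Debug)]
pub struct SyncBlock {
    pub height: u32,
    pub leader_round: u64,
}

/// Request sent from the sync module to the BFT, with a reply channel.
pub struct CommitRequest {
    pub leader_round: u64,
    pub reply: Sender<bool>,
}

/// Runs a toy BFT loop that commits a leader round only when asked to by sync.
/// Returns the list of committed rounds once all senders are dropped.
pub fn spawn_bft(rx: Receiver<CommitRequest>) -> thread::JoinHandle<Vec<u64>> {
    thread::spawn(move || {
        let mut committed: Vec<u64> = Vec::new();
        for req in rx {
            // Accept only rounds that move the committed state forward.
            let ok = committed.last().map_or(true, |last| req.leader_round > *last);
            if ok {
                committed.push(req.leader_round);
            }
            // The requester may have gone away; ignore a failed reply.
            let _ = req.reply.send(ok);
        }
        committed
    })
}

/// Advances through `blocks`, moving the ledger height forward only after the
/// BFT confirms that the block's leader round was committed.
/// Returns the last height that was advanced to (0 if none).
pub fn sync_blocks(blocks: &[SyncBlock], bft: &Sender<CommitRequest>) -> u32 {
    let mut height = 0;
    for block in blocks {
        let (reply_tx, reply_rx) = channel();
        let request = CommitRequest { leader_round: block.leader_round, reply: reply_tx };
        if bft.send(request).is_err() {
            break; // BFT loop is gone; stop syncing.
        }
        match reply_rx.recv() {
            Ok(true) => height = block.height, // commit confirmed, advance the block
            _ => break,                        // commit rejected or channel closed
        }
    }
    height
}
```

In this sketch a caller would create a channel, pass the receiver to `spawn_bft`, and pass the sender to `sync_blocks`. Because the commit and the height update happen in the same step, the toy DAG can never run ahead of the ledger during sync, which is the invariant the Motivation section describes.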