fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/

guardian_backup flake #5163

Closed: dpc closed this 1 day ago

dpc commented 2 weeks ago
 00:05:15 2024-04-29T18:18:05.504871Z  INFO devimint::tests: Caught up to block 102 of at least 122
00:05:15 thread 'main' panicked at /build/source/devimint/src/tests.rs:2266:6:
00:05:15 Peer didn't rejoin federation: Polling Peer catches up again failed after 99 retries (timeout: 60s)

https://github.com/fedimint/fedimint/actions/runs/8883243488/job/24389687956?pr=5159

When trying a PR that lowers the session time in tests from 2 minutes to around 10s, we hit a bug in guardian_backup: the peer is restarted and is expected to reach the same wallet module on-chain height as before the restart, but it doesn't get there.
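
For readers unfamiliar with the test, the failing check amounts to polling the restarted peer's block height until it reaches a target. Below is a minimal sketch of that pattern, not devimint's actual code: `poll_until_caught_up` and all of its parameters are hypothetical names made up for illustration, and the height source is just a closure.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (NOT devimint's API): polls `current_height` until it
/// reaches `target`, giving up after `max_retries` attempts or once `timeout`
/// has elapsed, whichever comes first.
fn poll_until_caught_up<F>(
    mut current_height: F,
    target: u64,
    max_retries: u32,
    timeout: Duration,
    interval: Duration,
) -> Result<u64, String>
where
    F: FnMut() -> u64,
{
    let start = Instant::now();
    let mut last_seen = 0;
    for attempt in 1..=max_retries {
        last_seen = current_height();
        if last_seen >= target {
            // The peer has reached (or passed) the expected height.
            return Ok(last_seen);
        }
        if start.elapsed() >= timeout {
            return Err(format!(
                "timed out after {attempt} attempts at block {last_seen} of at least {target}"
            ));
        }
        std::thread::sleep(interval);
    }
    // Retries exhausted: the height stopped short of the target.
    Err(format!(
        "failed after {max_retries} retries at block {last_seen} of at least {target}"
    ))
}
```

With a height source that stays at 102 and a target of 122, this returns the `Err` variant once the retries run out, which is the shape of the failure in the log above.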

This is a normal test-ci-all run, which means I think we default to one peer being down from the start. So on restart of a peer we might be losing consensus in some way, and I think the smaller session time exposes us much more to some timing condition (another benefit of lowering session time in tests).

Unfortunately I don't have deeper logs from it. I'll try to repro locally.

dpc commented 2 weeks ago

@elsirion @joschisan might be of interest

elsirion commented 2 weeks ago

> I think we default to one peer being down from the start.

I explicitly turned that off for this test.

https://github.com/fedimint/fedimint/blob/623f6cc16007c099a59cc3fd47ea6c1d6cf95de7/scripts/tests/test-ci-all.sh#L97-L100

What's also interesting is that we first make some progress up to 102 and then stop 20 blocks short of the target, indicating a stuck consensus indeed.

…18:17:10.761438Z  INFO devimint::tests: Caught up to block 0 of at least 122
…18:17:11.110710Z  INFO devimint::tests: Caught up to block 102 of at least 122

elsirion commented 2 days ago

Have we seen this again recently?

dpc commented 2 days ago

I don't remember seeing it since last time there was some activity around it.