casper-network / casper-node

Reference client for CASPER protocol
https://casper.network
Apache License 2.0

Fix ~flaky~ failing tests in 1.6 branch #3806

Closed RitaMAllenCA closed 1 year ago

RitaMAllenCA commented 1 year ago

A set of three tests is currently failing fairly often on the feat-1.6 branch:

failures:
    components::storage::tests::check_force_resync_with_marker_file
    reactor::main_reactor::tests::empty_block_validation_regression
    reactor::main_reactor::tests::run_equivocator_network

However, some of them (at least check_force_resync_with_marker_file) succeed if rerun in isolation. These tests need to be reexamined, either to find a legitimate reason for the failures or to track down the source of their flakiness.

real-felix commented 1 year ago

I just ran the tests; for reference, the current state is:

failures:

---- reactor::main_reactor::tests::dont_upgrade_without_switch_block stdout ----
Running 'dont_upgrade_without_switch_block' test with rng=TestRng seed: ef117863cb47063d0aa7bdf0ff50e94b
thread 'reactor::main_reactor::tests::dont_upgrade_without_switch_block' panicked at 'network did not settle on condition within 90s', node/src/testing/network.rs:419:33

=======================================================================
Thread: reactor::main_reactor::tests::dont_upgrade_without_switch_block
To reproduce failure, try running with env var:
CL_TEST_SEED=ef117863cb47063d0aa7bdf0ff50e94b
=======================================================================

---- reactor::main_reactor::tests::nodes_should_have_enough_signatures_before_upgrade_with_one_dominant_stake stdout ----
thread 'reactor::main_reactor::tests::nodes_should_have_enough_signatures_before_upgrade_with_one_dominant_stake' panicked at 'network did not settle on condition within 90s', node/src/testing/network.rs:419:33

================================================================================================================
Thread: reactor::main_reactor::tests::nodes_should_have_enough_signatures_before_upgrade_with_one_dominant_stake
To reproduce failure, try running with env var:
CL_TEST_SEED=da3add1621a29dc7541bea4ff2c191d8
================================================================================================================

---- reactor::main_reactor::tests::empty_block_validation_regression stdout ----
thread 'reactor::main_reactor::tests::empty_block_validation_regression' panicked at 'network did not settle on condition within 300s', node/src/testing/network.rs:385:33

=======================================================================
Thread: reactor::main_reactor::tests::empty_block_validation_regression
To reproduce failure, try running with env var:
CL_TEST_SEED=768522d14f2a7a90bcf931ee10ca483d
=======================================================================

---- reactor::main_reactor::tests::run_equivocator_network stdout ----
thread 'reactor::main_reactor::tests::run_equivocator_network' panicked at 'network did not settle on condition within 360s', node/src/testing/network.rs:385:33

=============================================================
Thread: reactor::main_reactor::tests::run_equivocator_network
To reproduce failure, try running with env var:
CL_TEST_SEED=84943fdc77eb76d68dfad998c0a09144
=============================================================
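
For context, those panics come from the test harness's helper that waits for the network to settle: it repeatedly checks a condition on the test network and panics if the condition does not hold within the timeout. A minimal, synchronous sketch of that pattern (hypothetical names; the actual helper in node/src/testing/network.rs is async and event-driven):

```rust
use std::time::{Duration, Instant};

/// Polls `condition` until it returns true or `timeout` elapses.
/// Returns whether the condition was met in time.
fn settle_on<F: FnMut() -> bool>(mut condition: F, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if condition() {
            return true;
        }
        std::thread::sleep(Duration::from_millis(50));
    }
    false
}

fn main() {
    // A condition that never holds runs into the timeout, which the real
    // harness reports as "network did not settle on condition within ...".
    let settled = settle_on(|| false, Duration::from_millis(200));
    assert!(!settled);
}
```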

real-felix commented 1 year ago

I've looked at dont_upgrade_without_switch_block:

So even if the switch block is correctly completed, for some reason it never becomes time to shut down for the upgrade. I can't figure out why that is, because the code in should_shutdown_for_upgrade has not been changed for a while, and neither have any of the downstream calls.
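
For illustration, here is a rough sketch of the decision being discussed, using hypothetical types and logic rather than the actual casper-node code: conceptually, the node should shut down for an upgrade once it has a complete switch block for the era immediately preceding the upgrade's activation era.

```rust
// Hypothetical illustration only, not the casper-node implementation.
#[derive(Clone, Copy, PartialEq, Eq)]
struct EraId(u64);

struct NextUpgrade {
    activation_era: EraId,
}

struct NodeState {
    /// Era of the highest switch block known to be complete, if any.
    complete_switch_block_era: Option<EraId>,
    next_upgrade: Option<NextUpgrade>,
}

/// Conceptually: shut down once the switch block closing the era before
/// the upgrade's activation era is complete.
fn should_shutdown_for_upgrade(state: &NodeState) -> bool {
    match (&state.next_upgrade, &state.complete_switch_block_era) {
        (Some(upgrade), Some(switch_era)) => switch_era.0 + 1 >= upgrade.activation_era.0,
        _ => false,
    }
}

fn main() {
    // The puzzling failure mode: a state like this should return true,
    // yet in the failing test the shutdown never seems to be triggered.
    let state = NodeState {
        complete_switch_block_era: Some(EraId(4)),
        next_upgrade: Some(NextUpgrade { activation_era: EraId(5) }),
    };
    assert!(should_shutdown_for_upgrade(&state));
}
```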

real-felix commented 1 year ago

Ditching this for now to switch to consensus rewards tasks

rafal-ch commented 1 year ago

It looks like the root cause of these failures is that a network started from feat-1.6 is not able to progress correctly for more than 2 or 3 eras. It slows down more and more and nearly stops; that's why the tests don't see the expected messages and the timeout kicks in.

It seems that some messages exchanged between nodes are being lost, so maybe the backpressure mechanism is already kicking in, or muxink needs some tweaking.

@marc-casperlabs I think it would be good to have a look at this first, before adding new features. It may indicate that there is some fundamental problem lurking in the 1.6 networking code or configuration.

Attached are the logs from an NCTL run: logi.zip

cc: @RitaMAllenCA

marc-casperlabs commented 1 year ago

The root cause of the original bugs, and of some of the bugs found later on, was that when the networking component was changed to be aware of validators, the assumption was made that it would not be a huge problem if the set of validators was wrong: since it was only used for prioritization, at worst validators would get less preferred treatment. The edge case of no data being available was handled by a ValidatorMatrix::is_empty method: if it returned true, all peers were treated as equal.

In the meantime, ValidatorBroadcast was added, which suddenly made the ValidatorMatrix load-bearing where it was not before: if the matrix contents were not up to date, no validator would receive the information necessary to progress the chain. This was supposedly protected by the is_empty check, but in the meantime another change in behavior had crept in: as initial/background syncing got better, the chance of previous eras being present with "wrong" validators increased, causing the matrix to report non-emptiness while it still had no up-to-date information that actually mattered. For the era in question, it would simply return an empty set, again causing no data to be sent.
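
A minimal sketch of that failure mode, with hypothetical stand-ins for the real types (the actual ValidatorMatrix is more involved):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins, not the casper-node types.
type EraId = u64;
type PublicKey = String;

struct ValidatorMatrix {
    validators_per_era: BTreeMap<EraId, BTreeSet<PublicKey>>,
}

impl ValidatorMatrix {
    fn is_empty(&self) -> bool {
        self.validators_per_era.is_empty()
    }

    fn validators_for_era(&self, era: EraId) -> BTreeSet<PublicKey> {
        self.validators_per_era.get(&era).cloned().unwrap_or_default()
    }
}

fn main() {
    // Background sync has filled in some *past* eras, so the matrix is
    // no longer empty, but it knows nothing about the current era yet.
    let mut validators_per_era = BTreeMap::new();
    validators_per_era.insert(3, BTreeSet::from(["alice".to_string(), "bob".to_string()]));
    let matrix = ValidatorMatrix { validators_per_era };

    let current_era: EraId = 7;

    // The old guard only falls back to "treat all peers equally" when the
    // matrix is completely empty; here it is not empty, so the fallback
    // never triggers...
    assert!(!matrix.is_empty());

    // ...yet the era that matters yields an empty validator set, so the
    // validator broadcast reaches nobody and the chain cannot progress.
    assert!(matrix.validators_for_era(current_era).is_empty());
}
```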

The fix is to replace the is_empty check with an appropriate has_era check (see e8dbfd29f and 12fcdd69f). This, together with the replacement of muxink with juliet, fixes all issues listed in the original description of this ticket. Further bugs/issues exist, but they are considered out of scope for this particular ticket.
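
Under the same hypothetical stand-ins as above, the fixed gating might look roughly like this (a sketch mirroring the described has_era check, not the code in the linked commits):

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins, not the casper-node types.
type EraId = u64;
type PublicKey = String;

struct ValidatorMatrix {
    validators_per_era: BTreeMap<EraId, BTreeSet<PublicKey>>,
}

impl ValidatorMatrix {
    fn has_era(&self, era: EraId) -> bool {
        self.validators_per_era.contains_key(&era)
    }

    fn validators_for_era(&self, era: EraId) -> BTreeSet<PublicKey> {
        self.validators_per_era.get(&era).cloned().unwrap_or_default()
    }
}

/// Pick the targets for a validator broadcast in a given era.
fn broadcast_targets(
    matrix: &ValidatorMatrix,
    era: EraId,
    all_peers: &BTreeSet<PublicKey>,
) -> BTreeSet<PublicKey> {
    if matrix.has_era(era) {
        // We have validator info for this specific era: target the validators.
        matrix.validators_for_era(era)
    } else {
        // No info for this era: fall back to all peers instead of
        // silently sending to nobody.
        all_peers.clone()
    }
}

fn main() {
    let matrix = ValidatorMatrix { validators_per_era: BTreeMap::new() };
    let all_peers = BTreeSet::from(["peer-a".to_string(), "peer-b".to_string()]);
    // With no data for era 7, the broadcast now falls back to every peer
    // rather than going out to an empty set.
    assert_eq!(broadcast_targets(&matrix, 7, &all_peers), all_peers);
}
```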

devendran-m commented 7 hours ago

Housekeeping: Removed the duplicate/redundant release-blocker tag.