Closed RitaMAllenCA closed 1 year ago
I just ran the tests, for reference, the current state is:
failures:
---- reactor::main_reactor::tests::dont_upgrade_without_switch_block stdout ----
Running 'dont_upgrade_without_switch_block' test with rng=TestRng seed: ef117863cb47063d0aa7bdf0ff50e94b
thread 'reactor::main_reactor::tests::dont_upgrade_without_switch_block' panicked at 'network did not settle on condition within 90s', node/src/testing/network.rs:419:33
=======================================================================
Thread: reactor::main_reactor::tests::dont_upgrade_without_switch_block
To reproduce failure, try running with env var:
CL_TEST_SEED=ef117863cb47063d0aa7bdf0ff50e94b
=======================================================================
---- reactor::main_reactor::tests::nodes_should_have_enough_signatures_before_upgrade_with_one_dominant_stake stdout ----
thread 'reactor::main_reactor::tests::nodes_should_have_enough_signatures_before_upgrade_with_one_dominant_stake' panicked at 'network did not settle on condition within 90s', node/src/testing/network.rs:419:33
================================================================================================================
Thread: reactor::main_reactor::tests::nodes_should_have_enough_signatures_before_upgrade_with_one_dominant_stake
To reproduce failure, try running with env var:
CL_TEST_SEED=da3add1621a29dc7541bea4ff2c191d8
================================================================================================================
---- reactor::main_reactor::tests::empty_block_validation_regression stdout ----
thread 'reactor::main_reactor::tests::empty_block_validation_regression' panicked at 'network did not settle on condition within 300s', node/src/testing/network.rs:385:33
=======================================================================
Thread: reactor::main_reactor::tests::empty_block_validation_regression
To reproduce failure, try running with env var:
CL_TEST_SEED=768522d14f2a7a90bcf931ee10ca483d
=======================================================================
---- reactor::main_reactor::tests::run_equivocator_network stdout ----
thread 'reactor::main_reactor::tests::run_equivocator_network' panicked at 'network did not settle on condition within 360s', node/src/testing/network.rs:385:33
=============================================================
Thread: reactor::main_reactor::tests::run_equivocator_network
To reproduce failure, try running with env var:
CL_TEST_SEED=84943fdc77eb76d68dfad998c0a09144
=============================================================
I've looked at dont_upgrade_without_switch_block
:
MainReactor::should_shutdown_for_upgrade
never returns true.ActivationPoint::should_upgrade
, era_being_deactivated.successor()
is never greater than era_id
.So even if the switch block is correctly completed, for some reason, it's never time to shutdown for the upgrade. I can figure out why that's the case, because the code in should_shutdown_for_upgrade
has not been changed for a while, same in all the downstream calls.
Ditching this for now to switch to consensus rewards tasks
It looks like the root-cause of these failures is connected to the fact that the network started from feat-1.6
is not able to progress correctly more than 2 or 3 eras. It slows down more and more and nearly stops. That's why the test don't get the expected messages and timeout kicks in.
Seems like some messages exchanged between nodes are lost; so maybe there is already the backpressure mechanism kicking in or the muxink needs some tweaking.
@marc-casperlabs I think it would be good to have a look at this first, before adding new features. This may indicate that there's some fundamental problem lurking in 1.6 networking code or configuration.
Attached please find logs from an NCTL run. logi.zip
cc: @RitaMAllenCA
The root cause for the original bugs and some of the bugs later on was that when the networking component was changed to be aware of validators, the assumptions was made that it would not be a huge problem if the set of validators was wrong: Since it was only used for prioritization, at worst, validators would get a less preferred treatment. The edge case of no data available was handled by a ValidatorMatrix::is_empty
method, if it returned true
, all peers were treated as equal.
In the meantime, ValidatorBroadcast
was added, which suddenly made the ValidatorMatrix
load bearing where it was not before - if the matrix contents were not up to date, no validator would receive necessary information to progress the chain. This was potentially protected by the is_empty
check, however in the meantime another change in behavior crept in: As initial/background syncing got better, the chance of previous eras being present with "wrong" validators increased, causing the matrix to report non-emptiness while it still had no up-to-date info that actually mattered: For the era in question, it would simply return an empty set, causing again no data to be sent.
The fix is to replace the is_empty
check with the appropriate has_era
check instead (see e8dbfd29f and 12fcdd69f). This, and the replacement of muxink
with juliet
, fixes all issues in the original issue description of this ticket. Further bugs/issues exist, but are considered out-of-scope for this particular ticket.
Housekeeping: Removed the duplicate/redundant release-blocker
tag.
A set of three tests are currently often failing on the
feat-1.6
branch:failures: components::storage::tests::check_force_resync_with_marker_file reactor::main_reactor::tests::empty_block_validation_regression reactor::main_reactor::tests::run_equivocator_network
However, some of them (at leastcheck_force_resync_with_marker_file
) will succeed if rerun in isolation. These tests need to be reexamined, either to find the reason for them failing legitimately or their flakiness.