apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

Fix and re-enable DataLossRecovery test #5850

Closed sfc-gh-jslocum closed 2 years ago

sfc-gh-jslocum commented 2 years ago

DataLossRecovery has been failing regularly in the Snowflake nightlies. It was disabled in https://github.com/apple/foundationdb/issues/5847 (PR https://github.com/apple/foundationdb/pull/5848) until the issues with the test can be fixed.

Once the issues are fixed and DataLossRecovery passes a sufficient number of Joshua runs with no failures, we should re-enable the test on master.

liquid-helium commented 2 years ago

The test is failing with the following event sequence:

  1. Keyrange [TestKey, TestKey0) is moved to a single storage server SS1
  2. SS1 is killed and the SS file is deleted
  3. SS1 is excluded as failed
  4. The addresses of SS1 are marked as 'FAIL' in TC's excludedServers
  5. The StorageServerTracker for SS1 is supposed to notice that SS1 is marked as failed and throw a movekeys_conflict, which triggers the server removal process. This never happened: when the SS1 tracker checked excludedServers a while later, it did not see the FAIL status. The status had been reset somehow. I traced all invocations of excludedServers.set() but could not find what reset it.
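
For readers unfamiliar with the flow, steps 4 and 5 can be modeled roughly as below. This is only a toy Python sketch of the expected behavior, not FoundationDB code: the class and method names (`TeamCollection`, `mark_failed`, `tracker_check`, `poll`) and the `MoveKeysConflict` exception are hypothetical stand-ins for the TC's excludedServers map, the StorageServerTracker check, and the movekeys_conflict error.

```python
from enum import Enum


class Status(Enum):
    NONE = 0
    FAILED = 1


class MoveKeysConflict(Exception):
    """Hypothetical stand-in for the movekeys_conflict error."""


class TeamCollection:
    """Toy model of a TC: tracked servers plus an excludedServers-like map."""

    def __init__(self):
        self.servers = {}           # server id -> address
        self.excluded_servers = {}  # address -> Status

    def add_server(self, server_id, address):
        self.servers[server_id] = address

    def mark_failed(self, address):
        # Step 4: the address is marked as FAIL in the exclusion map.
        self.excluded_servers[address] = Status.FAILED

    def tracker_check(self, server_id):
        # Step 5: the tracker looks up its own address and raises
        # if it finds the FAIL status.
        address = self.servers[server_id]
        if self.excluded_servers.get(address, Status.NONE) is Status.FAILED:
            raise MoveKeysConflict(server_id)

    def poll(self, server_id):
        # The conflict is what triggers removal of the server.
        # Returns True if the server was removed on this check.
        try:
            self.tracker_check(server_id)
        except MoveKeysConflict:
            del self.servers[server_id]
            return True
        return False
```

In this model, a server is only removed if some tracker actually runs `poll` after the status is set and the status is still FAIL at that point. If either condition fails, the server stays in place, which matches the symptom described in step 5.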

liquid-helium commented 2 years ago

It turns out the remote TeamCollection never started because the remote DC was not fully recovered. I will disable HA mode for the test, since the feature was never meant for HA anyway.

liquid-helium commented 2 years ago

The fix was checked in and the test was re-enabled: https://github.com/apple/foundationdb/pull/5868