apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

Drop keyrange after data loss #5604

Closed liquid-helium closed 3 years ago

liquid-helium commented 3 years ago

When a team is lost, e.g., because its machines or disks are gone, we want to bring the cluster back into a consistent state.

  1. The failed servers need to be removed, which can be done with the fdbcli exclude failed command.

  2. In addition, the key range should be emptied and assigned to a new team.

After these two operations, a restore can be performed.
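Step 1 can be scripted. The sketch below only builds the fdbcli invocation and does not run it. The helper name and the server addresses are made up for illustration, and it assumes that fdbcli is on the PATH and accepts an exclude failed command through --exec.

```python
import shlex

def build_exclude_failed_command(addresses):
    """Return the argv that marks the given servers as failed (hypothetical helper)."""
    if not addresses:
        raise ValueError("need at least one server address")
    # fdbcli runs the string passed to --exec as a single CLI command.
    return ["fdbcli", "--exec", "exclude failed " + " ".join(addresses)]

# Example addresses only; substitute the servers of the lost team.
cmd = build_exclude_failed_command(["10.0.0.1:4500", "10.0.0.2:4500"])
print(shlex.join(cmd))
```

Printing the command first makes it easy to review which servers will be marked as failed before running anything against a live cluster.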

Currently, step 1 is in place. However, after all servers of a team are excluded, the key range owned by that team becomes unavailable and cannot be moved, since all source servers are gone. Also, DD will end up crash-looping.
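To make the gap concrete, here is a toy model of the situation. It is not FoundationDB's actual data structures: Shard, exclude_failed, can_move, and drop_key_range are all hypothetical names, and a team is reduced to a set of server ids. It shows why a normal move is impossible once every source server is excluded, and what the requested step 2 would do instead.

```python
from dataclasses import dataclass, field

@dataclass
class Shard:
    begin: bytes
    end: bytes
    team: frozenset                      # ids of the servers holding this range
    data: dict = field(default_factory=dict)

def exclude_failed(shards, failed):
    """Toy version of step 1: remove the failed servers from every team."""
    for shard in shards:
        shard.team = shard.team - failed

def can_move(shard):
    # A normal move copies data from a surviving source server,
    # so it needs at least one server left in the team.
    return len(shard.team) > 0

def drop_key_range(shard, new_team):
    """Toy version of step 2: give up the data and hand the empty range to a new team."""
    if not new_team:
        raise ValueError("new team must not be empty")
    shard.data = {}
    shard.team = frozenset(new_team)

shards = [
    Shard(b"a", b"m", frozenset({"ss1", "ss2", "ss3"}), {b"b": b"1"}),
    Shard(b"m", b"z", frozenset({"ss4", "ss5", "ss6"}), {b"n": b"2"}),
]

exclude_failed(shards, frozenset({"ss1", "ss2", "ss3"}))
stuck = [s for s in shards if not can_move(s)]   # ranges with no source left
for shard in stuck:
    drop_key_range(shard, {"ss4", "ss5", "ss6"})
```

After the drop, every range has a non-empty team again, so a restore has somewhere to write the lost keys.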

dongxinEric commented 3 years ago

> Also, DD will end up crash-looping.

Well, why would DD crash when a key range's data is nuked because all servers in a team are lost? That sounds like a bug. It should scream about that, but not crash.

liquid-helium commented 3 years ago

> Also, DD will end up crash-looping.

> Well, why would DD crash when a key range's data is nuked because all servers in a team are lost? That sounds like a bug.

When the initial teams are loaded, empty source servers are not checked, so there can be a team without any source servers. The TeamTracker then considers that team unhealthy and tries to get the SS's info, and that is when it crashes.
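As a rough illustration of the missing check (a sketch only, not the actual DD code; load_initial_teams, first_server_info, and EmptyTeamError are hypothetical names), the loader could separate out ranges without source servers and report them, and the tracker could raise a specific error instead of dereferencing a server that does not exist:

```python
import logging

log = logging.getLogger("dd_sketch")

class EmptyTeamError(Exception):
    """Raised when a key range has no source servers left."""

def load_initial_teams(shard_to_sources):
    """Split ranges into usable teams and ranges with no source servers."""
    teams, empty_ranges = {}, []
    for key_range, sources in shard_to_sources.items():
        if sources:
            teams[key_range] = list(sources)
        else:
            # Report loudly, but keep loading the remaining ranges.
            log.error("key range %r has no source servers", key_range)
            empty_ranges.append(key_range)
    return teams, empty_ranges

def first_server_info(teams, key_range):
    """Stand-in for the tracker's lookup; fails with a clear error, not a crash."""
    sources = teams.get(key_range)
    if not sources:
        raise EmptyTeamError(f"no source servers for {key_range!r}")
    return sources[0]

teams, empty_ranges = load_initial_teams(
    {("a", "m"): [], ("m", "z"): ["ss4", "ss5"]}
)
```

Here empty_ranges collects the ranges an operator would have to deal with explicitly, while the healthy teams are tracked as usual.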

> It should scream about that, but not crash.

That's a good point. Let me create another bug to fix that.

One follow-up question: do we expect it to be an invariant that teams are never empty? If we allow empty teams to exist, then DD can handle this situation gracefully, i.e., by moving the empty range to another team. However, it is not easy to tell whether a team is empty because of a bug or because of human operations. Hence, it might make sense to treat an empty team as an error. What do you think?

liquid-helium commented 3 years ago

Here is the new issue: https://github.com/apple/foundationdb/issues/5617