apple / foundationdb


Data movement can't begin on clusters missing data #3774

Open ajbeamon opened 4 years ago

ajbeamon commented 4 years ago

When a cluster loses all replicas of a shard and the data distributor later restarts, the data distributor gets stuck trying to track the initial shards. (Note: I'm not certain whether this is universally true or whether it requires other conditions to hold.) As a result, no data movement can happen, even for the data that still exists in the cluster.

It would be better if data movement could continue on the shards that remain, which in some circumstances could help keep the blast radius of this failure from growing.

xumengpanda commented 4 years ago

If shards in the system keyspace are lost, DD (the data distributor) will get stuck because it cannot recover its state, such as the shard-to-SS (storage server) mapping. If only normal shards are lost, DD should work in theory.

I think the best way to do this is to reproduce the problem in simulation first.
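
To make the distinction above concrete before attempting a reproduction, here is a minimal toy model in Python. It is not FoundationDB code and does not reflect how the data distributor is actually implemented. The names `Shard`, `is_system`, and `plan_initial_tracking` are hypothetical, and the model assumes only that system keys begin with the byte `\xff` and that a shard is "lost" when it has zero live replicas. It sketches the behavior requested in this issue: skip lost normal shards and keep tracking the rest, but refuse to start if a lost shard overlaps the system keyspace.

```python
from dataclasses import dataclass

# Assumption for this sketch: system keys are those starting with 0xFF.
SYSTEM_PREFIX = b"\xff"


@dataclass(frozen=True)
class Shard:
    begin: bytes   # inclusive start of the key range
    end: bytes     # exclusive end of the key range
    replicas: int  # number of live storage servers holding this range


def is_system(shard: Shard) -> bool:
    # The range [begin, end) overlaps the system keyspace if it extends past \xff.
    return shard.end > SYSTEM_PREFIX


def plan_initial_tracking(shards: list[Shard]) -> dict:
    """Decide which shards a (hypothetical) distributor could track on restart."""
    lost = [s for s in shards if s.replicas == 0]
    if any(is_system(s) for s in lost):
        # State such as the shard-to-SS mapping cannot be recovered: stay stuck.
        return {"can_start": False, "trackable": [], "skipped": lost}
    # Only normal shards are lost: track everything that still has a replica.
    trackable = [s for s in shards if s.replicas > 0]
    return {"can_start": True, "trackable": trackable, "skipped": lost}


if __name__ == "__main__":
    shards = [
        Shard(b"", b"m", 3),
        Shard(b"m", b"\xff", 0),          # a lost normal shard
        Shard(b"\xff", b"\xff\xff", 3),   # system keyspace, still available
    ]
    plan = plan_initial_tracking(shards)
    print(plan["can_start"], len(plan["trackable"]), len(plan["skipped"]))
```

A simulation test for the real system would need to confirm whether the data distributor behaves like the first branch even when only normal shards are lost, which is what the original report suggests.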