Open ajbeamon opened 4 years ago
If the system keyspace's shards are lost, the data distributor (DD) will get stuck because it cannot recover its state, such as the shard-to-storage-server (SS) mapping. If only normal shards are lost, DD should work in theory.
I think the best way to do this is to reproduce the problem in simulation first.
When a cluster loses all replicas of a shard and the data distributor later restarts, it gets stuck trying to track the initial shards (Note: I'm not certain if this is universally true or if it requires other properties to hold). As a result, no data movement can happen, even for the data that still exists in the cluster.
It would be better if data movement could continue on the shards that remain, which could help limit the blast radius of this failure case in some circumstances.
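
To make the requested behavior concrete, here is a toy Python model of the idea. It is not FoundationDB code: the `Shard` class, `partition_initial_shards`, and the server IDs are all hypothetical names made up for illustration. It only sketches the intent, assuming the shard-to-server mapping itself is still readable: shards with no live replica are set aside rather than blocking tracking of everything else.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    # Hypothetical representation: (begin, end) keys of the shard.
    key_range: tuple
    # IDs of the storage servers believed to hold this shard.
    replicas: list = field(default_factory=list)


def partition_initial_shards(shards, live_servers):
    """Split shards into those with at least one live replica and those with none."""
    trackable, lost = [], []
    for shard in shards:
        has_live_replica = any(s in live_servers for s in shard.replicas)
        (trackable if has_live_replica else lost).append(shard)
    return trackable, lost


shards = [
    Shard(("a", "g"), ["ss1", "ss2"]),
    Shard(("g", "n"), ["ss3"]),        # its only replica is on a dead server
    Shard(("n", "z"), ["ss2", "ss4"]),
]
live_servers = {"ss1", "ss2", "ss4"}

trackable, lost = partition_initial_shards(shards, live_servers)
print("trackable:", [s.key_range for s in trackable])
print("lost:", [s.key_range for s in lost])
```

In this model, data movement would proceed for the `trackable` shards while the `lost` ones are reported separately instead of stalling startup.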