Automatically re-balance under-replicated partitions

pereferrera commented 11 years ago

Splout SQL is CAP : because data is indexed offline and served in a read-only fashion, and because we have replication for fail-over, everything inside the CAP theorem is met.

However, we can't say that Splout SQL is exactly "highly" available. It is true that if one machine dies and there are other replicas, the partitions in that replica can still be served. But eventually, if the node is not manually repaired, the partition may run out of replicas (more machines die) and not be available anymore.

The idea is to implement a simple mechanism for transferring partition data from one replica to another (newly designated) replica. So if one partition is under-replicated, a new replica can be created on the fly. Some things to keep in mind:

The decission has to be somehow centralized (maybe through Hazelcast?) so that we prevent every node from start pulling data from the partition - just one node should do it.
The process has to be coordinated so that if it fails, another one is tried - probably through Hazelcast too.
Care has to be taken not to run out of disk in DNodes, but I'm not sure what we need to do exactly for this.
If the dead node is then repaired, the partition might be over-replicated. Not sure if we need to do something about it.

pereferrera commented 11 years ago

There should be some (configurable) waiting time before initiating a copy, otherwise multiple copies might be triggered at cluster startup, when all partitions will be under-replicated.

pereferrera commented 11 years ago

This is for another issue, but we should have some way of detecting corrupted binary files, otherwise implementing this feature might cause corrupted files to reproduce around the cluster.

pereferrera commented 11 years ago

And it doesn't seem like we can reuse the Thrift transport for sending files between DNodes: http://grokbase.com/t/thrift/user/11cj1traxn/large-file-transfer

pereferrera commented 11 years ago

This is now closed - it has been tested and documented carefully. By now it will be disabled by default, as an experimental feature. It can be enabled by configuration.

One thing not resolved is about disk space: DNodes may run out of space after being copied some new replicas. But I didn't find a satisfying way of solving this. The way would be for DNodes to inform about disk space peridiocally (for example publishing in a centralized data structure in Hazelcast): I think we can open this later as a new issue.

By now, enabling or disabling replica balancing should be enough I think.

datasalt / splout-db

Automatically re-balance under-replicated partitions #12