datasalt / splout-db

A web-latency SQL spout for Hadoop.

Automatically re-balance under-replicated partitions #12

Closed pereferrera closed 11 years ago

pereferrera commented 11 years ago

Splout SQL is CAP: because data is indexed offline and served read-only, and because we have replication for fail-over, every property in the CAP theorem is met.

However, we can't say that Splout SQL is exactly "highly" available. It is true that if one machine dies and there are other replicas, the partitions on that machine can still be served. But if the node is not manually repaired and more machines die, a partition may eventually run out of replicas and become unavailable.

The idea is to implement a simple mechanism for transferring partition data from one replica to another (newly designated) replica. So if one partition is under-replicated, a new replica can be created on the fly. Some things to keep in mind:

pereferrera commented 11 years ago

There should be some (configurable) waiting time before initiating a copy, otherwise multiple copies might be triggered at cluster startup, when all partitions will be under-replicated.
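
A minimal sketch of such a waiting time, in Java. The class name `UnderReplicationTracker` and its method are hypothetical and not part of Splout's code base; the idea is only that a copy is allowed once a partition has been seen under-replicated for the whole configured period, and that the timer resets when the replicas come back.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: decides when an under-replicated partition may be copied. */
class UnderReplicationTracker {
    // Configurable waiting time before a copy may start (assumed setting).
    private final Duration waitingTime;
    // First instant at which each partition was observed under-replicated.
    private final Map<Integer, Instant> firstSeen = new HashMap<>();

    UnderReplicationTracker(Duration waitingTime) {
        this.waitingTime = waitingTime;
    }

    /**
     * Records one observation of a partition. Returns true only when the
     * partition has been continuously under-replicated for the waiting time.
     */
    boolean shouldCopy(int partition, int liveReplicas, int expectedReplicas, Instant now) {
        if (liveReplicas >= expectedReplicas) {
            // Fully replicated again (e.g. nodes finished starting up): reset.
            firstSeen.remove(partition);
            return false;
        }
        Instant since = firstSeen.computeIfAbsent(partition, p -> now);
        return !now.isBefore(since.plus(waitingTime));
    }
}
```

With a waiting time longer than the usual cluster startup, partitions that are only briefly under-replicated while DNodes join would never trigger a copy.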

pereferrera commented 11 years ago

This is for another issue, but we should have some way of detecting corrupted binary files. Otherwise, implementing this feature might cause corrupted files to spread across the cluster.
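
One possible way to detect corruption, sketched here under assumptions: compute a checksum when the file is generated and compare it before and after each copy. CRC32 from the Java standard library is used purely as an illustration; the class `PartitionFileChecksum` is hypothetical and nothing here reflects how Splout actually verifies files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

/** Hypothetical helper: checksums a partition file so a bad copy can be rejected. */
class PartitionFileChecksum {

    /** Streams the file through a CRC32 so large files are never held in memory. */
    static long crc32(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file);
             CheckedInputStream checked = new CheckedInputStream(in, new CRC32())) {
            byte[] buffer = new byte[8192];
            while (checked.read(buffer) != -1) {
                // Reading is enough: the stream updates the checksum as it goes.
            }
            return checked.getChecksum().getValue();
        }
    }

    /** Returns true when the file still matches the checksum recorded for it. */
    static boolean matches(Path file, long expectedCrc32) throws IOException {
        return crc32(file) == expectedCrc32;
    }
}
```

A sending DNode could refuse to transfer a file that no longer matches its recorded checksum, and a receiving DNode could discard a file that fails the check after transfer. CRC32 only catches accidental corruption; a stronger hash would be needed for anything else.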

pereferrera commented 11 years ago

And it doesn't seem like we can reuse the Thrift transport for sending files between DNodes: http://grokbase.com/t/thrift/user/11cj1traxn/large-file-transfer

pereferrera commented 11 years ago

This is now closed: it has been carefully tested and documented. For now it is disabled by default, as an experimental feature. It can be enabled through configuration.

One thing left unresolved is disk space: DNodes may run out of space after new replicas are copied to them. I didn't find a satisfying way of solving this. One way would be for DNodes to report their free disk space periodically (for example, by publishing it in a centralized data structure in Hazelcast). I think we can open this later as a new issue.
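
A rough sketch of that idea, not an implementation: each DNode publishes its usable disk space into a shared map, and whoever picks a copy target skips nodes that cannot fit the replica. The class `DiskSpaceRegistry` is hypothetical. A plain `ConcurrentHashMap` stands in for the shared structure here; since Hazelcast's `IMap` is a `ConcurrentMap`, a distributed map could presumably be passed in instead, but that wiring is an assumption.

```java
import java.io.File;
import java.util.Map;
import java.util.Optional;

/** Hypothetical registry of free disk space per DNode, backed by a shared map. */
class DiskSpaceRegistry {
    // DNode id -> last reported usable bytes.
    private final Map<String, Long> freeBytesByNode;

    DiskSpaceRegistry(Map<String, Long> sharedMap) {
        this.freeBytesByNode = sharedMap;
    }

    /** Called periodically by each DNode to report the space left in its data directory. */
    void publish(String dnodeId, File dataDir) {
        publish(dnodeId, dataDir.getUsableSpace());
    }

    /** Same as above, with the value supplied directly. */
    void publish(String dnodeId, long freeBytes) {
        freeBytesByNode.put(dnodeId, freeBytes);
    }

    /** Picks the node with the most free space that can still hold the replica, if any. */
    Optional<String> pickTarget(long replicaSizeBytes) {
        return freeBytesByNode.entrySet().stream()
                .filter(e -> e.getValue() >= replicaSizeBytes)
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

Reported values go stale between publications, so a real version would still need the receiving DNode to re-check its own space before accepting a copy.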

For now, being able to enable or disable replica balancing should be enough, I think.