Netflix / dynomite

A generic dynamo implementation for different k-v storage engines

Data out of sync issue #649

Closed sekhrivijay closed 5 years ago

sekhrivijay commented 5 years ago

Data sync between nodes and consistency are managed by Dynomite and are configurable using the Consistency options. There is also an auto-warming option so that a node that was down and does not have all the correct data can be rebuilt, including on other cloud providers.
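For reference, a minimal sketch of the relevant dynomite.yml settings; the consistency fields and DC_ONE/DC_QUORUM values follow the Dynomite docs, while the hostnames, ports, and tokens here are hypothetical placeholders:

```yaml
dyn_o_mite:
  datacenter: us-east-1            # region this node lives in
  rack: rack1                      # replica group within the region
  listen: 0.0.0.0:8102             # client-facing port
  dyn_listen: 0.0.0.0:8101         # inter-node (peer) port
  dyn_seeds:                       # hypothetical peers, host:port:rack:dc:tokens
    - n1r2.example.com:8101:rack2:us-east-1:12345678
  servers:
    - 127.0.0.1:6379:1             # local Redis backend
  tokens: '12345678'
  read_consistency: DC_QUORUM      # quorum reads within the local region
  write_consistency: DC_QUORUM     # quorum writes within the local region
```

Note that these options govern how many replicas must acknowledge a request; they do not by themselves repair a replica that silently missed writes, which is the situation described below.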

However, even if a node is not down, a network failure (or some other failure) can occur. Dynomite asynchronously tries to sync data across nodes and retries on failures. With 4 racks running in 2 k8s clusters across 2 regions, I observe that the data gets out of sync between racks. This scenario occurs when one of the nodes is not down per se, but some other failure happened (network issues, a bounced pod, cluster reallocation, etc.). In this case the Dynomite node that got the write request tries to sync the data to the peer node, retries, and eventually gives up. There does not seem to be any queuing mechanism in Dynomite that keeps track of what checkpoint each node is at, or of what failures occurred, so that it can replay the missed writes.

Very simply, say n1r1 tries to sync data to n1r2. It fails for any number of reasons. It retries, fails again, and gives up. Now n1r2 is out of sync. This causes cascading failures in Netflix Conductor as well. To avoid this I have started to write a cron job that keeps track of all nodes and their Redis keys/offsets every couple of minutes and triggers a custom auto-warming if the keys/offsets don't match (see the sketch below). This feels like building a syncing solution that should have been inherent to Dynomite itself. Am I missing something here, or is there a better option to keep all the nodes in sync? Note that there is no guarantee a node will be healthy all the time.
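For illustration, a minimal sketch of such a check, assuming one Redis backend per Dynomite node and using redis-py; the node list, the DBSIZE-based comparison, and the trigger_warmup hook are all hypothetical stand-ins for whatever the real cron job would do:

```python
import redis

# Hypothetical map of Dynomite nodes to their backing Redis endpoints.
NODES = {
    "n1r1": ("n1r1.example.com", 6379),
    "n1r2": ("n1r2.example.com", 6379),
}

def key_counts():
    """Collect DBSIZE from each node's Redis backend as a rough sync signal."""
    counts = {}
    for name, (host, port) in NODES.items():
        client = redis.Redis(host=host, port=port, socket_timeout=2)
        counts[name] = client.dbsize()
    return counts

def trigger_warmup(node):
    """Placeholder for the custom auto-warming step (e.g. copy from a healthy peer)."""
    print(f"warming {node} from a healthy peer")

def check():
    counts = key_counts()
    healthy = max(counts.values())
    for node, count in counts.items():
        if count < healthy:  # node is missing keys relative to its peers
            trigger_warmup(node)

if __name__ == "__main__":
    check()  # run from cron every couple of minutes
```

DBSIZE is only a coarse proxy (two nodes can hold the same number of keys yet differ in content), which is part of why a repair process native to Dynomite would be preferable.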

smukil commented 5 years ago

@sekhrivijay We're working on natively including a repair process. The following PR is the first step towards that. It's still unsupported, so it does have limitations, but over time we will make it part of normal Dynomite deployments: https://github.com/Netflix/dynomite/pull/653

Eventually, Dynomite will offer support to self-heal across replicas within a region, and over time even across regions.