basho / riak_repl

Riak DC Replication
Apache License 2.0

Repair behaviour - AAE fullsync #773

Open martinsumner opened 7 years ago

martinsumner commented 7 years ago

There are two potentially curious aspects of repair behaviour with AAE full-sync. This may be a false reading of the code but:

For the first part, this appears to be a consequence of storing only the hashes in the AAE store, not the clocks, so AAE has no way of determining which side is up-to-date. Resolving this may require significant change, which makes it a design issue rather than an implementation issue.

For the second part, the bloom is generated here: https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L379-L386. The 5% limit is defined here: https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L292.

The actual transition between using random reads and a fold is defined here: https://github.com/basho/riak_repl/blob/develop/src/riak_repl_aae_source.erl#L543-L582

So if you have 1M keys in the vnode and 50,001 differences, I think it will fix 50K differences through random reads, and resolve the last difference by creating a bloom and folding over all the objects. Since you would expect the differences to be randomly distributed across the segments of the AAE tree, it seems plausible that the decision could be made earlier: after a sample of, say, 1000 random reads, it could be predicted that the 5% limit is likely to be breached, and the bloom approach invoked at that point.
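
To make the arithmetic concrete, here is a rough Python model of the two ideas above. It is only a sketch of my reading of the behaviour, not the Erlang in riak_repl_aae_source.erl: the function names, the `min_sample` parameter and the segment counts are all hypothetical, and the extrapolation assumes differences are spread evenly across segments.

```python
def split_repairs(total_keys, differences, limit_percent=5):
    """Model of the behaviour as I read it: differences are repaired by
    random reads up to limit_percent of the keys, and everything past
    that limit is handled by building a bloom and folding."""
    limit = total_keys * limit_percent // 100
    via_random_reads = min(differences, limit)
    via_bloom_fold = differences - via_random_reads
    return via_random_reads, via_bloom_fold


def should_switch_early(diffs_seen, segments_compared, total_segments,
                        total_keys, limit_percent=5, min_sample=1000):
    """Hypothetical early decision: once min_sample differences have been
    seen, extrapolate from the segments compared so far and switch to the
    bloom approach if the projection breaches the limit."""
    if diffs_seen < min_sample or segments_compared == 0:
        return False
    # Assumes differences are evenly distributed across segments.
    projected = diffs_seen * total_segments // segments_compared
    return projected * 100 > total_keys * limit_percent


# The example from this comment: 1M keys, 50,001 differences.
print(split_repairs(1_000_000, 50_001))
# 1000 differences after comparing 1% of (hypothetically) 1M segments.
print(should_switch_early(1000, 10_000, 1_000_000, 1_000_000))
```

With these inputs the first call splits the work into 50,000 random reads plus a fold for a single difference, and the second projects 100,000 differences, so the switch to the bloom would happen after 1000 reads instead of 50,000.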