basho / riak_repl

Riak DC Replication
Apache License 2.0

Do something more useful than incr dropped_count on realtime failure? #238

Open nathanaschbacher opened 11 years ago

nathanaschbacher commented 11 years ago

This is a sore point for customers using realtime replication.

The conversation usually goes like this:

Basho: "Use riak-repl status to monitor for replication failures by checking dropped_count" Customer: "How does that work?" Basho: "When an object fails to replicate you'll see that counter increase." Customer: "Will Riak retry it automatically, or do I have to manually handle the failed objects?" Basho: "Neither. Riak won't retry it automatically and you'll also have no way of knowing which object failed, so you won't be able to handle it manually." Customer (looking confused): "So then how do I repair a realtime replication failure?" Basho: "You have to wait for your next successful full-sync." Customer: "How long does that take?" Basho (said as deadpan as possible): "Well, possibly hours or days." Customer: "Ha! No really, how do I... er... oh... wait... you're serious." Basho: "... ..."

We're going through the trouble of incrementing a counter on failure; what can we do to improve this monitoring and recovery scenario? At a bare minimum, logging the failed key would be a vast improvement, so the customer could at least attempt to deal with failures proactively. Ideally these failures would be journaled and retried automatically until some threshold for permanent failure is reached, then logged so the customer can elect to deal with them.
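Roughly, the retry-until-threshold idea could look like the sketch below. Everything in it is hypothetical: the module, the Send fun, and the retry limit are placeholders, not riak_repl APIs.

```erlang
%% Hypothetical sketch of the retry-then-log idea above. Nothing here
%% exists in riak_repl; the Send fun and the retry threshold stand in
%% for whatever the realtime sink would actually expose.
-module(rt_retry_sketch).
-export([deliver/3]).

-define(MAX_RETRIES, 5).

deliver(Send, BKey, Obj) ->
    deliver(Send, BKey, Obj, 0).

%% Give up after ?MAX_RETRIES attempts and log the bucket/key so an
%% operator can repair it by hand instead of waiting for a full-sync.
deliver(_Send, {Bucket, Key}, _Obj, N) when N >= ?MAX_RETRIES ->
    lager:warning("realtime repl dropped ~p/~p after ~p attempts",
                  [Bucket, Key, N]),
    {error, dropped};
deliver(Send, BKey, Obj, N) ->
    case Send(Obj) of
        ok -> ok;
        {error, _Reason} -> deliver(Send, BKey, Obj, N + 1)
    end.
```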

At any rate, this is a common enterprise customer pain point, and it seems eminently improvable if not completely curable.

Can we do anything about this?

jonmeredith commented 11 years ago

It's software, we can do things about it.

With the BNW replication, if the WAN link is down we'll buffer in memory up to the limit of the realtime queue (RTQ), so those objects will be sent when the data centers reconnect after a brief outage.

If the WAN link is slow, we'll buffer up to the RTQ size then start dropping. We could write to disk at this point, but if the load is constant relative to the performance of the WAN then you'll never catch up.
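As a toy illustration of that buffer-then-drop behaviour (this is not the actual RTQ code, just a bounded queue with a drop counter):

```erlang
%% Toy bounded queue illustrating the buffer-then-drop behaviour above;
%% the real realtime queue in riak_repl is considerably more involved.
-module(rtq_sketch).
-export([new/1, push/2]).

-record(rtq, {max, len = 0, items, dropped = 0}).

new(Max) -> #rtq{max = Max, items = queue:new()}.

%% Enqueue while under the size limit; once full, drop the object and
%% bump the counter -- which is all the operator can see today.
push(Obj, #rtq{len = Len, max = Max, items = Q} = RTQ) when Len < Max ->
    {ok, RTQ#rtq{len = Len + 1, items = queue:in(Obj, Q)}};
push(_Obj, #rtq{dropped = D} = RTQ) ->
    {dropped, RTQ#rtq{dropped = D + 1}}.
```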

For longer outages, if we log the bkeys on a heavily loaded cluster the logs could get big fast, but perhaps that's something that would be useful to people. Frequently updated keys would have multiple entries in the logs.

One option we have is spilling the in-memory RTQ onto disk, but then you're just adding additional I/O load. We're working on integrating the merkle trees built for the AAE system into replication, so fullsync duration should become proportional to the number of differences between replicas rather than, as it is now, to the keyspace size. That work is targeted to land in a release after 1.4, although we expect to have a prototype running soonish.

nathanaschbacher commented 11 years ago

Do we buffer the whole object into the RTQ or just the keys?

Even if you never catch up you could clear the disk log on full-sync regardless, but that's really orthogonal to at least logging some detail about what specifically failed.

There's certainly the potential for additional I/O load, but how much is it realistically? There are dozens of systems with full transaction logs that handle thousands of ops/sec, and from my understanding RT-MDC failures don't happen with anywhere near that frequency, so my (albeit naive) assumption is that we wouldn't incur a significant penalty. Presumably this functionality could also be turned on/off in case a customer does share that concern?
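For example, a hypothetical app.config switch could gate it; the flag name below does not exist in riak_repl, it just illustrates the opt-in idea:

```erlang
%% Hypothetical app.config stanza: `log_dropped_objects` is not a real
%% riak_repl setting, it just illustrates the opt-in switch suggested above.
{riak_repl, [
    {log_dropped_objects, true}
]}
```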

Even if the implementation details for recovery are ultimately pushed off to the customer, they will at least have a route to proactive recovery if they know what failed. It might also allow spotting and analyzing problematic objects in the cluster that are constant culprits for failure.

It seems like relatively low-hanging fruit that would alleviate a pain point for paying customers and give us a better story for enterprise prospects.

michellep-basho commented 11 years ago

We had an issue last week with AT&T's mHealth environment around dropped objects. Randy, Brent, and I drafted an interim remediation plan, but we would also like to see a longer-term, less hacky solution. Thanks.

bookshelfdave commented 11 years ago

I thought I would mention the dropped object hooks that we implemented in Riak 1.2 for Best Buy: https://github.com/basho/riak_repl/commit/8ccfbefa475c8e2fa4ac45bb4e17e651b54d331e

This is a 1.3 "Default mode"-only feature (it doesn't work with Advanced mode yet, but we've talked about adding the hooks in 2.0 to keep Best Buy happy).

To use it, add an app.config setting under riak_repl named dropped_hook that takes a module:function/1. There's a default implementation that uses lager to print out the dropped object: https://github.com/basho/riak_repl/commit/8ccfbefa475c8e2fa4ac45bb4e17e651b54d331e#L2R591
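A minimal sketch of wiring that up, assuming the hook is configured as a {Module, Function} pair and receives the dropped object as its single argument; the exact format is in the linked commit, and the module below is illustrative only:

```erlang
%% Illustrative hook module only; the `dropped_hook` key comes from the
%% linked commit, but the exact config value format and this module are
%% assumptions -- check the commit before copying.
%%
%% In app.config (assumed format):
%%   {riak_repl, [
%%       {dropped_hook, {my_repl_hooks, log_dropped}}
%%   ]}
-module(my_repl_hooks).
-export([log_dropped/1]).

%% Called with the dropped object; the default hook logs it via lager,
%% but a customer could journal or re-enqueue it here instead.
log_dropped(Object) ->
    lager:warning("realtime repl dropped object: ~p", [Object]),
    ok.
```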

It's undocumented, as printing dropped objects out to logs could lead to terribly inefficient replication.

cmeiklejohn commented 10 years ago

Moving to 2.1.