basho / riak_repl

Riak DC Replication
Apache License 2.0

REDACTED realtime replication issue [JIRA: RIAK-1834] #685

Open binarytemple opened 9 years ago

binarytemple commented 9 years ago

REDACTED are seeing some strange behavior around the realtime replication queue. Could someone help interpret/explain this behavior and, if appropriate, suggest a mitigation strategy?

Observations:

A: realtime_queue_stats.bytes for prod01 increases until it reaches the max value, when it rapidly drops to zero.
B: The rate for both REDACTED and REDACTED is proportional to the traffic to the nodes. Around 6 PM the rate slows as the traffic slows down.
C: Because of the slow rate, REDACTED will not reach the max until tomorrow, which means we will have the problem all night instead of the usual 2 hours.

Questions:

1: Why does realtime_queue_stats.bytes drop to zero when it reaches the max?
2: Can we force the behavior that causes it to drop to zero, thereby fixing the problem?
3: Is there some large object that cannot be replicated and is only dropped when the queue reaches its max size?
4: How can we replicate the object? Changed options?
5: Can it be dropped sooner?
6: Is there a way to inspect this troublesome object that can't be replicated?
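(For context, the queue byte count being graphed above can also be polled directly on a node from riak attach. A minimal sketch, assuming riak_repl2_rtq:status/0 (mentioned later in this thread) returns a proplist that includes a bytes entry; the key names may differ across riak_repl versions:)

    %% From the Erlang shell on a source-cluster node (riak attach).
    %% Assumption: status/0 returns a proplist with a 'bytes' key.
    Status = riak_repl2_rtq:status(),
    proplists:get_value(bytes, Status).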

bsparrow435 commented 9 years ago

Added my comments to the ticket. Pasting here as well:

I read through the logs from the incidents and this looks to be a large object + possible network congestion causing repeated heartbeat timeouts. This behavior is being investigated by a few folks, but for now a good mitigation is to increase the heartbeat timeout to 60s or even higher. You can also disable the heartbeat entirely.
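(A minimal sketch of that mitigation, assuming rt_heartbeat_timeout / rt_heartbeat_interval are the relevant settings under the riak_repl application; check the MDC replication docs for your release before applying:)

    %% advanced.config / app.config on the source cluster (sketch only;
    %% setting names are an assumption, values are in seconds).
    {riak_repl, [
        %% raise the realtime heartbeat timeout from the default to 60s
        {rt_heartbeat_timeout, 60}
        %% ...or disable the heartbeat entirely:
        %% {rt_heartbeat_interval, undefined}
    ]}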

Questions:

1: Why does realtime_queue_stats.bytes drop to zero when it reaches the max? A: Nothing special happens when we reach max bytes other than logging a message any time we attempt to push something onto the realtime queue. Reading over this ticket, they note at least one instance where things cleared up before max_bytes was reached. My hypothesis would be that this behavior is a red herring and that this is more likely a heartbeat timeout caused by large objects rather than a yet-unseen bug around max bytes.

2: Can we force the behavior that causes it to drop to zero, thereby fixing the problem? A: Every time something is added to the rtq we call trim_q, which calls trim_q_entries if the current queue size plus the size of the incoming object is greater than max bytes (https://github.com/basho/riak_repl/blob/develop/src/riak_repl2_rtq.erl#L622). We eventually delete the first entry (which is the oldest queue entry) from the queue and move on. There is no special logic here which would explain an unresponsive realtime queue (https://github.com/basho/riak_repl/blob/develop/src/riak_repl2_rtq.erl#L665).
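(For illustration, a simplified paraphrase of that trim path; this is not the actual source, and the entry tuple shape and names are approximate:)

    %% Sketch: drop entries from the head (oldest) of the queue's ETS table
    %% until the incoming object fits under max_bytes.
    maybe_trim(QTab, IncomingSize, CurBytes, MaxBytes)
      when CurBytes + IncomingSize > MaxBytes ->
        case ets:first(QTab) of
            '$end_of_table' ->
                CurBytes;
            OldestSeq ->
                [{OldestSeq, _NumItems, Bin, _Meta}] = ets:lookup(QTab, OldestSeq),
                ets:delete(QTab, OldestSeq),
                maybe_trim(QTab, IncomingSize, CurBytes - byte_size(Bin), MaxBytes)
        end;
    maybe_trim(_QTab, _IncomingSize, CurBytes, _MaxBytes) ->
        CurBytes.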

3: Is there some large object that cannot be replicated and is only dropped when the queue reaches its max size? A: Yes, this is absolutely possible. Objects at the head of the queue will retry until they are successful or are removed from the queue by trim_q. If this were the case, you'd see repeated heartbeat timeouts until the queue was full and then activity again. A quick way to check this would be to disable the heartbeat logic in realtime and see whether normal operation resumes.

4: How can we replicate the object? Changed options? A: Once an object is dropped from the rtq, it will not be replicated unless a fullsync is run or another put operation is performed against the object.

5: Can it be dropped sooner? A: There is currently no way to drop a specific entry from the queue. You can drop the entire queue but not individual entries. This would require a code change to support, as the rtq ETS table is private and can only be accessed by the pid holding the ref.

6: Is there a way to inspect this troublesome object that can't be replicated? A: Yes, I'd expect to see log messages on the sink cluster for the blocked rtq regarding a large object being written repeatedly. Alternatively, during a queue buildup scenario, you can call riak_repl2_rtq:dumpq(), which will return an ordered list of entries in the RTQ. The problem object should be at the head or tail of this list (can't remember which end is the head of the ETS table; would need to look at the ets:tab2list code).
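(A rough way to surface the largest entry from that dump; a sketch only, since the exact shape of the entries returned by dumpq/0 may differ by version and flat_size is only a proxy for object size:)

    %% From riak attach during a buildup. Assumption: dumpq/0 returns a list
    %% of queue entries; erts_debug:flat_size/1 is used as a rough size proxy.
    Entries = riak_repl2_rtq:dumpq(),
    Sized = [{erts_debug:flat_size(E), E} || E <- Entries],
    %% Largest entry first:
    hd(lists:reverse(lists:keysort(1, Sized))).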

tl;dr: Recommend increasing the heartbeat timeout or disabling it altogether to see if the backup scenario persists. The problem object can be identified with sink-side logs or with riak_repl2_rtq:dumpq().

Thanks, Brian

jonmeredith commented 9 years ago

In the RTQ stuff, I believe the implementation will add the entry, then immediately trim it (derp) if it's larger than the max bytes. So if you had a stream of objects including one larger than the max bytes, it would be added and trimmed before rtq:status() could run, ending with an empty queue as it had to flush everything to get rid of the large object.

The problem with trimming the queue is that you don't know why the object failed.

For debugging, it could be useful to have some stats that run against the RTQ: min, max, and mean object sizes, and the top-10 largest objects/keys, or something like that.
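(Something along those lines could already be run ad hoc from riak attach; a sketch only, reusing dumpq/0 and flat_size/1 as a rough size proxy and assuming the queue is non-empty:)

    %% Ad-hoc RTQ size stats: min, max, mean, and the ten largest entries.
    rtq_size_stats() ->
        Sizes = lists:sort([erts_debug:flat_size(E) || E <- riak_repl2_rtq:dumpq()]),
        [{min,   hd(Sizes)},
         {max,   lists:last(Sizes)},
         {mean,  lists:sum(Sizes) div length(Sizes)},
         {top10, lists:sublist(lists:reverse(Sizes), 10)}].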


bsparrow435 commented 9 years ago

I agree that stats on the objects in the queue, as well as the time it takes an object to make it through the queue, would be very interesting and would help with debugging these object- and use-case-dependent issues.

binarytemple commented 9 years ago

Tada! As though by magic! https://gist.github.com/binarytemple/31ce133fb703e620c592

binarytemple commented 9 years ago

Is this going to go anywhere, or should I just close it out?