basho / basho_docs

Basho Products Documentation
http://docs.basho.com

MDC realtime replication heartbeats wait behind large objects being sent [JIRA: PMT-161] #1062

Open tamsky opened 10 years ago

tamsky commented 10 years ago

Currently, if a large object is being sent via MDC to a remote sink, any heartbeat messages sent from the source to the same sink will not be received until the large object has been completely transferred.

Possible remedies:

Admittedly, Basho currently advises a 5MB maximum object size when using realtime MDC replication, but even a 5MB object can be too large if the connection between the source and sink degrades even slightly. (Limited bandwidth, increased latency, or increased packet loss will all tend to decrease the minimum object size that can cause heartbeats to be lost.)
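To make that concrete, here is a rough back-of-the-envelope sketch (Python, with made-up link speeds) of how long a heartbeat can sit behind a single in-flight object, compared against the 15s default heartbeat timeout mentioned below:

```python
# Rough illustration: a heartbeat queued behind a large object on the same
# connection cannot be delivered until the object finishes transferring.
# Link speeds below are made up purely for illustration.

def heartbeat_delay_seconds(object_bytes, bandwidth_bytes_per_s):
    """Worst-case extra delay a heartbeat sees behind one in-flight object."""
    return object_bytes / bandwidth_bytes_per_s

TIMEOUT_S = 15                    # default realtime heartbeat timeout
OBJECT_BYTES = 5 * 1024 * 1024    # the advised 5MB maximum object size

for mbit_per_s in (100, 10, 2, 1):
    bw = mbit_per_s * 1_000_000 / 8        # bytes per second
    delay = heartbeat_delay_seconds(OBJECT_BYTES, bw)
    verdict = "OK" if delay < TIMEOUT_S else "exceeds 15s timeout"
    print(f"{mbit_per_s:>4} Mbit/s link: heartbeat delayed ~{delay:5.1f}s ({verdict})")
```

On a 2 Mbit/s link a single 5MB object already delays the heartbeat by about 21 seconds, well past a 15s timeout.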

jonmeredith commented 10 years ago

The heartbeat is deliberately in the message sending chain so that we can detect software errors as well as network errors. Large objects definitely cause issues - the only real workaround at the moment is to increase the timeout from the default 15s.

To improve the system, it may make more sense to add a periodic ping from the realtime sink and use that as the primary mechanism for keeping connections alive (if we're still getting acks or pings, something good is happening), and to use the RTT measured by the heartbeat for stats purposes.
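A minimal sketch of that idea, assuming nothing about the actual riak_repl internals: liveness is derived from any inbound traffic from the peer (acks or pings), while the heartbeat RTT is recorded separately and only reported as a stat.

```python
import time

class ConnectionMonitor:
    """Sketch: liveness from any inbound ack/ping; heartbeat RTT kept for stats only."""

    def __init__(self, liveness_timeout_s=15):
        self.liveness_timeout_s = liveness_timeout_s
        self.last_inbound = time.monotonic()
        self.heartbeat_sent_at = None
        self.last_rtt_s = None          # exported as a stat, not used for failover

    def on_inbound(self):
        # Any ack or ping from the sink proves the connection is still alive.
        self.last_inbound = time.monotonic()

    def on_heartbeat_sent(self):
        self.heartbeat_sent_at = time.monotonic()

    def on_heartbeat_reply(self):
        if self.heartbeat_sent_at is not None:
            self.last_rtt_s = time.monotonic() - self.heartbeat_sent_at
        self.on_inbound()

    def is_alive(self):
        return (time.monotonic() - self.last_inbound) < self.liveness_timeout_s
```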

tamsky commented 10 years ago

Thanks for the prompt reply.

Why or how the current 15s was chosen, or what the gain/loss tradeoff is if it is increased, is not described anywhere in the current docs (see #1061).

Another possible improvement/workaround would be to make the source/sink protocol channelized (think ssh's ability to multiplex channels over one connection).
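For what it's worth, the framing for such a channelized protocol could look something like the toy sketch below (this is not Riak's actual wire format, just an illustration of length-prefixed multiplexing): bulk data is cut into frames tagged with a channel id, so a small heartbeat frame can be interleaved between bulk frames instead of waiting behind an entire object.

```python
import struct

# Toy framing: 1-byte channel id + 4-byte payload length, then the payload.
# Channel 0 carries heartbeats, channel 1 carries bulk replication data.
HEADER = struct.Struct("!BI")

def frames(channel, payload, max_frame=8192):
    """Split a payload into (channel, length)-prefixed frames."""
    for off in range(0, len(payload), max_frame):
        chunk = payload[off:off + max_frame]
        yield HEADER.pack(channel, len(chunk)) + chunk

# A 5MB object becomes many small bulk frames...
bulk = frames(1, b"x" * (5 * 1024 * 1024))
# ...so a heartbeat frame can be slotted in between them at any point.
heartbeat = HEADER.pack(0, 4) + b"ping"

wire = []
for i, frame in enumerate(bulk):
    wire.append(frame)
    if i == 10:                 # a heartbeat becomes due mid-transfer
        wire.append(heartbeat)

print(f"{len(wire)} frames queued; the heartbeat waits behind at most one 8KB frame")
```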

jonmeredith commented 10 years ago

The 15-second default was a SWAG based on early adoption. We should document how to select it better.
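One way to document that selection is a rule of thumb: the heartbeat timeout has to cover at least the worst-case transfer time of the largest object over the slowest link you expect, plus headroom for latency and loss. A hypothetical helper (not an existing Riak tool) to illustrate:

```python
def suggest_heartbeat_timeout(max_object_bytes,
                              worst_case_bandwidth_bytes_per_s,
                              safety_factor=2.0,
                              floor_s=15):
    """Rule of thumb: the timeout must cover the worst-case time one object
    can hold up the connection, with headroom for latency and packet loss."""
    transfer_s = max_object_bytes / worst_case_bandwidth_bytes_per_s
    return max(floor_s, transfer_s * safety_factor)

# Example: 5MB objects over a link that can degrade to ~2 Mbit/s.
print(round(suggest_heartbeat_timeout(5 * 1024 * 1024, 2_000_000 / 8)))  # ~42 seconds
```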

In the v3 release we actually unchannelized relative to v2. Based on some experiments, we cannot run sockets at line speed with the current VM, so keeping them separate gives greater possible bandwidth and simplifies the design.


slfritchie commented 10 years ago

@jonmeredith Is there any value in splitting up big single "packet" writes into a series of smaller packets, e.g., 8KBytes? Then each time a single 8KByte "packet" is received, the timer can be reset. The timer could still expire while receiving the 8-or-whatever KBytes on a really congested/slow link, but ... since the VM's inet_drv can't tell us when the last byte was received on a socket, a useful liveness metric isn't otherwise available.
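A sketch of that approach (pure illustration, not riak_repl code): the sender cuts big writes into 8KByte pieces, and the receiver resets its idle timer whenever a piece arrives, so steady progress on a slow link still counts as liveness.

```python
import time

CHUNK = 8 * 1024    # 8KByte "packets" as suggested above

def send_in_chunks(sock, payload, chunk=CHUNK):
    """Sender side: one large write becomes many small writes."""
    for off in range(0, len(payload), chunk):
        sock.sendall(payload[off:off + chunk])

class IdleTimer:
    """Receiver side: any received chunk counts as progress."""

    def __init__(self, timeout_s=15):
        self.timeout_s = timeout_s
        self.deadline = time.monotonic() + timeout_s

    def on_chunk_received(self, nbytes):
        # Reset on every chunk, not only on complete objects or heartbeats.
        if nbytes > 0:
            self.deadline = time.monotonic() + self.timeout_s

    def expired(self):
        return time.monotonic() > self.deadline
```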

If two sockets are used, one for replication bulk data and one only for heartbeats (and perhaps other control PDUs), how many customer problems in our past have shown enough congestion to delay such a channel?

DSomogyi commented 9 years ago

Comment for Jira.