Netflix / dynomite

A generic dynamo implementation for different k-v storage engines
Apache License 2.0

Azure TCP timeout #714

Open mmriis opened 4 years ago

mmriis commented 4 years ago

This took a while to figure out.

DC 1 is a VM on Azure. DC 2 is an on-prem VM.

After a little while they got out of sync and connection timeouts occurred.

This is due to the way Azure handles TCP connections. All VMs are NATted 1:1 and TCP connections are silently dropped (no TCP RST) after 4 minutes of inactivity.

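Dynomite respects the Linux TCP keepalive values, but by default the keepalive probes only start after 2 hours of inactivity (net.ipv4.tcp_keepalive_time = 7200). So Azure drops the idle connection after 4 minutes, long before the first probe is sent, and Dynomite then gets a read timeout after 30 seconds and drops the packets.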

To fix it, I changed the TCP keepalive settings on the Linux VM to the following:

net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 8

This seems to have remediated the issue.
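
For anyone checking their own values, here is a rough sketch of the arithmetic behind these settings. It is only an illustration: the `KeepaliveSettings` class and its method names are made up for this example, the 4-minute idle timeout is the figure observed above, and the dead-peer estimate is the usual approximation (idle time plus interval times probe count), not an exact kernel guarantee.

```python
from dataclasses import dataclass

# Idle timeout observed on Azure in this report (assumption: 4 minutes).
AZURE_IDLE_TIMEOUT_S = 4 * 60


@dataclass(frozen=True)
class KeepaliveSettings:
    """Hypothetical holder for the three net.ipv4.tcp_keepalive_* sysctls."""
    time_s: int    # net.ipv4.tcp_keepalive_time
    intvl_s: int   # net.ipv4.tcp_keepalive_intvl
    probes: int    # net.ipv4.tcp_keepalive_probes

    def keeps_connection_alive(self, idle_timeout_s: int) -> bool:
        # The first probe must go out before the NAT drops the idle flow.
        return self.time_s < idle_timeout_s

    def dead_peer_detection_s(self) -> int:
        # Approximate time until an unresponsive peer is declared dead.
        return self.time_s + self.intvl_s * self.probes


LINUX_DEFAULTS = KeepaliveSettings(time_s=7200, intvl_s=75, probes=9)
TUNED = KeepaliveSettings(time_s=120, intvl_s=30, probes=8)

for name, s in (("defaults", LINUX_DEFAULTS), ("tuned", TUNED)):
    print(
        name,
        "first probe at", s.time_s, "s,",
        "survives idle timeout:", s.keeps_connection_alive(AZURE_IDLE_TIMEOUT_S), ",",
        "dead peer detected after ~", s.dead_peer_detection_s(), "s",
    )
```

With the tuned values the first probe goes out at 120 seconds, well inside the 4-minute window, so the connection never looks idle to Azure. With the Linux defaults the first probe would only be sent after 7200 seconds, by which point the connection has long been dropped.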