Comcast / sirius

A distributed system library for managing application reference data
http://comcast.github.io/sirius/
Apache License 2.0

Slow node eventually DDOS' itself attempting to catch up with others #48

Closed joercampbell closed 10 years ago

joercampbell commented 10 years ago

When a single node in a cluster is slow (in our case, it had a network interface operating at 10 Mb/s while its neighbors were operating at 1 Gb/s), it falls further and further behind the other nodes in the cluster in processing updates. As it falls behind, it attempts to 'catch up' with its friends, which makes its existing slowness worse: it effectively DDoSes itself by requesting catch-up information from those same friends. The slow node then causes queues to build on the other nodes until eventually one or more of them suffers a full Java GC, which for our installation (50 GB heap) stops the entire JVM for 2 minutes. That causes additional queues to fill and pushes the entire cluster to fall apart.

So:
1) A slow node in a cluster (in our case caused by an interface at 10 Mb/s when the others are at 1 Gb/s) starts falling significantly behind its friends on data updates.
2) As the node falls behind, it starts asking friend nodes in the cluster for updates to catch up.
3) This in turn causes the slow node to DDoS itself by flooding its slow interface with catch-up traffic.
4) That causes queues on the other boxes to start filling with catch-up messages bound for the slow node.
5) The queues are essentially unbounded, eventually landing the entire cluster in a really bad state.
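To make steps 4 and 5 concrete, here is a rough toy simulation of one peer's outbound queue toward the slow node. It is only a sketch: the `simulate` function, the per-tick rates, and the `max_queue` cap are made-up illustration values, not Sirius code, and the drop-when-full policy is just one possible mitigation, not necessarily what any Sirius release does.

```python
from collections import deque


def simulate(ticks, produce_per_tick, drain_per_tick, max_queue=None):
    """Return (queue depth after each tick, number of dropped messages).

    Hypothetical model of one peer's outbound queue toward the slow node.
    max_queue=None models an unbounded queue; an integer models a bounded
    queue that drops new catch-up messages once it is full.
    """
    queue = deque()
    dropped = 0
    depths = []
    for tick in range(ticks):
        # The peer enqueues catch-up messages bound for the slow node.
        for i in range(produce_per_tick):
            if max_queue is not None and len(queue) >= max_queue:
                dropped += 1  # shed load instead of buffering it
            else:
                queue.append((tick, i))
        # The slow link only drains a few messages per tick.
        for _ in range(min(drain_per_tick, len(queue))):
            queue.popleft()
        depths.append(len(queue))
    return depths, dropped


# Illustrative rates only: 10 messages produced per tick, 1 drained per tick.
unbounded_depths, unbounded_dropped = simulate(100, 10, 1)
bounded_depths, bounded_dropped = simulate(100, 10, 1, max_queue=50)

print("unbounded final depth:", unbounded_depths[-1])
print("bounded final depth:", bounded_depths[-1], "dropped:", bounded_dropped)
```

With the unbounded queue the depth grows linearly for as long as the slow node keeps asking, which is the heap pressure described above. With a cap, memory on the healthy peers stays flat and the cost moves to dropped catch-up messages, which the slow node would have to re-request later.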

comcast-jonm commented 10 years ago

@joercampbell : Can this now be closed?

joercampbell commented 10 years ago

Yes. We'll be pulling in the 1.2.1 release some time in the nearish future; if we have any problems, we'll open a new issue. Thanks.