Comcast / sirius

A distributed system library for managing application reference data
http://comcast.github.io/sirius/
Apache License 2.0

Subset of follower nodes missing data #126

Open jsnikeris opened 8 years ago

jsnikeris commented 8 years ago

We recently detected a situation where several follower nodes were missing data that was present in the uberstore of one of the Paxos-participating nodes. The affected nodes were all in the same datacenter, but not every node in that datacenter was affected (7 out of 28). Further, there seemed to be two sets of affected nodes, with the nodes in each set affected to a similar degree (e.g. nodes A, C, F, and G were missing 540 of a particular type of event, while B, D, and E were missing only 167 of that event). However, all of the missing events took place around the same time.

Our cluster spans three datacenters and has three parts:

The first part is composed of three Paxos-participating nodes, only one of which generates events that go out to the cluster. The other two nodes are for failover. All three nodes are in the same datacenter.

The second part is composed of what we call repeater nodes. Their responsibility is to distribute updates from the Paxos-participating nodes (in a different datacenter) to the client-facing nodes they share a datacenter with. That is to say, the sirius cluster config for repeater nodes lists only the Paxos-participating nodes, and the sirius cluster config for client-facing nodes lists only the repeater nodes. There are three repeater nodes in each datacenter.

The third part is composed of the nodes serving customer traffic.
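To make the three-part layout easier to picture, here is a minimal Python sketch of which nodes each tier lists in its cluster config. All node and datacenter names are hypothetical, and this is not sirius's actual config format. Only the relationships (repeaters list the Paxos-participating nodes, client-facing nodes list the repeaters in their own datacenter) come from the description above.

```python
# Hypothetical node names; only the shape of the topology is taken from the issue.
PAXOS_NODES = ["paxos-1", "paxos-2", "paxos-3"]  # three Paxos-participating nodes


def repeater_names(datacenter):
    """Names of the three repeater nodes in a datacenter (naming is an assumption)."""
    return [f"{datacenter}-repeater-{i}" for i in (1, 2, 3)]


def cluster_config(role, datacenter):
    """Return the node names that a node of the given role lists as its cluster."""
    if role == "repeater":
        # Repeaters list only the Paxos-participating nodes.
        return list(PAXOS_NODES)
    if role == "client":
        # Client-facing nodes list only the repeaters in their own datacenter.
        return repeater_names(datacenter)
    raise ValueError(f"role not modeled in this sketch: {role}")


if __name__ == "__main__":
    print(cluster_config("repeater", "dc-b"))
    print(cluster_config("client", "dc-b"))
```

Under this model, an update reaches a client-facing node only through one of the repeaters in its datacenter.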

I was able to obtain a copy of the uberstore directory from one of the Paxos-participating nodes (145) and one of the affected nodes (141). Using the waltool, I determined a sequence range that encompassed the missing events and extracted that same range from each uberstore:

As you can see, some individual events are missing, as well as one large chunk (546935891-546943916).
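For anyone who wants to repeat this kind of comparison, here is a minimal Python sketch of the diff step. It assumes the sequence numbers from each uberstore have already been extracted into plain lists of integers. How they are extracted is outside the sketch, and `missing_ranges` is a hypothetical helper, not part of sirius or the waltool.

```python
def missing_ranges(reference_seqs, candidate_seqs):
    """Return inclusive (first, last) ranges of sequence numbers that appear
    in reference_seqs but not in candidate_seqs.

    Consecutive missing numbers are merged into one range, so isolated
    missing events show up as (n, n) and large gaps as (first, last).
    """
    present = set(candidate_seqs)
    missing = sorted(s for s in set(reference_seqs) if s not in present)

    ranges = []
    for seq in missing:
        if ranges and seq == ranges[-1][1] + 1:
            # Extend the current range by one.
            ranges[-1] = (ranges[-1][0], seq)
        else:
            # Start a new range.
            ranges.append((seq, seq))
    return ranges


if __name__ == "__main__":
    # Made-up sequence numbers, purely for illustration.
    reference = list(range(100, 121))
    candidate = [s for s in reference if s not in (103, 107) and not 110 <= s <= 115]
    print(missing_ranges(reference, candidate))
    # prints [(103, 103), (107, 107), (110, 115)]
```

Here `reference` stands in for the sequence numbers from the Paxos-participating node and `candidate` for those from the affected follower.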

Some more information about our setup:

Please let me know if there is anything else you would like to know.