elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Sync issue causing data loss when node comes back online #11789

Closed ricardojmendez closed 9 years ago

ricardojmendez commented 9 years ago

Running Elasticsearch 1.6 on OS X, installed from Homebrew, I have the following situation:

It would seem like a bug; do let me know if this is expected to happen for some reason.

markwalkom commented 9 years ago

This is not what should happen. Are you seeing indices/documents go missing?

ricardojmendez commented 9 years ago

Yes. All documents that were inserted on Node 2 while Node 1 was down disappear.

s1monw commented 9 years ago

Can you give me some more information, i.e. the configuration of your index, whether you are using any special features, whether you are running on the same physical machine, etc.? Can you reproduce this problem, and if so, can you switch on debug logging so we can see what's going on?
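
For reference, one way to crank up discovery/cluster logging on a running 1.x node is via the cluster settings API (a hedged sketch; the host, port and the exact logger names you care about are assumptions - the same loggers can also be set in config/logging.yml):

```sh
# Sketch: raise discovery and cluster-service logging to DEBUG on a running
# node. "transient" means the setting is dropped on a full cluster restart.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "logger.discovery": "DEBUG",
    "logger.cluster.service": "DEBUG"
  }
}'
```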

bleskes commented 9 years ago

@ricardojmendez when node_1 came back online, did you see in the logs that it accepted node2 as a master? I'm looking for something like:

[2015-06-22 20:35:14,433][INFO ][cluster.service          ] [Infant Terrible] detected_master [Hulk 2099][dCuk05JITNCtrKKURPGYsw][Boazs-MBP.local][inet[/192.168.1.178:9300]], added {[Hulk 2099][dCuk05JITNCtrKKURPGYsw][Boazs-MBP.local][inet[/192.168.1.178:9300]],}, reason: zen-disco-receive(from master [[Hulk 2099][dCuk05JITNCtrKKURPGYsw][Boazs-MBP.local][inet[/192.168.1.178:9300]]])

ricardojmendez commented 9 years ago

@s1monw: Not using any special features I'm aware of - it was a vanilla install. Both are running on different machines. I've seen the issue twice: once when testing what would happen, and a second time when I had just moved to work on a laptop for a while, then came back with it to the same network where the original desktop was. I had meant to first back up the database in case it happened again, to have a repro case, but forgot.

I haven't tested since, will see about finding a time window.

@bleskes: In the logs for that day I see it on Node2. I believe Node1 was up before I brought Node2 into the network (I may have restarted it after booting Node1, as IIRC I was running it from the terminal).

bleskes commented 9 years ago

Thanks. I'm confused though - you said "Node 1 goes down" and "At some later point, Node 1 comes back up." I was referring to the logs at the moment node 1 comes back. I suspect it doesn't see node2, decides it's alone, and re-uses its local copy as a last resort (albeit a stale one).
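
If it helps narrow this down, a quick way to see which node each side currently considers master is the cat master API (a hedged sketch - the node1/node2 host names and default port are assumptions; run it against each node separately):

```sh
# Sketch: ask each node who it currently sees as master. If the two answers
# differ, the nodes have formed independent clusters (split brain).
curl 'http://node1:9200/_cat/master?v'
curl 'http://node2:9200/_cat/master?v'
```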

ricardojmendez commented 9 years ago

@bleskes: I was jetlagged when they rejoined the network, so looking back right now I can't be 100% sure that Node2 was on the network when I booted Node 1 (I had turned wireless off on the plane), which is why I thought to clarify it.

When Node 1 comes back, it does not have any log of finding Node2 as a master. It does have a record of Node2 joining, which matches that of Node2 finding Node1 as a master.

Regarding your scenario: I'd understand why Node 1 would keep a stale copy and not sync, but given that Node2 had data in it, would you expect that Node2 would then dump all new records in favor of Node1's old state?

When Node2 syncs with Node1, do they do it by exchanging transaction logs (or their equivalents) or by just having whomever gets designated as master propagate its current state?

bleskes commented 9 years ago

I'd understand why Node 1 would keep a stale copy and not sync, but given that Node2 had data in it, would you expect that Node2 would then dump all new records in favor of Node1's old state?

It's important to realize that once node1 was elected as master and designated its own copy as primary, that shard copy will accept (versioned) writes. Those writes may conflict with whatever operations may have found their way to node2's copy while node1 was down. There are a couple of ways to solve this conflict, almost all of them complex: things like CRDTs (compromising on functionality to only what CRDTs can do), trying to do some kind of last-write-wins and suffering all the pains of time shifting, offloading the pain to the user (repair on read), and so forth. In ES we chose to keep things simple and rely on the mastership of the master node and its decision to choose a primary shard. We never try to resolve operations from multiple copies but rather make sure all copies revert themselves to the primary shard (this is similar to algorithms like Raft).
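
As an aside on the "(versioned)" part: every document carries a _version that the primary bumps on each write, and a write that supplies an explicit version is rejected on a mismatch. A minimal sketch against a 1.x node (the index, type and id are made up):

```sh
# Sketch: optimistic concurrency control. The write below only succeeds if
# the document's current _version is 3; otherwise ES returns a 409
# VersionConflictEngineException.
curl -XPUT 'http://localhost:9200/myindex/mytype/1?version=3' -d '{
  "field": "value"
}'
```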

When Node2 syncs with Node1, do they do it by exchanging transaction logs (or their equivalents) or by just having whomever gets designated as master propagate its current state?

No, they don't exchange transaction logs - node2's copy is resynced from node1's primary, to make sure it will have a true copy of node1's primary.
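
If you want to watch that resync happen, shard recoveries (including which node a replica recovered from) can be inspected - a hedged sketch, assuming a 1.x version that ships the cat recovery API:

```sh
# Sketch: list per-shard recoveries, their type, source/target node and
# how many files/bytes were copied over.
curl 'http://localhost:9200/_cat/recovery?v'
```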

Lastly, I have to say that 2 is a tricky number in distributed systems, as there is no majority. A two-node cluster is good for not losing data due to hardware failures, but you have to make sure the nodes will not be able to operate independently (set min_master_nodes to 2). Without this you run the risk of split brains, and thus data loss and other bad things...
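
For completeness, a minimal sketch of that setting on 1.x (the full name is discovery.zen.minimum_master_nodes; it can live in each node's elasticsearch.yml or be applied dynamically - host and port here are assumptions):

```sh
# Sketch: require 2 master-eligible nodes before a master can be elected.
# In config/elasticsearch.yml on each node:
#   discovery.zen.minimum_master_nodes: 2
# Or dynamically, on a running cluster:
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 2
  }
}'
```

The trade-off is that a two-node cluster configured this way stops electing a master (and thus stops accepting writes) whenever either node is down, which is exactly what prevents a stale copy from taking over on its own.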

I'm going to close this now as we seem to have found the source of the problem. Do feel free to reopen if you feel differently.

ricardojmendez commented 9 years ago

Thanks. To be absolutely sure we're talking about the same case @bleskes (since there's mention of record versioning), on a two node scenario, if:

1) Node1 and Node2 lose contact with each other.
2) During that time,
   2.1) Node1 doesn't modify, insert or delete any records.
   2.2) Node2 doesn't modify or delete any existing records, only inserts new ones.
3) When they rejoin, Node1 gets elected as master.

Then, even though there are no conflicts or changes on the shared records, and only new records were inserted on Node2, Node2 will discard all its newly created records, given they don't exist on the master.

Is that correct?

bleskes commented 9 years ago

That's correct. At the moment we don't try to do a doc-by-doc analysis and figure out what is safe to keep and what should be rejected.

Note that point 3 is weird in normal procedure. If Node2 was up and a master, node1 will join it rather than elect itself (in which case everything goes OK). The issue is that node1 was elected as master first and node2 then joined it.