etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Write txn shouldn't End() on a failure #18679

Open shyamjvs opened 3 weeks ago

shyamjvs commented 3 weeks ago

What happened?

Offshoot of https://github.com/etcd-io/etcd/issues/18667#issuecomment-2394848548

When two nodes both execute the same write txn (i.e. the validation step already passed on both) but one ends up with a failed execution and the other with a success, we shouldn't let the failed node increment its consistent index (CI) and move on. It should instead fail explicitly and crash at the old CI, because we never expect a write txn to fail. Otherwise the side effects would be asymmetric across nodes and lead to data inconsistency.
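To make that invariant concrete, here is a minimal sketch with made-up names (store, applyCommittedEntry); it only illustrates the CI rule and is not etcd's real apply path:

```go
// Minimal, made-up sketch of the invariant -- not etcd's actual apply code.
package main

import "fmt"

type store struct {
	consistentIndex uint64 // CI: highest raft index whose effects are persisted
	kv              map[string]string
}

// applyCommittedEntry applies one raft entry that already passed validation on
// every node. If execution fails here, crashing at the old CI keeps the node
// replayable on restart; advancing CI anyway would silently diverge it from
// peers that applied the entry successfully.
func applyCommittedEntry(s *store, index uint64, key, value string, fail bool) {
	if fail {
		panic(fmt.Sprintf("unexpected write failure at index %d, keeping CI at %d", index, s.consistentIndex))
	}
	s.kv[key] = value
	s.consistentIndex = index // advance CI only after the write fully succeeded
}

func main() {
	s := &store{kv: map[string]string{}}
	applyCommittedEntry(s, 1, "foo", "bar", false)
	fmt.Println("CI:", s.consistentIndex)
}
```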

Here's where I think our code may be violating that:

https://github.com/etcd-io/etcd/blob/c1976a67172fff1e63ee1b0117d34e8be8c35b45/server/etcdserver/txn/txn.go#L307-L314

If I'm reading that correctly, when we run into a txn failure (which may include some partially executed writes), the txn is ended before the server panic is triggered, which means there could be a KV revision and CI bump that includes incorrect changes. So if/when the server restarts, it will assume it's caught up (through that badly applied txn) and move on to the next record, leading to inconsistent state. Reproducing such a failure is going to be hard, if I understand correctly, but it seems like we should panic without calling txn.End here?
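To illustrate the ordering concern, here is a toy sketch (backend, writeTxn, and the values are all made up; writeTxn.End() stands in for the real commit that bumps the revision). It only demonstrates why ending the txn before panicking persists partial writes across the crash:

```go
package main

import "fmt"

// backend and writeTxn are toy stand-ins: End() commits whatever is buffered,
// including the partial results of a failed txn, and bumps the revision.
type backend struct {
	revision int64
	data     map[string]string
}

type writeTxn struct {
	b       *backend
	pending map[string]string // partially executed writes live here
}

func (t *writeTxn) End() {
	for k, v := range t.pending {
		t.b.data[k] = v
	}
	t.b.revision++
}

// commitThenPanic mirrors the problematic order: the partial write and the
// revision bump are persisted before the panic, so after a restart the node
// looks "caught up" on an entry it never applied correctly.
func commitThenPanic(t *writeTxn, err error) {
	t.End()
	panic(err)
}

func main() {
	b := &backend{data: map[string]string{}}
	t := &writeTxn{b: b, pending: map[string]string{"half": "applied"}}

	defer func() {
		recover()                                       // simulate the process "restarting" after the panic
		fmt.Println("after crash:", b.revision, b.data) // partial write survived
	}()
	commitThenPanic(t, fmt.Errorf("txn execution failed midway"))
}
```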

What did you expect to happen?

The etcd node on which the write txn execution fails should crash without committing the failed transaction or applying its other side effects (like incrementing the KV revision or CI).

Anything else we need to know?

I don't have an observed failure or a confirmed repro for this issue, but I'm bringing it up as a potential risk in our current code.

Please share any thoughts on making the above change or trying to repro it, or let me know if you think this is a non-issue. /cc @serathius @ahrtr

serathius commented 3 weeks ago

Context for why this was introduced: https://github.com/etcd-io/etcd/pull/14149

shyamjvs commented 3 weeks ago

As I read through that change, it seems the goal was to keep timeouts of read-only txns from causing a panic. But the change also has the undesirable side effect of letting a partially executed write txn end before the panic.

shyamjvs commented 3 weeks ago

This comment from @ptabor was actually quite relevant (the miss seems to be not panicking right away):

I have small preference towards apply workflow being extremely strict and deterministic and any unexpected error causing premature exit from function (even in RO) code is something that should lead to investigation and fixing.

@ahrtr - please share any additional context you might have from that PR, or I can make a change to remove txn.End and test it.

ahrtr commented 3 weeks ago

Thanks @shyamjvs for the good catch. Right, we need to remove the txnWrite.End() call; otherwise the data might be partially committed into the backend storage (bbolt) before panicking. cc @lavacat

https://github.com/etcd-io/etcd/blob/f1aefa5e90f1c69236a0c71a7d37814aa538fcc4/server/etcdserver/txn/txn.go#L310-L311
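A paraphrased sketch of what the fixed error path could look like (the function, message, and txn interface below are made up for illustration, not the actual patch; the point is only the ordering: panic before End(), so nothing from the failed txn reaches bbolt):

```go
package main

import "go.uber.org/zap"

// txn is a made-up stand-in for mvcc.TxnWrite; End() is what commits to the backend.
type txn interface{ End() }

type noopTxn struct{}

func (noopTxn) End() {}

// finishWriteTxn paraphrases the proposed error handling: on an unexpected
// write-txn error, panic *before* End(), so the pending write buffer is simply
// dropped and the revision / consistent index stay at their pre-txn values.
func finishWriteTxn(lg *zap.Logger, txnWrite txn, err error) {
	if err != nil {
		// lg.Panic never returns, and txnWrite.End() is intentionally not called.
		lg.Panic("unexpected error during write txn", zap.Error(err))
	}
	txnWrite.End() // commit only on the success path
}

func main() {
	lg, _ := zap.NewDevelopment()
	finishWriteTxn(lg, noopTxn{}, nil) // success path: commits and returns
}
```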

serathius commented 3 weeks ago

Looks like a good place for a failpoint.
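For context, a rough sketch of how such a failpoint might be declared with gofail (the failpoint name unexpectedWriteTxnFail and the surrounding function are made up here, not existing etcd failpoints; gofail rewrites the magic comment into real code when failpoints are enabled at build time, and the point can then be armed at runtime, e.g. through the GOFAIL_FAILPOINTS environment variable or gofail's HTTP endpoint):

```go
package txnsketch

// execWriteTxn is a made-up wrapper around the write-txn execution step.
// The "// gofail:" magic comment declares a hypothetical failpoint named
// unexpectedWriteTxnFail; the commented line after it is the body gofail
// enables, letting a test inject a failure right before the normal execution
// path and exercise the panic-without-End behavior discussed above.
func execWriteTxn(apply func() error) error {
	// gofail: var unexpectedWriteTxnFail struct{}
	// panic("failpoint: injected write txn failure")

	return apply() // normal write txn execution
}
```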

shyamjvs commented 2 weeks ago

Sorry this took a while - I sent out a fix above.

@serathius I'm happy to add failpoints too, but had a few questions first around how to make them effective. Could we discuss over a call (maybe next SIG sync)?

serathius commented 2 weeks ago

See example in https://github.com/etcd-io/etcd/pull/17555

ahrtr commented 5 days ago

Open to track the backport effort https://github.com/etcd-io/etcd/pull/18749#issuecomment-2439675706