shyamjvs opened this issue 3 weeks ago (Open)
Context on why this was introduced: https://github.com/etcd-io/etcd/pull/14149
As I read through that change, it seems the goal was to keep timeouts of read-only txns from causing a panic. But the change also has the undesirable side effect of letting the partial write be ended (committed) before the panic.
This comment from @ptabor was actually quite relevant (the miss seems to be not panicking right away):
> I have small preference towards apply workflow being extremely strict and deterministic and any unexpected error causing premature exit from function (even in RO) code is something that should lead to investigation and fixing.
@ahrtr - please share any additional context you might have from that PR - or I can make a change to remove txn.End and test.
Thanks @shyamjvs for the good catch. Right, we need to remove the txnWrite.End(), otherwise the data might be partially committed into the backend storage (bbolt) before panicking. cc @lavacat
Looks like a good place for a failpoint.
Sorry this took a while, I sent out a fix above.
@serathius I'm happy to add failpoints too, but had a few questions first around how to make them effective. Could we discuss over a call (maybe next SIG sync)?
See example in https://github.com/etcd-io/etcd/pull/17555
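For readers without that PR open: failpoints in etcd are declared with gofail comment directives and are only compiled in when the tree is built with `make gofail-enable`. A minimal sketch of the general shape (the function and failpoint name here are hypothetical, not necessarily what PR 17555 adds):

```go
package sketch

// Illustrative only: without the gofail build step this is ordinary Go,
// because the directive below is just a comment. Both this function and the
// failpoint name are hypothetical.
func applyWriteTxn(apply func() error, end func()) error {
	if err := apply(); err != nil {
		return err
	}

	// With gofail enabled, a test can trigger this failpoint (for example
	// through the GOFAIL_FAILPOINTS environment variable or gofail's HTTP
	// endpoint) to inject a panic or a sleep right before the txn is ended.
	// gofail: var beforeTxnEnd struct{}

	end()
	return nil
}
```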
Open to track the backport effort: https://github.com/etcd-io/etcd/pull/18749#issuecomment-2439675706
What happened?
Offshoot of https://github.com/etcd-io/etcd/issues/18667#issuecomment-2394848548
When two nodes both execute the same write txn (i.e. the validation step has already passed on both), but one ends up with a failed execution and the other with a success, we shouldn't let the failed node increment its CI (consistent index) and move on - it should explicitly fail and crash at the old CI - because we never expect a write txn to fail. Otherwise it would cause asymmetric side effects across nodes and lead to data inconsistency.
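To make that invariant concrete, here's a minimal, hypothetical sketch of an apply loop that upholds it (the names and types are illustrative, not etcd's actual code):

```go
package sketch

import "fmt"

// Illustrative only: a simplified apply loop showing the invariant described
// above. entry, applier, and consistentIndex are hypothetical names, not
// etcd's actual types.
type entry struct {
	index uint64
	data  []byte
}

type applier struct {
	consistentIndex uint64 // last raft index whose apply fully succeeded
}

func (a *applier) applyEntry(e entry, applyWrite func([]byte) error) {
	if err := applyWrite(e.data); err != nil {
		// A write txn that already passed validation must never fail here.
		// Crash *before* advancing the consistent index, so that on restart
		// the node re-applies this entry instead of recording a different
		// outcome than its peers and silently diverging.
		panic(fmt.Sprintf("unexpected write-txn apply failure at index %d: %v", e.index, err))
	}
	a.consistentIndex = e.index // advance only after a fully successful apply
}
```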
Here's where I think our code may be violating that invariant:
https://github.com/etcd-io/etcd/blob/c1976a67172fff1e63ee1b0117d34e8be8c35b45/server/etcdserver/txn/txn.go#L307-L314
If I'm reading that correctly, when we run into a txn failure (which may include some partially executed writes), the txn is ended before triggering the server panic - which means there could be a KV revision and CI bump that includes incorrect changes. So if/when the server restarts, it will assume it's caught up (through that badly applied txn) and move on to the next record, leading to inconsistent state. Reproducing such a failure is going to be hard IIUC, but it seems like we should panic without calling txn.End here?
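To spell out the ordering concern, here's a minimal sketch of the two shapes being discussed (illustrative only - endTxn stands in for txnWrite.End(), and neither function is the actual code behind the link above):

```go
package sketch

import "fmt"

// currentShape: the txn is ended (committed) before the panic, so a partial
// write plus the revision/CI bump can already be persisted when the process
// dies, and the restarted node assumes this entry was applied correctly.
func currentShape(endTxn func(), applyErr error) {
	if applyErr != nil {
		endTxn() // commits whatever was partially applied
		panic(fmt.Sprintf("unexpected error in write txn: %v", applyErr))
	}
	endTxn()
}

// proposedShape: panic without ending the failed txn, so nothing is committed
// and the node comes back up at the old CI and old revision.
func proposedShape(endTxn func(), applyErr error) {
	if applyErr != nil {
		panic(fmt.Sprintf("unexpected error in write txn: %v", applyErr))
	}
	endTxn()
}
```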
What did you expect to happen?
The etcd node on which the write txn execution fails should crash without trying to commit the failed transaction (or causing other side effects like incrementing the KV revision or CI).
Anything else we need to know?
I don't have an observed failure or a confirmed repro for this issue, but I'm bringing it up as a potential risk in our current code.
Please share any thoughts on making the above change or trying a repro, or whether you think this is a non-issue. /cc @serathius @ahrtr