etcd test write timeout

MasseGuillaume commented 2 years ago

I think the etcd specification is incorrect:

https://github.com/anishathalye/porcupine/blob/7cbc8342681863b08e96432402d0588fc1defa88/porcupine_test.go#L151-L152

A write can timeout https://github.com/anishathalye/porcupine/blob/7cbc8342681863b08e96432402d0588fc1defa88/test_data/jepsen/etcd_000.log#L61

In this case, the Step function should not apply the write

anishathalye commented 2 years ago

The specification is correct. When the server receives a write operation, it performs the write.

From the client's perspective, an operation can time out. At that point, there are two possibilities, because the client can't tell whether the request was dropped by the network before reaching the server, or the request was received and processed but the response was dropped by the network before reaching the client. After a timed-out write operation, the client performs no more operations (you can verify this in that example history, client 4 does no more operations after that point).

How do we represent timed-out write operations as part of the history? They are operations with a known invocation time, but we have no upper bound on when the response was received (or whether the operation was received at all), which we can model with an upper bound of infinity. (If the operation wasn't received at all, then we can choose its linearization point to be after all reads, at which point it doesn't matter how we order writes.)

Here is where we handle all the write/CAS operations that time out, by appending their return (with unknown value) to the end of the history: https://github.com/anishathalye/porcupine/blob/master/porcupine_test.go#L285-L287

MasseGuillaume commented 2 years ago

After a timed-out write operation, the client performs no more operations (you can verify this in that example history, client 4 does no more operations after that point).

Great observation. I added an assertion to make sure that a process does not perform any operations after a timeout and it looks like the test dataset follows that rule.

~~Maybe the state could be marked as unknown after a write timeout. This way, the next time we have a successful write we can catch up. Read and CAS should be dropped until the next successful write.~~

MasseGuillaume commented 2 years ago

I found some counterexamples:

After a timed-out write operation, the client performs no more operations (you can verify this in that example history, client 4 does no more operations after that point).

Great observation. I added an assertion to make sure that a process does not perform any operations after a timeout and it looks like the test dataset follows that rule.

https://github.com/anishathalye/porcupine/blob/7cbc8342681863b08e96432402d0588fc1defa88/test_data/jepsen/etcd_100.log#L20-L31

also in etcd_101.log and etcd_102.log

MasseGuillaume commented 2 years ago

How do we represent timed-out write operations as part of the history? They are operations with a known invocation time, but we have no upper bound on when the response was received (or whether the operation was received at all), which we can model with an upper bound of infinity. (If the operation wasn't received at all, then we can choose its linearization point to be after all reads, at which point it doesn't matter how we order writes.)

Ok I understand now, setting a realy large return time does make sense since it would model both cases: a) the write was performed, b) the write was not performed and it's too far in the future to invalidate it with a read. Thanks a lot!

Ah another clarification, you only parse for timeout read https://github.com/anishathalye/porcupine/blob/7cbc8342681863b08e96432402d0588fc1defa88/porcupine_test.go#L203-L209

Therefore, all timeout write and timeout cas are appended at the end, after all read that are actually parsed and appended at the correct position. https://github.com/anishathalye/porcupine/blob/7cbc8342681863b08e96432402d0588fc1defa88/porcupine_test.go#L285-L287

anishathalye / porcupine

etcd test write timeout #10