Etcd robustness tests utilize new LazyFS feature to simulate power failure

serathius commented 1 year ago

What would you like to be added?

Followup from https://github.com/etcd-io/etcd/issues/16596

Why is this needed?

Improve etcd resiliency

serathius commented 1 year ago

cc @mj-ramos

serathius commented 1 year ago

I have read through the new LazyFS feature and found one problem. Robustness tests assume that they can control when the failure is injected. Reason is that we are testing etcd behavior correctness, not only whether data written to disk is consistent.

Example scenario for KillFailpoint :

We start etcd
Run set of traffic (read, write etc)
We crash etcd
We clear lazyfs cache (run echo "lazyfs::clear-cache" > /tmp/faults.fifo)
We restart etcd
We validate etcd state is consistent from before the start.

What we would need from LazyFS is to implement a way to inject a "reorder" or "split_write" at an arbitrary time, similar to how clear cache is invoked by via a unix socket. I also see that there is a new comment lazyfs::crash however I'm not sure how the parameters should be used.

@mj-ramos would it be possible to allow injecting "reorder" or "split_write" via a unix socket?

Also could you provide an example how to inject lazyfs::crash imminently to LazyFS?

mj-ramos commented 1 year ago

Hello! We are thrilled about the opportunity to contribute to etcd testing. In this discussion, I will explain why we have chosen to use a configuration file and present some solutions for integrating LazyFS into etcd testing.

The three newly introduced fault types differ significantly from the clear-cache fault:

crash: This fault allows the crashing of the file system either after or before the next specified system call for a given path. This fault is injected through the FIFO, similar to clear-cache. However, both faults have the drawback of being asynchronous, meaning that the exact moment when LazyFS acknowledges these faults is undefined.

Here is an example of usage: echo "lazyfs::crash::timing=before::op=write::from_rgx=0000[0-9].log" > /my/path/faults.fifo This means that LazyFS will crash itself before the next write system call it receives where the specified path matches the regex 0000[0-9].log. So, after acknowledging this fault, if a write for the path 00003.log is issued, LazyFS will crash before processing that write. We induce a crash in LazyFS to prevent further system calls from being issued by the application, thereby simulating power faults within a narrow time interval. For a better explanation one can read the chapter 4.3.1.3 of this document document.
reorder and split_write: These two fault types are configurable exclusively through the configuration file because they refer to specific system calls. For example, in the case of split_write, if an application exhibits the following trace:

write(path="hello.tmp",size=30,off=0) 
fsyn(path="hello.tmp")
rename(from="hello.tmp",to="hello.txt")
write(path="hello.txt",size=30,off=30)
write(path="hello.txt",size=5,off=60)
write(path="hello.txt",size=10000,off=65)

One might want to "split" the last write into two smaller ones and just persist one of these. In such case, LazyFS will be configured to split the 3th write issued to the file hello.txt (counting from the beginning of LazyFS's execution).

With that said, we can incorporate support for injecting these two fault types through the FIFO. There are two possible scenarios:

We can follow the same logic as the configuration file (i.e., inject fault at the nth write or group of consecutive writes). When LazyFS receives the fault through the FIFO, this marks the point when LazyFS starts counting the writes and groups of consecutive writes.

If one wants to be sure that LazyFS accounts for all etcd write operations, the fault can be sent through the FIFO right before starting etcd. We can even create another FIFO where we write information indicating that the fault has been acknowledged by LazyFS. etcd can then read from this FIFO and initiate its execution after that.

Alternatively, we can introduce a command to inject the split fault into the next big write or the reorder fault into the next group of consecutive writes. Nevertheless, it is essential to know that in between the time elapsed writing this fault to the FIFO and LazyFS processing and injecting the fault there might happen "big writes" and groups of consecutive writes. This could potentially lead to difficulties in reproducing tests.

serathius commented 1 year ago

I understand that using a on demand failure injection creates two problems as you noted:

Asynchronous - the injection request just set ups the failpoint, time of triggering it depends on application itself.
Not reproducible - We cannot rerun the test so the failpoint triggers on the same write operation.

Are there any problems I missed?

I don't think either are a problem for etcd robustness tests. See https://github.com/etcd-io/gofail library used by etcd to inject failpoints on critical code paths. It allows us to setup failpoints in both modes. On initial start via environment variables and on demand via http request.

In robustness tests we use the on the on demand mode. I would recommend having similar parity for modes of triggering failpoints in LazyFS.

To go over how we handle those problems, asynchromous is ok as we inject an etcd panic and wait for etcd process to crash within some expected time. Similar think can be done for LazyFS, we could setup a crash and wait for LazyFS process to exit.

As for reproducability, etcd robustness tests already don't have 100% reproducability as we are verifying parallel operations which we cannot guarantee order of execution. It's more important for us to know and control the timing of the failure injection, then having it be repeatable but happen at unknown time. In the report but you sent, you injected failure on first write. It's nice feature to inject failure on exact first write, but not very practical as most databases run for days or weeks without downtime, so testing the initialization is not the main concern.

As for the crash failpoint, my only concern is what the etcd database will see after LazyFS crash. Will the FUSE mount disappear and the resulting directory be empty? Or will the syscalls start failing and disk operations block indefinitely?

mj-ramos commented 10 months ago

Hi, I apologize for the late reply. I've been quite busy lately.

I get the idea, and it is doable. We will consider the introduction of such possibility in LazyFS in the near future.

When LazyFS crashes, the mount point becomes inaccessible, but the files are preserved in the root point. Attempting to execute system calls at this point will result in errors such as "Transport endpoint is not connected." In the case of etcd v3.4.25, for example, it stops executing upon encountering these errors as it cannot access its database.

stale[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

etcd-io / etcd

Etcd robustness tests utilize new LazyFS feature to simulate power failure #16597

What would you like to be added?

Why is this needed?