etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Etcd robustness tests utilize new LazyFS feature to simulate power failure #16597

Open serathius opened 1 year ago

serathius commented 1 year ago

What would you like to be added?

Followup from https://github.com/etcd-io/etcd/issues/16596

Why is this needed?

Improve etcd resiliency

serathius commented 1 year ago

cc @mj-ramos

serathius commented 1 year ago

I have read through the new LazyFS feature and found one problem: robustness tests assume that they can control when the failure is injected. The reason is that we are testing etcd behavior correctness, not only whether data written to disk is consistent.

Example scenario for KillFailpoint:

What we would need from LazyFS is a way to inject a "reorder" or "split_write" fault at an arbitrary time, similar to how the clear-cache fault is invoked via a unix socket. I also see that there is a new command, lazyfs::crash, however I'm not sure how its parameters should be used.

@mj-ramos would it be possible to allow injecting "reorder" or "split_write" via a unix socket?

Also, could you provide an example of how to inject lazyfs::crash immediately in LazyFS?

mj-ramos commented 1 year ago

Hello! We are thrilled about the opportunity to contribute to etcd testing. In this discussion, I will explain why we have chosen to use a configuration file and present some solutions for integrating LazyFS into etcd testing.

The three newly introduced fault types differ significantly from the clear-cache fault:

```
write(path="hello.tmp",size=30,off=0)
fsync(path="hello.tmp")
rename(from="hello.tmp",to="hello.txt")
write(path="hello.txt",size=30,off=30)
write(path="hello.txt",size=5,off=60)
write(path="hello.txt",size=10000,off=65)
```

One might want to "split" the last write into two smaller ones and persist just one of them. In such a case, LazyFS would be configured to split the 3rd write issued to the file hello.txt (counting from the beginning of LazyFS's execution).
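For illustration, such a fault might be declared in LazyFS's configuration file roughly as follows. This is a sketch based on my reading of the LazyFS documentation; the exact field names and paths are assumptions and may not match the current syntax.

```toml
# Hypothetical sketch of a programmable LazyFS fault: split the 3rd write
# issued to hello.txt into 3 parts and persist only the first part.
[[injection]]
type = "split_write"
file = "/mnt/lazyfs/hello.txt"
occurrence = 3
parts = 3
persist = [1]
```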

With that said, we can incorporate support for injecting these two fault types through the FIFO. There are two possible scenarios:

If one wants to be sure that LazyFS accounts for all etcd write operations, the fault can be sent through the FIFO right before starting etcd. We can even create another FIFO where we write information indicating that the fault has been acknowledged by LazyFS. etcd can then read from this FIFO and initiate its execution after that.

serathius commented 1 year ago

I understand that using on-demand failure injection creates two problems, as you noted:

Are there any problems I missed?

I don't think either is a problem for etcd robustness tests. See the https://github.com/etcd-io/gofail library used by etcd to inject failpoints on critical code paths. It allows us to set up failpoints in both modes: on initial start via environment variables, and on demand via HTTP request.

In robustness tests we use the on-demand mode. I would recommend having similar parity between modes of triggering failpoints in LazyFS.

To go over how we handle those problems: asynchronous injection is OK, as we inject an etcd panic and wait for the etcd process to crash within some expected time. A similar thing can be done for LazyFS; we could set up a crash and wait for the LazyFS process to exit.

As for reproducibility, etcd robustness tests already don't have 100% reproducibility, as we are verifying parallel operations whose order of execution we cannot guarantee. It's more important for us to know and control the timing of the failure injection than to have it be repeatable but happen at an unknown time. In the report you sent, you injected a failure on the first write. Injecting a failure on exactly the first write is a nice feature, but it is not very practical, as most databases run for days or weeks without downtime, so testing initialization is not the main concern.

As for the crash failpoint, my only concern is what the etcd database will see after LazyFS crash. Will the FUSE mount disappear and the resulting directory be empty? Or will the syscalls start failing and disk operations block indefinitely?

mj-ramos commented 10 months ago

Hi, I apologize for the late reply. I've been quite busy lately.

I get the idea, and it is doable. We will consider introducing such a possibility in LazyFS in the near future.

When LazyFS crashes, the mount point becomes inaccessible, but the files are preserved in the root point. Attempting to execute system calls at this point will result in errors such as "Transport endpoint is not connected." In the case of etcd v3.4.25, for example, it stops executing upon encountering these errors, as it cannot access its database.

stale[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.