etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Add out of space failpoint to robustness #18811

Open serathius opened 3 weeks ago

serathius commented 3 weeks ago

What would you like to be added?

Add a failpoint that simulates etcd running out of disk space.

Why is this needed?

Detect issues like https://github.com/etcd-io/etcd/issues/18810

RostakaGmfun commented 2 weeks ago

Hey @serathius, would that be a good first issue? If so, which approach would be preferred:

  1. Add gofail annotations to I/O-related code in storage and implement a failpoint that injects errors there.
  2. Configure test cluster nodes data directory to a smaller size and execute a big enough put operation.
  3. Same as 2, but write an arbitrary file to the data directory without touching etcd API.

It looks like the first approach allows more fine-grained control and can be arbitrarily timed, but it risks missing certain I/O calls. It also won't cover error handling by bbolt.

serathius commented 1 week ago

@RostakaGmfun thanks for your interest. The fact that you already described multiple solutions with different trade-offs shows that this is not very newcomer-friendly, but it also shows that you should be able to tackle it.

I think we could start with the first option you proposed and iterate if needed. What do you think?

More thoughts about each option:

Ad 1: Exactly as you described; it should still be good for a first iteration. The risk is that our implementation will not match the real-world scenario, but we don't need to be perfect as long as we show we can reproduce https://github.com/etcd-io/etcd/issues/18810.

Ad 2: I would skip it in favor of option 3. I would recommend avoiding a dependency on large writes; we already have issues with flakes due to the limited performance of robustness tests. Reproducing the first issue we found in etcd v3.5 in robustness tests required >1000 QPS. Now we are sometimes not able to hit 100.

Ad 3: As far as I know, setting up a volume mount with limited disk space requires root permissions, for example sudo mount -t tmpfs -o size=64M tmpfs datadir (at least based on how I reproduced the issue by Googling answers). I would prefer to first consider solutions that avoid that. Robustness tests are complicated enough to onboard onto; making their setup more complicated is the last thing we should go for. Maybe you could figure out a better way to do that?

One more option is to use a FUSE filesystem like https://github.com/dsrhaslab/lazyfs to inject out-of-disk-space errors. We already have an integration with lazyfs; however, it is resource-intensive, which limits our throughput. If we could improve performance, that would be my preferred long-term solution: use a ready-made tool to inject arbitrary disk errors.