cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage: Test behavior when disk fills up #19656

Closed bdarnell closed 6 years ago

bdarnell commented 6 years ago

Prior to #19447, certain disk errors (the most likely being ENOSPC) were not being handled correctly, and we suspect that inconsistent reads could be served after this had happened. We need more testing of our behavior after disk writes have failed.

One way to do this would be a process that alternately writes a file to fill up the disk (or maybe just fallocate()), waits a bit, then deletes the file (and restarts the cockroach process if it crashed). Maybe this would make sense as a new jepsen nemesis.
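
A minimal sketch of one such fill/wait/delete cycle, assuming a plain Python helper process running next to the node. The names `fill_disk` and `nemesis_cycle`, and the `is_running`/`restart` callbacks, are hypothetical and not part of CockroachDB or Jepsen. The file is grown by writing zero bytes, so a filesystem with transparent compression may not actually fill up; on Unix, `os.posix_fallocate` could be swapped in instead.

```python
import errno
import os
import shutil
import time


def fill_disk(path, leave_free=0, chunk_size=1 << 20, max_bytes=None):
    """Grow the file at `path` until the filesystem is (nearly) full.

    Stops when free space drops to `leave_free` bytes, when a write fails
    with ENOSPC, or after `max_bytes` bytes (a safety cap, useful for dry
    runs). Returns the number of bytes written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    chunk = b"\0" * chunk_size
    written = 0
    # buffering=0 so that ENOSPC surfaces on write() rather than on close().
    with open(path, "wb", buffering=0) as f:
        while max_bytes is None or written < max_bytes:
            if shutil.disk_usage(directory).free <= leave_free:
                break
            n = chunk_size if max_bytes is None else min(chunk_size, max_bytes - written)
            try:
                written += f.write(chunk[:n])
            except OSError as e:
                if e.errno == errno.ENOSPC:
                    break
                raise
    return written


def nemesis_cycle(path, hold_seconds, is_running, restart, **fill_kwargs):
    """One cycle: fill the disk, wait, delete the file, restart if crashed.

    `is_running` and `restart` are caller-supplied callbacks (hypothetical);
    how the cockroach process is checked and restarted is left to the caller.
    """
    written = fill_disk(path, **fill_kwargs)
    time.sleep(hold_seconds)
    os.remove(path)
    if not is_running():
        restart()
    return written
```

Calling `nemesis_cycle` in a loop gives the alternating behavior described above. Passing a small `max_bytes` allows a dry run that does not actually exhaust the disk.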

tbg commented 6 years ago

Adding this to Jepsen might still make sense for the correctness aspect of this, but we have the infra for doing this in roachtest-land available. The test that comes to mind is

  1. start a cluster with some background workload (one of the scaledata correctness tests comes to mind) and with a ballast file on one node
  2. fill up disk on the node with the ballast file
  3. wait until the process crashes
  4. verify that background workload does not stall (#7882)
  5. delete ballast file and restart node
  6. verify that node becomes healthy and participates in cluster again
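
The six steps above can be sketched as a small driver. This is only an illustration in Python, not roachtest code: the `cluster` and `workload` objects and all of their methods (`fill_disk`, `is_running`, `delete_ballast`, `restart`, `is_healthy`, `start`, `ops_completed`) are hypothetical stand-ins for whatever the test harness provides.

```python
import time


def wait_until(predicate, timeout, poll=1.0, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.0fs" % timeout)
        sleep(poll)


def run_disk_full_test(cluster, workload, victim,
                       timeout=600.0, poll=1.0, sleep=time.sleep):
    """Drive the disk-full scenario; returns the names of completed steps.

    All methods called on `cluster` and `workload` are assumed interfaces.
    """
    done = []

    # 1. Background workload; the ballast file is assumed to already exist
    #    on the victim node.
    workload.start()
    done.append("start_workload")

    # 2. Fill up the disk on the node that holds the ballast file.
    cluster.fill_disk(victim)
    done.append("fill_disk")

    # 3. Wait until the process on that node crashes.
    wait_until(lambda: not cluster.is_running(victim), timeout, poll, sleep)
    done.append("node_crashed")

    # 4. The workload must keep making progress on the remaining nodes.
    before = workload.ops_completed()
    sleep(poll)
    if workload.ops_completed() <= before:
        raise RuntimeError("background workload stalled")
    done.append("workload_progressing")

    # 5. Free space by deleting the ballast file, then restart the node.
    cluster.delete_ballast(victim)
    cluster.restart(victim)
    done.append("restarted")

    # 6. The node should become healthy and rejoin the cluster.
    wait_until(lambda: cluster.is_healthy(victim), timeout, poll, sleep)
    done.append("healthy")
    return done
```

Keeping the polling and sleeping injectable (`poll`, `sleep`) makes it possible to exercise the control flow against fakes before pointing it at a real cluster.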

tbg commented 6 years ago

#31187 also added the infra to run on charybdefs and inject these errors, in case we don't want to use fallocate.

tbg commented 6 years ago

Folding this into https://github.com/cockroachdb/cockroach/issues/7882 (the reverse of what I originally planned).