etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Add fio job file to test disk performance. #10577

Closed: matte21 closed this issue 5 years ago

matte21 commented 5 years ago

Disk performance is paramount to Etcd. https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md suggests measuring it with fio. But disk I/O can happen in a lot of different ways, and fio is complex to use. For a user who is not experienced with Etcd disk I/O and/or fio, but needs to assess whether their storage lives up to the requirements Etcd has, writing a meaningful fio job file that does I/O in the same way Etcd does is hard.

I think having such a file, or at least some guidelines on how to write one, would be extremely beneficial for users. There are different disk metrics which are crucial to Etcd (WAL f(data)sync duration, backend commit time). Maybe one file for each metric is needed? Maybe the CLI parameters in https://github.com/etcd-io/etcd/issues/10414#issuecomment-455227063 are good candidates? @hexfusion what do you think? In that comment you wrote that you wanted to add something similar to the repo.

hexfusion commented 5 years ago

@matte21 are you asking for an example of fio usage? Here is an incantation I have used in the past. I will add it to the docs unless you would like to; if you have a better version, feel free to improve mine.

fio --randrepeat=1 \
    --ioengine=libaio \
    --direct=1 \
    --gtod_reduce=1 \
    --name=etcd-disk-io-test \
    --filename=etcd_read_write.io \
    --bs=4k --iodepth=64 --size=4G \
    --readwrite=randrw --rwmixread=75

Does this answer your question?
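
Since the issue asks for a job file rather than command-line flags, here is a minimal sketch of the same options written as a fio job file. The file name etcd-disk-io-test.fio is a hypothetical choice; every option value is copied from the command above. The script only writes the file and does not run fio, because the job creates a 4G test file in the current directory.

```shell
#!/bin/sh
# Write a job file equivalent to the command-line flags above.
# The job file name is a hypothetical choice; option values mirror the command.
cat > etcd-disk-io-test.fio <<'EOF'
[global]
randrepeat=1
ioengine=libaio
direct=1
gtod_reduce=1
bs=4k
iodepth=64
size=4G
readwrite=randrw
rwmixread=75

[etcd-disk-io-test]
filename=etcd_read_write.io
EOF

# Running the job is left to the reader: it writes a 4G file in the current directory.
echo "Job file written. Run it with: fio etcd-disk-io-test.fio"
```

Options under [global] apply to every job in the file, so further jobs (for example one per metric) could be added as extra sections without repeating them.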

hexfusion commented 5 years ago

In general I agree. I should have done it a while ago; thanks for the reminder.

matte21 commented 5 years ago

@hexfusion I am measuring Etcd performance (we're using SSDs) and seeing that both backend commit and WAL f(data)sync durations are above the recommended thresholds reported at https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-apply-entries-took-too-long-mean and https://github.com/etcd-io/etcd/blob/master/Documentation/faq.md#what-does-the-etcd-warning-failed-to-send-out-heartbeat-on-time-mean, and I was looking for a fio job file to benchmark those. But the point of the issue was to abstract from my personal use case and have some fio job files added to the docs. I would have done it myself if I were able to; unfortunately, I am very inexperienced with fio, disk I/O and Etcd.

hexfusion commented 5 years ago

No problem at all, I will add this now.

MikeSpreitzer commented 5 years ago

@hexfusion : we wrote up something like what we think is needed. See https://www.ibm.com/blogs/bluemix/2019/04/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd/

hexfusion commented 5 years ago

@MikeSpreitzer thanks for doing this, I am excited to read it over the weekend. I will think about where best to link this from the docs, but if you have a vision please open a PR and we can add it.

MikeSpreitzer commented 5 years ago

Any news here?

matte21 commented 5 years ago

I opened a PR: https://github.com/etcd-io/etcd/pull/10685

cgwalters commented 4 years ago

> @hexfusion : we wrote up something like what we think is needed. See https://www.ibm.com/blogs/bluemix/2019/04/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd/

This link seems to be broken now.

MikeSpreitzer commented 4 years ago

See https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd

cgwalters commented 4 years ago

Thanks. So... there's one huge discrepancy between https://github.com/etcd-io/etcd/issues/10577#issuecomment-475624306 and that blog entry: the former uses --direct=1 and the latter does not. Does etcd really use O_DIRECT? It doesn't look like it to me. Using O_DIRECT (or not) has a lot of implications.

matte21 commented 4 years ago

> Does etcd really use O_DIRECT?

I don't remember for sure. But the fio parameters in the blog entry produce disk I/O which is much more similar to etcd's than the fio parameters in #10577 (comment) do (at least that was the case when we wrote the blog post). The fio parameters in the blog post were derived by comparing the system call traces of fio and etcd and by trying to make them as similar as possible in the parts that affect disk I/O. I clearly remember that with #10577 (comment) the portion of the system call trace describing disk I/O was significantly different from etcd's. So I'd say that if --direct=1 is missing from the blog entry, you should not use it (warning again: this was true some time ago and may have changed since).
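
To make the contrast concrete, here is a hedged sketch of a job that avoids O_DIRECT and instead issues sequential buffered writes with an fdatasync after each one, which is the kind of pattern the WAL f(data)sync metric is about. The job name, the directory test-data, and the size and block-size values are illustrative assumptions, not values taken from this thread; check them against the blog post and your own etcd traces before relying on the results.

```shell
#!/bin/sh
# Sketch of a WAL-like workload: sequential writes, fdatasync after every write,
# and no direct=1 (buffered I/O). Names and sizes below are assumptions.
mkdir -p test-data   # fio expects the target directory to exist

cat > etcd-wal-like.fio <<'EOF'
[etcd-wal-like]
rw=write
ioengine=sync
fdatasync=1
directory=test-data
size=22m
bs=2300
EOF

# Running the job is left to the reader; point "directory" at the disk under test.
echo "Job file written. Run it with: fio etcd-wal-like.fio"
```

In fio's output for such a job, the sync latency percentiles are the figures to compare against the WAL fsync thresholds referenced in the etcd FAQ.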

ml0renz0 commented 11 months ago

> See https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd

This link is also broken. I'll leave an archive.org link here in case someone comes looking for it, as I did: https://web.archive.org/web/20210527090640/https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd