SovereignCloudStack / openstack-health-monitor

Script to monitor whether an OpenStack cloud is working correctly
https://scs.community/

Feature request: Measure IO bandwidth & latency #173

Open garloff opened 6 months ago

garloff commented 6 months ago

We could do something like fio --rw=randrw --name=test --size=500M --direct=1 --bs=16k --numjobs=4 --group_reporting --runtime=12 and report (average) Bandwidth, IOPs and the percentage of I/O latency > 10ms. The results could end up in the influxdb/grafana (and of course be reported to the console).
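A minimal sketch of the parsing step, assuming fio's usual "lat (msec)" distribution line format; the sample line below is made up for illustration and would in practice come from the fio run above:

```shell
# Sketch: extract the share of I/Os with >=10ms latency from fio output.
# The BENCH line is a hypothetical sample in fio's distribution format.
BENCH='  lat (msec)   : 2=12.49%, 4=20.52%, 10=1.34%, 20=0.11%, 50=0.01%'
# Pick the "lat (msec)" line and cut out the percentage in the 10ms bucket.
FIOLAT10MS=$(echo "$BENCH" | grep 'lat (msec)' | grep ', 10=' \
  | sed 's/^.*, 10=\([0-9\.]*\)%.*$/\1/')
echo "fioLat10ms: $FIOLAT10MS"
```

The same grep/sed pipeline can feed the value into influxdb/grafana alongside the bandwidth and IOPS numbers.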

Nils98Ar commented 5 months ago

@garloff Would you rather use the smallest mandatory SSD flavor with export JHFLAVOR="SCS-2V-4-20s" in the run_*.sh script, or create a smaller one (e.g. SCS-1V-2-10s) and use that?

Maybe we should even measure both volumes and local storage performance in the future?

Currently our mean value for fioLat10ms with a cinder volume root disk is 1.49% and therefore too high for etcd, if I understood you correctly.

Nils98Ar commented 5 months ago

Apparently using an SSD flavor for the jumphost via JHFLAVOR is not enough and it still uses a volume as the root disk?

At least fioLat10ms does not change, but it does when I create an SSD-flavor instance manually and run the command there:

debian@test-ssd:~$ BENCH=$(cd /tmp; fio --rw=randrw --name=test --size=500M --direct=1 --bs=16k --numjobs=4 --group_reporting --runtime=12; rm test.?.? 2>/dev/null)
debian@test-ssd:~$ echo "$BENCH" | grep '  lat (msec)' | grep ', 10=' | sed 's/^.*, 10=\([0-9\.]*\)%.*$/\1/'                               
0.01

Compared to a volume root disk instance:

debian@test-ceph:~$ BENCH=$(cd /tmp; fio --rw=randrw --name=test --size=500M --direct=1 --bs=16k --numjobs=4 --group_reporting --runtime=12; rm test.?.? 2>/dev/null)
debian@test-ceph:~$ echo "$BENCH" | grep '  lat (msec)' | grep ', 10=' | sed 's/^.*, 10=\([0-9\.]*\)%.*$/\1/'                              
1.34

What should I do to make the jumphost use the nova disk?

garloff commented 4 months ago

With ~1.5% of writes above 10ms latency, you'll see some spurious leader changes with etcd. Probably not yet breaking it, but not very robust either.
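To make that actionable in the monitor, a threshold check could be a few lines of shell; the 1.0% cutoff used here is an illustrative assumption, not an official etcd limit:

```shell
# Sketch: classify a measured fioLat10ms value (percent of I/Os slower
# than 10ms). The 1.0 cutoff is an assumed, illustrative threshold.
check_lat10ms() {
  if awk -v v="$1" 'BEGIN { exit !(v > 1.0) }'; then
    echo "$1: risky for etcd"
  else
    echo "$1: ok"
  fi
}
check_lat10ms 0.01   # local SSD measurement from the thread above
check_lat10ms 1.34   # Ceph volume measurement from the thread above
```

Such a check could drive an alert in grafana instead of only logging the raw percentage.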

For the JumpHosts, we currently create a volume manually that we use for booting. We don't do this for the normal VMs (although they do get a volume via nova for diskless flavors). I could add an option to NOT do this, so you can measure local disk performance. We could also add more disk measurements by also running fio on some of the normal VMs, not just the jumphosts. (But I don't think you want to have many VMs created with the SSD flavors, so we'd still need this local disk option for the JumpHosts.)

Nils98Ar commented 4 months ago

If I understood correctly, with -Z from #184 you can switch the disk measurements from volume to local storage disk? Thank you for that!

Would there be an easy way to also implement measuring both volume and local storage disk?

garloff commented 4 months ago

With -Z you disable the manual creation of a volume for the Jump Hosts to boot from. This means that you will get whatever the Jump Host flavor says:

garloff commented 4 months ago

As for measuring both:

Is this what you want?

Maybe we wait for the next generation health monitor from VP12 before adding another three lines...

Nils98Ar commented 4 months ago

Sounds good, but for me it would also be okay to wait for the new health monitor :)