axboe / liburing

Library providing helpers for the Linux kernel io_uring support
MIT License

io-wait used 100% cpu after 6.1.39 and 6.5.1 #943

Open beldzhang opened 11 months ago

beldzhang commented 11 months ago

After this commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.1.39&id=f32dfc802e8733028088edf54499d5669cb0ef69, a running io_uring instance causes one CPU to show 100% io-wait usage; in my environment, 8 rings/threads make 8 CPUs show 100% io-wait. Beside this, nothing else unusual was found.

Reverting this commit from 6.1.39 makes everything OK again. 6.1.51/6.5.1 show the same behaviour; the commit cannot be reverted directly there, so that is not tested yet.

All of the following kernels got the same performance result in testing: 6.1.38, 6.1.39, 6.1.39 (reverted), 6.1.51.

To reproduce: build this test program: https://github.com/axboe/liburing/files/9571382/issue-643-v2.zip, and just run the server; it will show 100% io-wait on one CPU.
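For context, a minimal sketch of the wait pattern involved (this is an illustration, not the linked test program): a thread parked in io_uring_wait_cqe() sleeps in the kernel, and since the commit above that sleep is accounted as io-wait.

```c
/* Minimal sketch: a thread blocked in io_uring_wait_cqe() with no
 * completions pending. Since the commit above, this sleep is accounted
 * as io-wait, so top shows 100% wa on one CPU even though no CPU
 * cycles are being burned. */
#include <liburing.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    int ret;

    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    /* No SQEs submitted: this blocks indefinitely, sleeping in the
     * kernel. Watch the iowait column in top/vmstat while it runs. */
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret < 0)
        fprintf(stderr, "wait_cqe: %d\n", ret);

    io_uring_queue_exit(&ring);
    return 0;
}
```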

redbaron commented 11 months ago

Isn't IO wait time what you want to see? That is the time the app is blocked waiting on IO (at least it was the case for file IO), but the CPU has nothing else to do, so that time is accounted as IO wait.

isilence commented 11 months ago

Note that io-wait doesn't burn CPU cycles; it's sleeping, so it's not a problem apart from reporting. There was a change doing that, and I think it actually makes more sense to report a task waiting for io_uring completions as io-wait.

axboe commented 11 months ago

Yes this is expected. iowait literally just means "waiting on IO", which is what the task is doing. It does NOT mean that it's busy 100% of the time, in fact if you have 100% iowait it means 1 task is sleeping waiting on IO 100% of the time.
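(For anyone who wants to watch the counter directly: the iowait that top/vmstat report comes from the fifth field of the cpu lines in /proc/stat, per proc(5). A minimal sketch of sampling it; note it is a cumulative tick counter, so rates come from deltas between samples.)

```c
/* Sketch: sample the aggregate iowait counter from /proc/stat.
 * Per proc(5), the "cpu" line fields are: user nice system idle iowait ...
 * Values are cumulative USER_HZ ticks; diff two samples for a rate. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/stat", "r");
    unsigned long long user, nice, system, idle, iowait;

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu",
               &user, &nice, &system, &idle, &iowait) == 5)
        printf("iowait ticks: %llu\n", iowait);
    fclose(f);
    return 0;
}
```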

beldzhang commented 11 months ago

> Yes this is expected. iowait literally just means "waiting on IO", which is what the task is doing. It does NOT mean that it's busy 100% of the time, in fact if you have 100% iowait it means 1 task is sleeping waiting on IO 100% of the time.

Noticed that when this happens, system load is only 0.0x.

We are using io_uring in a storage service, which is very sensitive to storage/network load, and previously io-wait was a good indicator to check this. Before kernel 5.0, iostat's %util value could be used as well, but after 5.0 that number easily goes up to 100% even under a small load; many articles also say it is not reliable. Is there any other way to check the disk load?

I also calculate a performance score after each test based on total CPU usage; just ignoring the io-wait part does not look like a good solution...

beldzhang commented 11 months ago

Looks like there are no more comments; closing.

romange commented 9 months ago

@axboe, one of Dragonfly's users also reported this as a behavioural change: Dragonfly, which does not use disk IO, bumps the IOWAIT metric up to 100%. If they run it using the epoll API, it does not affect IOWAIT. I am just double-checking whether this change is indeed intended.

rickytato commented 9 months ago

I see the same issue with kernel 6.5.11 (on Proxmox).

beldzhang commented 8 months ago

emmm... reopen?

RX14 commented 6 months ago

IOwait has been traditionally thought of as "waiting for disk io", which will always complete. Since io_uring can be used to wait on the network, which has unbounded waiting time, it changes the metric considerably. For example, many monitoring systems have alerts for iowait being high, correctly or not assuming it to be a proxy for disk contention.

axboe commented 6 months ago

Here's what I think we should do:

1) Default to not using iowait, as it is indeed somewhat confusing for networked or mixed network/storage workloads. I do think iowait is an awful metric that makes very little sense for async workloads, even just pure storage based ones. Lots of consumers will assume it's busy time, or has a direct correlation with disk usage, which is just wrong.

2) Add an IORING_ENTER_IOWAIT flag that can be used in conjunction with IORING_ENTER_GETEVENTS. If set, iowait will be used. Storage can use this, if they so wish.

3) Add an IORING_FEAT_IOWAIT flag, which tells the app/liburing that this feature is available.

4) Add liburing helpers, ala io_uring_set_iowait() and io_uring_clear_iowait(), which can be used to toggle this flag, if IORING_FEAT_IOWAIT is set. Storage based workloads can set this.

And that should be it. That gives the app control over whether iowait should be used or not.
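For illustration only, a sketch of how points 1-4 might look from the application side. Every name here except io_uring_queue_init and ring->features comes from the proposal above; per later comments in this thread the proposal was not merged, so none of these symbols exist in liburing:

```c
/* Hypothetical usage of the proposed (never merged) API. The flag and
 * helper names are taken from the proposal above, not from liburing. */
#include <liburing.h>

void setup(struct io_uring *ring)
{
    io_uring_queue_init(64, ring, 0);

    /* Point 3: the kernel would advertise the feature at ring setup. */
    if (ring->features & IORING_FEAT_IOWAIT) {
        /* Point 4: a storage-heavy workload opts back in to iowait
         * accounting; under the hood this would set the proposed
         * IORING_ENTER_IOWAIT flag on GETEVENTS waits (point 2). */
        io_uring_set_iowait(ring);
    }
    /* Point 1: without the opt-in, waits would no longer be
     * accounted as iowait by default. */
}
```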

beldzhang commented 6 months ago

Will test when available.

isilence commented 6 months ago

@axboe, I just mentioned it on the mailing list: even though I don't understand why people are taken aback by high iowait from io_uring waiting, I think we should just revert that change; there have been too many reports from different people about this one. We should be able to do the optimisation that was the reason for the change without reporting iowait.

axboe commented 6 months ago

We can't just revert it, as it solved a real problem. I have my doubts that we can separate the cpufreq side from iowait in a way that would make the scheduler side happy. If we can, I'd be all for it, and would love to see a patch.

romange commented 5 months ago

A great discussion about this topic on lore.kernel.org.

Just so we understand: once we call io_uring_register_iowait, it will flag networking I/O as iowait, but io_uring will run in a more efficient manner?

Another interesting comment I read is about multiple rings. Currently https://github.com/romange/helio has a ring-per-thread architecture. @axboe, are you saying that it sometimes makes sense to have two rings? For what use cases does it make sense?

isilence commented 5 months ago

> A great discussion about this topic on lore.kernel.org.
>
> Just so we understand: once we call io_uring_register_iowait, it will flag networking I/O as iowait, but io_uring will run in a more efficient manner?

The long story: there has been a patch upstream for a while that does two unrelated things. First, it enables a CPU-governor optimisation useful for QD1 (and not only QD1); second, it changes the io-wait stat behaviour as described in this thread. They're coupled together for implementation reasons; it's much easier that way. So the optimisation is already in the kernel and always enabled; let's say it's a free lunch. Now, that io_uring_register_iowait() patch would disable the optimisation by default and turn it back on only if you call the function.
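Roughly why the two are coupled (a simplified sketch of the kernel-side pattern, not the actual patch): the same per-task in_iowait flag feeds both the /proc/stat iowait accounting and the cpufreq governor's iowait boost, so enabling the boost through this path changes the reported stat as well.

```c
/* Simplified sketch of the kernel-side coupling, not the literal patch.
 * io_schedule_prepare() sets current->in_iowait; that one flag both
 * (a) bumps the runqueue's nr_iowait, which /proc/stat reports as
 * iowait, and (b) makes the cpufreq governor apply its iowait boost
 * on wakeup. Hence getting the boost also changes the reported stat. */
static inline void cqring_wait_sketch(void)
{
    int token = io_schedule_prepare();  /* sets current->in_iowait = 1 */
    schedule();                         /* sleep accounted as iowait   */
    io_schedule_finish(token);          /* restores previous in_iowait */
}
```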

I have to say it's quite a horrendous approach: side effects from what looks like a pure optimisation, mixing responsibilities and the levels at which the feature is enabled and the iowait stat is observed, and so on. I think the register_iowait patch should never see the light of day, at least insofar as it mixes these things together.

isilence commented 5 months ago

> Another interesting comment I read is about multiple rings. Currently https://github.com/romange/helio has a ring-per-thread architecture. @axboe, are you saying that it sometimes makes sense to have two rings? For what use cases does it make sense?

IMHO, it doesn't make sense apart from maybe some weird IOPOLL + normal ring cases. However, sometimes it happens (unfortunately), for instance when a library/framework you use has io_uring support inside, and the app then creates another ring for its own purposes.
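To illustrate that one case, a minimal sketch (an assumption of how such a split might be set up, not a recommendation): an IOPOLL ring for polled O_DIRECT storage I/O next to a normal ring for everything else.

```c
/* Sketch of the one arguably sensible two-ring split mentioned above:
 * an IOPOLL ring for polled storage reads/writes (requires files
 * opened with O_DIRECT) alongside a regular ring for everything else. */
#include <liburing.h>

int setup_rings(struct io_uring *polled, struct io_uring *normal)
{
    int ret;

    /* Completions reaped by busy-polling; storage requests only. */
    ret = io_uring_queue_init(64, polled, IORING_SETUP_IOPOLL);
    if (ret < 0)
        return ret;

    /* Ordinary ring for network and other requests. */
    ret = io_uring_queue_init(64, normal, 0);
    if (ret < 0)
        io_uring_queue_exit(polled);
    return ret;
}
```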

axboe commented 5 months ago

There will be no register iowait; the current pending fixes are here:

https://git.kernel.dk/cgit/linux/log/?h=iowait.2

and will be posted for review soon, so they can get into the 6.10 kernel.

beldzhang commented 5 months ago

> https://git.kernel.dk/cgit/linux/log/?h=iowait.2

@axboe Briefly tested: io-wait is gone; will follow up. Also following the mailing list.

@isilence End users are sensitive to latency and high server response times; to sysadmins, io-wait and load directly show the situation. The storage part is generally the slowest in the whole system. Users/admins don't care whether the waiting on IO reads/writes is sync or async; they just want to know how loaded the entire server is. I already removed the iostat %util display because it's nonsense now, but 100% iowait from io_uring on a CPU terrifies a lot of users/admins.

beldzhang commented 4 months ago

Ready to test; is this in for-6.10/io_uring, for-6.10/block, or for-next? Thanks.

beldzhang commented 2 months ago

emmm.... any updates?

solarvm commented 1 month ago

still happening for us too

isilence commented 1 month ago

Nothing has been merged yet, as it's a low-priority reporting issue. However, there is interest in it for some other reasons, and it's in the backlog; it will hopefully get picked up soon.