cloudbase / wnbd

Windows Ceph RBD NBD driver
GNU Lesser General Public License v2.1

High CPU usage when there is no IO active #98

Closed pcuzner closed 1 year ago

pcuzner commented 1 year ago

I'm running inside libvirt-based VMs and noticed that, after my test workload completed, the CPU was still busy in the W2K19 server and never returned to nominal levels (it's basically consuming a full core all the time).

Here's the fio job:

[global]
time_based=1
#directory=c\:\fio
filename=\\.\PHYSICALDRIVE1
numjobs=1
runtime=60
ioengine=windowsaio
ramp_time=10
clocksource=gettimeofday
refill_buffers
direct=1
size=5G
thread=1

[workload]
readwrite=randrw
blocksize=4KB
iodepth=4
rate_iops=45,5

And here's a perfmon trace showing the period of I/O in blue and the CPU usage of the rbd-wnbd#1 process in red.

[image: perfmon trace]

This is the version info

PS C:\Users\Administrator\Documents\fio> wnbd-client.exe version
wnbd-client.exe: 0.3.1-29-g9a02146
libwnbd.dll: 0.3.1-29-g9a02146
wnbd.sys: 0.3.1-29-g9a02146

Any ideas?

petrutlucian94 commented 1 year ago

I think I saw the same issue yesterday while running driver certification tests. In my case, one of the three OSDs crashed right before that (it just ran out of disk space).

I'll reproduce the issue and attach a profiler, Windows Performance Recorder, which should help us pinpoint the problem.

Thanks again for the report.

petrutlucian94 commented 1 year ago

Good news: I've reproduced the issue, and it seems to be caused by the Ceph network disconnect procedure, potentially related to the new poll driver. It's calling PollDriver::event_wait indefinitely, so most probably a return code isn't checked properly: https://github.com/ceph/ceph/blob/26f55e2064f5be1332e5793dc2418f0a43438bb1/src/msg/async/EventPoll.cc#L160-L191

LE: I can confirm that it's caused by the poll driver; switching to the select driver fixes the issue: https://github.com/ceph/ceph/blob/510284b66513490445619d1430aa869868c71a09/src/msg/async/Event.cc#L130-L133. I'll still need to fix the poll event driver, though, since it was introduced in order to avoid the select limitations.

LLE: I've submitted a fix and also applied it to our downstream fork. New MSIs containing the fix should be available by tomorrow.
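For readers unfamiliar with this failure mode, here is a minimal sketch of how a poll-based event loop can pin a core after a disconnect. It is not the Ceph code and may not match the actual defect. It assumes one common cause: a descriptor whose peer has gone away keeps reporting a hang-up, so every poll call returns immediately unless the loop reacts to that event. The function name count_wakeups and its parameters are hypothetical, and the example uses Python's POSIX-only select.poll with a pipe standing in for a socket (on Windows the analogous call would be WSAPoll).

```python
import os
import select

# Events that poll reports even when the caller did not ask for them.
ERROR_EVENTS = select.POLLHUP | select.POLLERR | select.POLLNVAL


def count_wakeups(drop_on_hangup, max_iterations=1000):
    """Poll the read end of a pipe whose writer is closed.

    Returns how many times poll reported an event. If the loop ignores
    hang-up events, every iteration wakes up immediately (a busy loop).
    """
    read_fd, write_fd = os.pipe()
    os.close(write_fd)  # simulate the peer disconnecting

    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    registered = True
    wakeups = 0
    try:
        for _ in range(max_iterations):
            if not registered:
                break  # nothing left to wait for
            # A 10 ms timeout: a healthy idle loop would sleep here.
            for fd, flags in poller.poll(10):
                wakeups += 1
                if drop_on_hangup and flags & ERROR_EVENTS:
                    # React to the hang-up: stop watching the dead descriptor.
                    poller.unregister(fd)
                    registered = False
    finally:
        os.close(read_fd)
    return wakeups


if __name__ == "__main__":
    print("ignoring hang-ups:", count_wakeups(drop_on_hangup=False))
    print("handling hang-ups:", count_wakeups(drop_on_hangup=True))
```

With hang-ups ignored, the loop should wake on every iteration without ever sleeping, which is what shows up as a fully busy core. With hang-ups handled, it should wake once and then stop watching the descriptor.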

[image: process_hacker_high_cpu_usage]

[image: wsapoll_loop]

petrutlucian94 commented 1 year ago

The Ceph PR has merged, so I'll go ahead and close this issue. Feel free to reopen it if you still notice unexpectedly high CPU usage.

pcuzner commented 1 year ago

Thanks @petrutlucian94