Closed pcuzner closed 1 year ago
I think I saw the same issue yesterday while running driver certification tests. In my case, one of the three OSDs crashed right before that (it just ran out of disk space).
I'll reproduce the issue and attach a profiler, Windows Performance Recorder, which should help us pinpoint the problem.
Thanks again for the report.
Good news: I've reproduced the issue, and it seems to be caused by the Ceph network disconnect procedure, potentially related to the new poll driver. It's calling PollDriver::event_wait indefinitely, so most probably a return code isn't checked properly: https://github.com/ceph/ceph/blob/26f55e2064f5be1332e5793dc2418f0a43438bb1/src/msg/async/EventPoll.cc#L160-L191
LE: I can confirm that it's caused by the poll driver; switching to the select driver fixes the issue: https://github.com/ceph/ceph/blob/510284b66513490445619d1430aa869868c71a09/src/msg/async/Event.cc#L130-L133. I'll still need to fix the poll event driver, though, since it was introduced to avoid the select limitations.
LLE: I've submitted a fix and also applied it to our downstream fork. New MSIs containing the fix should be available by tomorrow.
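For anyone following along, here is a rough sketch of the failure mode described above: a poll loop spins when an error or hang-up on a descriptor is never handled, because every subsequent poll call returns immediately. This is not the Ceph code or the actual fix. `wait_once` is a hypothetical helper, and it uses POSIX `poll` rather than the Windows API purely for illustration.

```cpp
#include <poll.h>
#include <unistd.h>
#include <cerrno>
#include <vector>

// Hypothetical helper (not Ceph's PollDriver): one iteration of a poll-based wait.
// Returns the number of descriptors reporting POLLIN, or -1 if poll() failed
// with anything other than EINTR. Descriptors that report an error or hang-up
// are removed from `fds`; leaving them in would make every later poll() call
// return immediately, which turns the caller's loop into a busy loop.
int wait_once(std::vector<pollfd>& fds, int timeout_ms) {
    int rc = ::poll(fds.data(), static_cast<nfds_t>(fds.size()), timeout_ms);
    if (rc < 0) {
        // EINTR is harmless: report "nothing ready" so the caller can retry.
        return errno == EINTR ? 0 : -1;
    }

    int readable = 0;
    for (auto it = fds.begin(); it != fds.end();) {
        if (it->revents & POLLIN) {
            ++readable;
        }
        if (it->revents & (POLLERR | POLLHUP | POLLNVAL)) {
            // Stop watching a dead descriptor instead of polling it forever.
            it = fds.erase(it);
        } else {
            ++it;
        }
    }
    return readable;
}
```

With a pipe whose write end has been closed, the first call reports the hang-up and drops the descriptor, so the next call waits for the timeout instead of returning right away.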
The Ceph PR has been merged, so I'll go ahead and close this issue. Feel free to reopen it if you still notice unexpectedly high CPU usage.
Thanks @petrutlucian94
I'm running inside libvirt-based VMs and noticed that after my test workload completed, the CPU in the W2K19 server was still busy and never returned to nominal levels (it's basically consuming a full core all the time).
Here's the job
And here's a trace through perfmon which shows the period of I/O in blue, and the CPU usage of the rbd-wnbd#1 process in red.
This is the version info
Any ideas?