Closed Yan-waller closed 7 years ago
Thank you; I am looking into it.
Hi Eric, we can reproduce it; the logs are as follows:
I'm trying to reproduce it; I haven't been able to so far.
In your 2016-09-07 run, you said you changed the source code to force osd_op_queue to mclock_opclass. But the conf.yaml shows that it was changed to debug_random for the test. It could have chosen mclock_opclass or it could have chosen one of the other op queues. Are the OSD logs available to me? I'm unfamiliar with daisycloud.org.
You can see some recent runs here:
Thanks!
Yes, here is another test job, 443, that reproduced it; the relevant OSD log file is attached:
Thanks. daisycloud.org is just a public site that we use to upload and post our internal teuthology testing results.
Thank you. I've tried recreating the error on teuthology and on my desktop using the same parameters that you seem to be working on, and so far have not been able to.
One thing I want to verify you're aware of: your modification of src/common/config_opts.h does not guarantee that the mclock_opclass queue will be used; it only sets the default value. But the test you listed:
description: rados/thrash-erasure-code/{leveldb.yaml rados.yaml clusters/{fixed-2.yaml
openstack.yaml} fs/btrfs.yaml msgr-failures/fastclose.yaml thrashers/default.yaml
workloads/ec-rados-plugin=jerasure-k=2-m=1.yaml}
The rados.yaml file overrides that setting to debug_random, which could result in an mclock op queue or one of the other ones. I did verify that it was the mclock_client queue in the OSD log you included in your previous post.
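For context, the override in question looks roughly like the sketch below. This is paraphrased from memory of the ceph-qa-suite rados.yaml, so treat the exact keys and nesting as an assumption; the point is that a suite-level override wins over the compiled-in default from src/common/config_opts.h:

```yaml
# Sketch of a teuthology suite override (keys are an assumption,
# not copied from the actual rados.yaml):
overrides:
  ceph:
    conf:
      osd:
        osd op queue: debug_random   # picks a queue at random per run
```

With debug_random in effect, each run may land on a different op queue implementation, which is why a failure tied to one queue can appear intermittent across jobs.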
Do you have a sense of how often this error occurs with an mclock queue? Is it consistent or rare?
Again, thank you for all of your help.
It's consistent in our environment; I can reproduce it 100% of the time. My testing environment is Ubuntu 14.04, and the teuthology command is as below:
teuthology@scheduler:~$ teuthology-suite --suite rados --limit 200 --ceph wip_dmclock2 --suite-branch master -m plana -d ubuntu --suite-dir /home/foo1/src/ceph-qa-suite_master --p 100
It regularly occurs in the test jobs mentioned above or in other ones related to the ceph_test_rados testcase.
Please let me know if there is something else I can do.
By the way, what's the difference between mclock_opclass and mclock_client?
Thanks!
OK, I'm trying it again specifying both plana and ubuntu.
mclock_opclass keeps five queues depending on the operation class -- client, osd sub op, snaptrim, recovery, and scrub.
With mclock_client, we also keep a separate queue for each client, so we can use the algorithm to impose fairness between clients.
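For anyone following along, the queue implementation is chosen with the osd_op_queue option. A minimal ceph.conf sketch (the listed values are the ones discussed in this thread; check your Ceph version's documentation for the complete set):

```ini
# ceph.conf sketch: pin the OSDs to one op queue instead of debug_random
[osd]
osd op queue = mclock_client
; other values seen in this thread: mclock_opclass, debug_random
```

Pinning the queue this way avoids the per-run randomness of debug_random when trying to reproduce a queue-specific bug.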
I'd forgotten that the sepia lab's planas were decommissioned late last year.
Yan-waller, would you mind running your tests on branch wip_dmc2_ooo_instrum?
It has added instrumentation, and if the osd op queue is set to debug_random it will force it to mclock_client.
Thank you!
Hi Eric, I have run the tests on branch wip_dmc2_ooo_instrum; it reproduced, and the results are posted on http://www.daisycloud.org:9091/ as follows:
There are 43 failed test jobs, most of them related to this problem. The OSD log files of failed test jobs 220 and 243:
Thank you. I'm looking at the results. If you don't mind my asking, who do you work for and where is your teuthology lab?
I work for ZTE Corporation. We have built a local teuthology lab in our work room, which is kept in sync with the Ceph developer community.
Hi Yan-Waller, I've made some additional changes in wip_dmc2_ooo_instrum. Would you mind re-testing that branch?
I can see where the operations are being queued out of order. I now need to find where that's happening. Unfortunately there are over 50 calls to requeue_ops, so I've added some messaging to track down which call to focus on. Thank you!
OK, I've run this branch and the results are as follows:
Hi Eric,
I see that DO_NOT_DELAY_TAG_CALC is not defined by default, which means we delay calculating the tag at dmclock_server.h:795? According to README.md, it's an optimization/fix to the published algorithm. Is this just for better performance, or something else?
Thank you!
Hi Yan-waller,
It's for more even performance. When a request reaches the front of the queue for that client, the rho and delta values that were sent with it can be very old. So we use the most recent rho and delta values received from that client. That made the iops more stable over time. Without that the iops for a client would increase then decrease then increase and so forth.
This problem and fix was discovered and proposed by Byung-Su Park and Taewoong Kim from SK Telecom.
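The effect can be illustrated with a toy sketch. This is not the dmclock code; the function, rates, and numbers are invented for illustration. A dmClock-style reservation tag advances past the previous tag by rho divided by the client's reservation rate, so computing the tag at dequeue time with the freshest rho (delayed tag calculation) yields a very different, steadier tag than using a stale rho recorded at enqueue time:

```python
def reservation_tag(prev_tag, arrival, rho, reservation):
    """Simplified dmClock-style reservation tag (toy model, not dmclock):
    each of the `rho` ops the client was served elsewhere pushes the
    tag forward by 1/reservation seconds past the previous tag."""
    return max(arrival, prev_tag + rho / reservation)

# Toy numbers: client reserved 100 IOPS, previous tag at t=0,
# request arrives at t=0.01.
reservation = 100.0
stale = reservation_tag(0.0, 0.01, rho=50, reservation=reservation)  # 0.5
fresh = reservation_tag(0.0, 0.01, rho=5, reservation=reservation)   # 0.05
print(stale, fresh)
```

Using the old rho (50) schedules the request half a second out, while the freshest rho (5) keeps the spacing tight; recomputing tags with current rho/delta is what smooths the per-client iops over time, as described above.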
Oh, I see; that's a great fix. Thanks.
Hi Yan-Waller, Thank you again for running with the added instrumentation. By looking at the logs you generated I believe I've diagnosed the issue and resolved it. But since I'm unable to reproduce the bug, would you please check the current version of the wip_dmc2_ooo_instrum branch of ceph? Thank you!!
Hi Eric, I have run the latest wip_dmc2_ooo_instrum branch these two days, and no error was found as expected. It seems that this issue has been resolved. Thank you!
Thank you for bringing my attention to the issue and for running the tests that allowed me to resolve it.
@ivancich Hi, we saw errors in the qa testing (as follows) that seem to be out-of-order ops. Could you take a look at this? The branch: https://github.com/ceph/ceph/tree/wip_dmclock2 and the testcase:
I switched the default op_queue to mclock_opclass in src/common/config_opts.h