ceph / dmclock

Code that implements the dmclock distributed quality of service algorithm. See "mClock: Handling Throughput Variability for Hypervisor IO Scheduling" by Gulati, Merchant, and Varman.
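For background, the dmClock variant of the algorithm assigns each request a reservation, weight, and limit tag, adjusting mClock's formulas by the rho and delta counters a client reports with each request. Below is a minimal sketch of the tag update as described in the paper, using illustrative names rather than this repository's actual types:

```cpp
#include <algorithm>

// Illustrative sketch of the dmClock tag update from the paper;
// these are not the types used in this repository.
struct ClientTags {
  double reservation; // r_i: reserved IOPS
  double weight;      // w_i: proportional share
  double limit;       // l_i: IOPS cap
  double r_tag = 0.0, p_tag = 0.0, l_tag = 0.0;
};

// rho: requests this client completed under reservation at *other*
// servers since its last request here; delta: all requests it completed
// elsewhere. In the single-server mClock case both are 1.
void assign_tags(ClientTags& c, double now, unsigned rho, unsigned delta) {
  c.r_tag = std::max(c.r_tag + rho   / c.reservation, now);
  c.p_tag = std::max(c.p_tag + delta / c.weight,      now);
  c.l_tag = std::max(c.l_tag + delta / c.limit,       now);
}
```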

There are "out of order op" errors while testing qa suites #20

Closed. Yan-waller closed this issue 7 years ago.

Yan-waller commented 8 years ago

@ivancich Hi, we saw errors in QA testing that appear to be out-of-order ops. Could you take a look at the following:

2016-09-07T13:41:05.288 INFO:tasks.rados.rados.0.plana133.stdout:2473:  finishing write tid 4 to ---plana1338849-11 
2016-09-07T13:41:05.291 INFO:tasks.rados.rados.0.plana133.stderr:Error: finished tid 4 when last_acked_tid was 5
2016-09-07T13:41:05.293 INFO:teuthology.orchestra.run.plana133:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd unset noscrub'
2016-09-07T13:41:05.303 INFO:tasks.ceph.osd.4.plana138.stderr:2016-09-07 13:41:05.298396 7ff185546700 -1 received  signal: Hangup from  PID: 10335 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 4  UID: 0
2016-09-07T13:41:05.410 INFO:tasks.ceph.osd.2.plana133.stderr:2016-09-07 13:41:05.398355 7fd7d2de1700 -1 received  signal: Hangup from  PID: 7952 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 2  UID: 0
2016-09-07T13:41:05.501 INFO:tasks.ceph.osd.4.plana138.stderr:2016-09-07 13:41:05.499323 7ff185546700 -1 received  signal: Hangup from  PID: 10335 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 4  UID: 0
2016-09-07T13:41:05.606 INFO:tasks.ceph.osd.2.plana133.stderr:2016-09-07 13:41:05.605954 7fd7d2de1700 -1 received  signal: Hangup from  PID: 7952 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 2  UID: 0
2016-09-07T13:41:05.644 INFO:teuthology.orchestra.run.plana138:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok dump_ops_in_flight'
2016-09-07T13:41:05.702 INFO:tasks.ceph.osd.3.plana138.stderr:2016-09-07 13:41:05.702094 7f0d1cd2b700 -1 received  signal: Hangup from  PID: 28456 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 3  UID: 0
2016-09-07T13:41:05.706 INFO:tasks.rados.rados.0.plana133.stderr:/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-2034-gd9ad852/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f2e2f7fe700 time 2016-09-07 13:41:05.250253
2016-09-07T13:41:05.709 INFO:tasks.rados.rados.0.plana133.stderr:/srv/autobuild-ceph/gitbuilder.git/build/out~/ceph-11.0.0-2034-gd9ad852/src/test/osd/RadosModel.h: 866: FAILED assert(0)
2016-09-07T13:41:05.722 INFO:tasks.rados.rados.0.plana133.stderr: ceph version v11.0.0-2034-gd9ad852 (d9ad852b9b7f54c4993b63f6651063b479c4de2c)
2016-09-07T13:41:05.724 INFO:tasks.rados.rados.0.plana133.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x56002930651b]
2016-09-07T13:41:05.727 INFO:tasks.rados.rados.0.plana133.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x4b8) [0x5600292f3238]
2016-09-07T13:41:05.728 INFO:tasks.rados.rados.0.plana133.stderr: 3: (write_callback(void*, void*)+0x19) [0x560029305a89]
2016-09-07T13:41:05.730 INFO:tasks.rados.rados.0.plana133.stderr: 4: (librados::C_AioSafe::finish(int)+0x1d) [0x7f2e4d2d292d]
2016-09-07T13:41:05.731 INFO:tasks.rados.rados.0.plana133.stderr: 5: (Context::complete(int)+0x9) [0x7f2e4d2b68b9]
2016-09-07T13:41:05.733 INFO:tasks.rados.rados.0.plana133.stderr: 6: (()+0x142cd6) [0x7f2e4d363cd6]
2016-09-07T13:41:05.734 INFO:tasks.rados.rados.0.plana133.stderr: 7: (()+0x8182) [0x7f2e4c7d0182]
2016-09-07T13:41:05.736 INFO:tasks.rados.rados.0.plana133.stderr: 8: (clone()+0x6d) [0x7f2e4b56247d]
2016-09-07T13:41:05.738 INFO:tasks.rados.rados.0.plana133.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-09-07T13:41:05.803 INFO:tasks.ceph.osd.2.plana133.stderr:2016-09-07 13:41:05.802800 7fd7d2de1700 -1 received  signal: Hangup from  PID: 7952 task name: /usr/bin/python /usr/bin/daemon-helper kill ceph-osd -f --cluster ceph -i 2  UID: 0

the branch: https://github.com/ceph/ceph/tree/wip_dmclock2

and the test case:

description: rados/thrash-erasure-code/{leveldb.yaml rados.yaml clusters/{fixed-2.yaml
    openstack.yaml} fs/btrfs.yaml msgr-failures/fastclose.yaml thrashers/default.yaml
    workloads/ec-rados-plugin=jerasure-k=2-m=1.yaml}

I switched the default op_queue to mclock_opclass in src/common/config_opts.h:

-OPTION(osd_op_queue, OPT_STR, "wpq")
+OPTION(osd_op_queue, OPT_STR, "mclock_opclass") 
ivancich commented 8 years ago

Thank you; I am looking into it.

Yan-waller commented 8 years ago

Hi Eric, we can reproduce it; the logs are as follows:

ivancich commented 8 years ago

I'm trying to reproduce it; I haven't been able to so far.

In your 2016-09-07 run, you said you changed the source code to force osd_op_queue to mclock_opclass. But the conf.yaml shows that it was changed to debug_random for the test. It could have chosen mclock_opclass or it could have chosen one of the other op queues. Are the OSD logs available to me? I'm unfamiliar with daisycloud.org.

You can see some recent runs here:

http://pulpito.ceph.com/ivancich-2016-09-09_18:24:23-dmclock:rados:thrash-erasure-code-wip_dmclock2---basic-smithi

http://pulpito.ceph.com/ivancich-2016-09-09_18:24:20-dmclock:rados:thrash-erasure-code-wip_dmc2_mod_weights---basic-smithi

Thanks!

Yan-waller commented 8 years ago

Yes, here is another test job, 443, that reproduced it; the relevant OSD log file is attached:

Thanks. daisycloud.org is just a public site that we use to upload and share our internal teuthology test results.

ivancich commented 8 years ago

Thank you. I've tried recreating the error on teuthology and on my desktop using the same parameters you appear to be using, and so far have not been able to.

One thing I want to verify you're aware of: your modification of src/common/config_opts.h does not guarantee that the mclock_opclass queue will be used; it only sets the default value. But the test you listed:

description: rados/thrash-erasure-code/{leveldb.yaml rados.yaml clusters/{fixed-2.yaml
    openstack.yaml} fs/btrfs.yaml msgr-failures/fastclose.yaml thrashers/default.yaml
    workloads/ec-rados-plugin=jerasure-k=2-m=1.yaml}

The rados.yaml file overrides that setting to debug_random, which could result in an mclock op queue or one of the other ones. I did verify that it was the mclock_client queue in the OSD log you included in your previous post.

Do you have a sense of how often this error occurs with an mclock queue? Is it consistent or rare?

Again, thank you for all of your help.

Yan-waller commented 8 years ago

It's consistent in our environment; I can reproduce it 100% of the time.

My test environment is Ubuntu 14.04, and the teuthology command is as follows:

teuthology@scheduler:~$ teuthology-suite --suite rados --limit 200  --ceph wip_dmclock2 --suite-branch master -m plana -d ubuntu --suite-dir /home/foo1/src/ceph-qa-suite_master --p 100

It occurs regularly in the test jobs mentioned above, or in other ones related to the ceph_test_rados test case.

Please let me know if there is something else I can do. By the way, what's the difference between mclock_opclass and mclock_client? Thanks!

ivancich commented 8 years ago

OK, I'm trying it again specifying both plana and ubuntu.

mclock_opclass keeps five queues, one per operation class: client, osd sub op, snaptrim, recovery, and scrub.

With mclock_client, we also keep a separate queue for each client, so we can use the algorithm to impose fairness between clients.
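As a rough illustration of the difference, here is a hypothetical sketch of how the two strategies might key their queues; the real keys live in Ceph's op queue implementations, not in these types:

```cpp
#include <cstdint>
#include <tuple>

// Illustrative sketch only; the actual keying lives in Ceph's
// mclock_opclass and mclock_client op queue implementations.
enum class OpClass { client_op, osd_subop, snaptrim, recovery, scrub };

// mclock_opclass: one dmclock queue per operation class.
using OpClassKey = OpClass;

// mclock_client: client ops are further keyed by client id, giving each
// client its own tag stream so the algorithm can impose fairness
// between individual clients.
struct ClientQueueKey {
  OpClass cls;
  uint64_t client_id; // only meaningful for client ops
  bool operator<(const ClientQueueKey& o) const {
    return std::tie(cls, client_id) < std::tie(o.cls, o.client_id);
  }
};
```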

ivancich commented 8 years ago

I'd forgotten that the sepia lab's planas were decommissioned late last year.

ivancich commented 8 years ago

Yan-waller, would you mind running your tests on branch wip_dmc2_ooo_instrum?

It adds instrumentation, and if the osd op queue is set to debug_random, it will force it to mclock_client.

Thank you!

Yan-waller commented 8 years ago

Hi Eric, I have run the tests on branch wip_dmc2_ooo_instrum. The problem reproduced, and the results are posted on http://www.daisycloud.org:9091/ as follows:

There are 43 failed test jobs, most of them related to this problem. The OSD log files of failed test jobs 220 and 243 are attached:

ivancich commented 8 years ago

Thank you. I'm looking at the results. If you don't mind my asking, who do you work for and where is your teuthology lab?

Yan-waller commented 8 years ago

I work for ZTE Corporation. We have built a local teuthology lab in our workroom, which is kept in sync with the Ceph developer community.

ivancich commented 8 years ago

Hi Yan-Waller, I've made some additional changes in wip_dmc2_ooo_instrum. Would you mind re-testing that branch?

I can see that the operations are being queued out of order; I now need to find where that's happening. Unfortunately there are over 50 calls to requeue_ops, so I've added some messaging to track down which call to focus on. Thank you!

Yan-waller commented 8 years ago

OK, I've run this branch, and the results are as follows:

Yan-waller commented 8 years ago

Hi Eric, I see that DO_NOT_DELAY_TAG_CALC is not defined by default, which means we delay calculating the tag at dmclock_server.h:795? According to README.md it's an optimization/fix to the published algorithm. Is this just for better performance, or something else?

Thank you!

ivancich commented 8 years ago

Hi Yan-waller,

It's for more even performance. When a request reaches the front of its client's queue, the rho and delta values that were sent with it can be very old, so we instead use the most recent rho and delta values received from that client (see the sketch below). That made the IOPS more stable over time; without it, a client's IOPS would repeatedly rise and fall.

This problem and fix was discovered and proposed by Byung-Su Park and Taewoong Kim from SK Telecom.
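A rough sketch of the idea, using hypothetical names (the actual logic lives in the tag handling in dmclock_server.h):

```cpp
#include <algorithm>

// Hypothetical sketch of delayed tag calculation; the real code is in
// dmclock_server.h. Without DO_NOT_DELAY_TAG_CALC defined, a request's
// tags are not frozen at enqueue time.
struct ClientRec {
  double reservation, weight;                // r_i, w_i
  double r_tag = 0.0, p_tag = 0.0;           // previous tags
  unsigned latest_rho = 1, latest_delta = 1; // newest counters seen
};

// Called when a request reaches the front of its client's queue. Using
// the client's most recent rho/delta instead of the possibly stale
// values the request carried keeps the computed tags, and therefore the
// client's IOPS, from oscillating over time.
void tag_at_front(ClientRec& c, double now) {
  c.r_tag = std::max(c.r_tag + c.latest_rho   / c.reservation, now);
  c.p_tag = std::max(c.p_tag + c.latest_delta / c.weight,      now);
}
```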

Yan-waller commented 8 years ago

Oh, I see. That's a great fix, thanks.

ivancich commented 7 years ago

Hi Yan-Waller, Thank you again for running with the added instrumentation. By looking at the logs you generated I believe I've diagnosed the issue and resolved it. But since I'm unable to reproduce the bug, would you please check the current version of the wip_dmc2_ooo_instrum branch of ceph? Thank you!!

Yan-waller commented 7 years ago

Hi Eric, I have run the latest wip_dmc2_ooo_instrum branch over the last two days, and as expected no errors were found. It seems this issue has been resolved. Thank you!

ivancich commented 7 years ago

Thank you for bringing my attention to the issue and for running the tests that allowed me to resolve it.