Closed jnse closed 1 year ago
Hi jnse,
Thanks for sharing the PR.

[Concerns]
Since kernel v5.9, there are conditions where the kernel checks whether `disk->fops->submit_bio` exists in order to route the bio to the multi-queue logic. For example, a SCSI disk's `submit_bio` is initially `NULL`, which sends it down the multi-queue path in the kernel.

[What dattobd does]
dattobd shallow-copies `block_device->gendisk` and uses `dattobd_null_mrf` to mock the `fops->submit_bio` pointer. (I saw you finally call the global `submit_bio`.)

[Questions]
1. Do we still have a chance to go into `__submit_bio_noacct_mq` after you give `bio->bi_disk->fops->submit_bio` existence as `dattobd_null_mrf`?
2. Is `dattobd_null_mrf` able to handle the situation where `bio->bi_disk->fops->submit_bio` should have been checked as `NULL`?
```c
blk_qc_t submit_bio_noacct(struct bio *bio)
{
        ...
        if (!bio->bi_disk->fops->submit_bio)
                return __submit_bio_noacct_mq(bio);
        return __submit_bio_noacct(bio);
}
```
[Reference] https://elixir.bootlin.com/linux/v5.9/source/block/blk-core.c
Best Regards, Nick
Yes, if the gendisk's `fops->submit_bio` is NULL, the kernel logic will invoke `blk_mq_submit_bio`, so it is safe for us to just pass the NULL through. (Reference: https://elixir.bootlin.com/linux/v5.9.16/source/block/blk-core.c#L1084)
It so happens that most of the testing we did on 5.9 exercised this exact situation (5.9 with SCSI disks). During debug stepping, `fops->submit_bio` was indeed NULL as expected, and the I/O was being correctly submitted to the real disks after dattobd was done with it and we swapped the gendisk back to the original.
And (maybe I should have started with this): at the point where we submit the I/O inside `dattobd_null_mrf`, the gendisk's `->fops->submit_bio` is 0, not `dattobd_null_mrf`:
```
#0  dattobd_null_mrf (bio=0xffff88848f867e40) at /home/johnsennesael/src/dattobd/src/mrf.c:70
#1  0xffffffffa00078e8 in dattobd_submit_bio_real (dev=0xffff88848e2e0800, bio=0xffff88848f867e40) at /home/johnsennesael/src/dattobd/src/submit_bio.c:35
#2  0xffffffffa00099fe in inc_trace_bio (dev=0xffff88848e2e0800, bio=0xffff88848f867e40) at /home/johnsennesael/src/dattobd/src/tracer.c:347
#3  0xffffffffa000b21e in tracing_fn (bio=0xffff88848f867e40) at /home/johnsennesael/src/dattobd/src/tracer.c:1343
...
(gdb) p bio->bi_disk->fops->submit_bio
$2 = (blk_qc_t (*)(struct bio *)) 0x0
```
This is because `dattobd_submit_bio_real` swaps back the original saved `submit_bio` function pointer (which, in the case of SCSI disks on 5.9, is 0 as you pointed out).
Hi @jnse
I want to discuss the shallow copy of the `gendisk` and its `fops`. Do we also need to shallow copy `gendisk->queue`?
(I have no right to review or make comments, so maybe my explanation is unclear.)
[Discussion]
In `__tracer_should_reset_mrf`, `bdev_get_queue` is used to make sure the device being destroyed is identified by its queue address (it matches on `queue` to ensure that). But we only "shallow copy" the `gendisk` and `fops`, not the `queue`, which results in the copied gendisks of `sda1`, `sda2`, `sda3`, and `sda4` all holding the same queue address if we snapshot them at the same time.
Do we also need to shallow copy `gendisk->queue`, given that `__tracer_should_reset_mrf->bdev_get_queue` would otherwise match the same `queue` address?
https://github.com/datto/dattobd/blob/63c584b6a540593a006fb08cfcbc3b951cf919ce/src/tracer.c#L1477
[Example]
Take `sda1` and `sda2`:

```
sda1 ----X----> sda -> original_queue
         |
         +----> copy_sda#1 -> original_queue
sda2 ----X----> sda -> original_queue
         |
         +----> copy_sda#2 -> original_queue
```

1. `dbdctl destroy <sda2's minor>` triggers `__tracer_should_reset_mrf` to check whether `sda2` can be destroyed.
2. `tracer_for_each` in `__tracer_should_reset_mrf` traverses to `sda1`. Since `sda2->copy_sda#2->queue == sda1->copy_sda#1->queue`, `__tracer_should_reset_mrf` returns 0 and the mrf for `sda2` is not reset.

Best Regards, Nick
> (I have no right to review or make comments, so maybe my explanation is unclear.)
No, please review and comment! All help is always welcome.
> For example, I snapshotted `sda1` and `sda2`, and then destroyed one of them.
Yes, I think this could be a potential issue, since the gendisk exists per physical device (not per partition). However, we also have to be very careful with having multiple queues, because that could introduce races or duplicated bio submission. This is similar to the problem that already exists, as @dakotarwilliams pointed out in #276 - IIRC the solution I discussed with dakota involved adding reference counting to the gendisk.
@nickchen-cpu FYI - I pushed up some changes just now in an attempt to address some of the problems with multiple partitions.
1) I cherry-picked @dakotarwilliams's changes from his branch, which implement reference counting for gendisks/mrfs using a hashtable, with minor changes to make it compile in the submit_bio case: https://github.com/jnse/dattobd/commit/080b34be210a7e94734e7a4d93b42dddfa7a5ea0
2) I realized there is a potential problem with swapping out the gendisk in the case of multiple partitions: we risk doing the swap while we're in the middle of handling I/O for another block device on the same disk. In an attempt to mitigate that, I piggy-backed on dakota's hashtable accounting struct and added a mutex there that we can lock whenever we're swapping the gendisk or submitting I/O to the original block device: https://github.com/jnse/dattobd/commit/fb23dc6d6072bdeeeea631e7456d7d0a0411e887
3) There are some debug prints in there that will be deleted later, because:
In spite of all the above, when I test this on my test VM, I still get lock-ups when issuing writes to a second partition.
I'll be out of office for 2 weeks, so I'm pushing all this up for now to allow @dakotarwilliams to work on it.
Closing, changes are included in #288.
Add support for submit_bio tracing in addition to make_request_fn in order to make kernels >= 5.9 work. This is done by keeping track of a tracing gendisk and a 'real' gendisk in order to avoid a symbol lookup for blk_mq_submit_bio.