datto / dattobd

kernel module for taking block-level snapshots and incremental backups of Linux block devices
GNU General Public License v2.0

Add support for submit_bio to support newer kernels #280

Closed. jnse closed this 1 year ago.

jnse commented 2 years ago

Add support for submit_bio tracing in addition to make_request_fn in order to make kernels >= 5.9 work. This is done by keeping track of both a tracing gendisk and the 'real' gendisk, in order to avoid a symbol lookup for blk_mq_submit_bio.
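Conceptually, the swap amounts to something like the following (a simplified sketch with hypothetical names such as tracing_disk and install_tracing_gendisk, not the exact PR code):

#include <linux/blkdev.h>
#include <linux/genhd.h>

/* Keep a shallow copy of the disk's gendisk whose fops->submit_bio points at
 * the tracing entry point (roughly the role the PR's tracing_fn plays), and
 * remember the original "real" gendisk so bios can later be handed back
 * without ever resolving blk_mq_submit_bio. */
struct tracing_disk {
	struct gendisk *real_gd;              /* the device's original gendisk     */
	struct gendisk trace_gd;              /* shallow copy used while tracing   */
	struct block_device_operations fops;  /* fops copy with submit_bio swapped */
};

static blk_qc_t tracing_submit_bio(struct bio *bio);  /* the module's tracing hook */

static void install_tracing_gendisk(struct block_device *bdev, struct tracing_disk *td)
{
	td->real_gd = bdev->bd_disk;

	/* shallow-copy the gendisk and its fops, then redirect submit_bio */
	td->trace_gd = *td->real_gd;
	td->fops = *td->real_gd->fops;
	td->fops.submit_bio = tracing_submit_bio;
	td->trace_gd.fops = &td->fops;

	/* point the block device at the tracing copy */
	bdev->bd_disk = &td->trace_gd;
}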

nickchen-cpu commented 2 years ago

Hi jnse

Thanks for sharing the PR.

[Concerns]

  1. Since kernel v5.9, the kernel checks whether disk->fops->submit_bio exists in order to route bios to the multi-queue logic.

  2. For example, a SCSI disk's submit_bio is initially NULL, which routes its bios to the multi-queue logic in the kernel.

[What dattobd does]

  1. Intercept block_device->gendisk
  2. Use dattobd_null_mrf to mock the fops->submit_bio pointer. (I saw you eventually call the global submit_bio.)

[Questions]

  1. Do we still have a chance to reach __submit_bio_noacct_mq after you make bio->bi_disk->fops->submit_bio non-NULL by pointing it at dattobd_null_mrf?

  2. Is dattobd_null_mrf able to handle the situation where bio->bi_disk->fops->submit_bio should have been NULL?

blk_qc_t submit_bio_noacct(struct bio *bio)
{
...
...
    if (!bio->bi_disk->fops->submit_bio)  
        return __submit_bio_noacct_mq(bio);
    return __submit_bio_noacct(bio);
}

[Reference] https://elixir.bootlin.com/linux/v5.9/source/block/blk-core.c

Best Regards, Nick

jnse commented 2 years ago

Yes, if the gendisk's fops->submit_bio is NULL, the logic in the kernel will invoke blk_mq_submit_bio, so it is safe for us to just pass the NULL through. (Reference: https://elixir.bootlin.com/linux/v5.9.16/source/block/blk-core.c#L1084 )

It so happens that most of the testing we did on 5.9 covered this exact situation (5.9 with SCSI disks). During debug stepping, fops->submit_bio was indeed NULL as expected, and the I/O was correctly submitted to the real disks after dattobd is done with it and we swap the gendisk back to the original.

And (maybe I should have started with this): at the point where we submit the I/O inside dattobd_null_mrf, the gendisk's ->fops->submit_bio is 0, not dattobd_null_mrf:

#0  dattobd_null_mrf (bio=0xffff88848f867e40) at /home/johnsennesael/src/dattobd/src/mrf.c:70
#1  0xffffffffa00078e8 in dattobd_submit_bio_real (dev=0xffff88848e2e0800, bio=0xffff88848f867e40) at /home/johnsennesael/src/dattobd/src/submit_bio.c:35
#2  0xffffffffa00099fe in inc_trace_bio (dev=0xffff88848e2e0800, bio=0xffff88848f867e40) at /home/johnsennesael/src/dattobd/src/tracer.c:347
#3  0xffffffffa000b21e in tracing_fn (bio=0xffff88848f867e40) at /home/johnsennesael/src/dattobd/src/tracer.c:1343
...
(gdb) p bio->bi_disk->fops->submit_bio
$2 = (blk_qc_t (*)(struct bio *)) 0x0

This is because dattobd_submit_bio_real swaps back the original saved submit_bio function pointer (which, in the case of SCSI disks on 5.9, is 0 as you pointed out).
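In other words, the swap-back amounts to roughly this (a simplified sketch, not the literal code in submit_bio.c or mrf.c):

/* Point the bio back at the saved original gendisk and resubmit it.  On
 * v5.9, if the original fops->submit_bio is NULL (as with SCSI disks),
 * submit_bio_noacct() takes the __submit_bio_noacct_mq() path shown above. */
static blk_qc_t resubmit_to_real_disk(struct bio *bio, struct gendisk *orig_gd)
{
	bio->bi_disk = orig_gd;         /* restore the original gendisk */
	return submit_bio_noacct(bio);  /* kernel dispatch picks the mq path */
}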

nickchen-cpu commented 2 years ago

Hi @jnse

I want to discuss the shallow copy of the gendisk and fops. Do we also need to shallow copy gendisk->queue? (I don't have permission to review or comment on the PR itself, so my explanation here may be unclear.)

[Discussion]

In __tracer_should_reset_mrf, bdev_get_queue is used to decide whether the given device can be destroyed, by matching on the queue address. But we only "shallow copy" the gendisk and fops, not the queue, so if we snapshot sda1, sda2, sda3, and sda4 at the same time, their copied gendisks all end up with the same queue address.

Do we also need to shallow copy gendisk->queue, to avoid matching the same queue address in __tracer_should_reset_mrf via bdev_get_queue?

https://github.com/datto/dattobd/blob/63c584b6a540593a006fb08cfcbc3b951cf919ce/src/tracer.c#L1477
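Roughly, the check I'm referring to looks like this (a simplified sketch of the linked tracer.c logic; struct, field, and macro names are approximations):

static int should_reset_mrf_sketch(const struct snap_device *dev)
{
	int i;
	struct snap_device *cur;
	struct request_queue *q = bdev_get_queue(dev->sd_base_dev);

	/* if any other traced device still maps to the same request queue,
	 * report "do not reset" */
	tracer_for_each(cur, i) {
		if (!cur || cur == dev)
			continue;
		if (bdev_get_queue(cur->sd_base_dev) == q)
			return 0;
	}
	return 1;
}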

[Example]

  1. Snapshot sda1 and sda2:
sda1 ----X---->sda->original_queue
    |
    ---->copy_sda#1->original_queue

sda2 ----X---->sda->original_queue
    |
    ---->copy_sda#2->original_queue
  2. sbdctl destroy <sda2's minor>
  3. Go to __tracer_should_reset_mrf to check whether sda2 can be destroyed.
  4. When tracer_for_each in __tracer_should_reset_mrf traverses to sda1, sda2's copy_sda#2->queue == sda1's copy_sda#1->queue, so __tracer_should_reset_mrf returns 0.
  5. As a result, destroying sda2 fails.

Best Regards, Nick

jnse commented 2 years ago

> (I don't have permission to review or comment on the PR itself, so my explanation here may be unclear.)

No, please review and comment! All help is always welcome.

> For example, snapshot sda1 and sda2, and then destroy one of them.

Yes, I think this could be a potential issue, since the gendisk exists per physical device (not per partition). However, we also have to be very careful with having multiple queues, because that could introduce races or duplicated bio submission. This is similar to the problem that already exists, as @dakotarwilliams pointed out in #276 - IIRC the solution I discussed with dakota involved adding reference counting to the gendisk.
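For illustration only, one possible shape for that reference counting (these names are hypothetical, not the actual code in dakota's branch):

#include <linux/atomic.h>
#include <linux/genhd.h>

/* hypothetical per-disk accounting entry shared by every traced partition
 * of the same physical disk */
struct gendisk_tracking {
	struct gendisk *orig_gd;   /* the disk's original gendisk         */
	struct gendisk *trace_gd;  /* the shallow copy used while tracing */
	atomic_t refcount;         /* traced partitions sharing this disk */
};

/* drop a reference; only when the last traced partition on the disk goes
 * away is it safe to put the original gendisk back in place */
static bool gendisk_tracking_put(struct gendisk_tracking *gt)
{
	return atomic_dec_and_test(&gt->refcount);
}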

jnse commented 2 years ago

@nickchen-cpu FYI - I pushed up some changes just now in an attempt to address some of the problems with multiple partitions.

  1. I cherry-picked @dakotarwilliams's changes from his branch, which implement reference counting for gendisks/mrfs using a hashtable, with minor changes to make it compile in the submit_bio case: https://github.com/jnse/dattobd/commit/080b34be210a7e94734e7a4d93b42dddfa7a5ea0
  2. I realized there is a potential problem with swapping out the gendisk in the case of multiple partitions: we risk doing the swap while we're in the middle of handling I/O for another block device on the same disk. To mitigate that, I piggy-backed on dakota's hashtable accounting struct and added a mutex there that we can lock whenever we're swapping the gendisk or submitting I/O to the original block device: https://github.com/jnse/dattobd/commit/fb23dc6d6072bdeeeea631e7456d7d0a0411e887
  3. There are some debug prints in there that will be deleted later, because:

In spite of all the above, when I test this on my test VM, I still get lock-ups when issuing writes to a second partition.
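(For reference, the guarded swap from point 2 above amounts to roughly this pattern; the struct and field names are illustrative, not the exact code in the linked commits.)

#include <linux/mutex.h>
#include <linux/blkdev.h>

/* one per-disk mutex serializes gendisk swaps and submissions to the
 * original device, so in-flight I/O on a sibling partition never observes
 * a half-swapped gendisk */
struct disk_swap_guard {
	struct mutex lock;
	struct gendisk *orig_gd;
};

static blk_qc_t submit_to_orig_locked(struct disk_swap_guard *g, struct bio *bio)
{
	blk_qc_t ret;

	mutex_lock(&g->lock);
	bio->bi_disk = g->orig_gd;      /* swap the real gendisk back in */
	ret = submit_bio_noacct(bio);   /* hand the bio to the real device */
	mutex_unlock(&g->lock);
	return ret;
}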

I'll be out of office for 2 weeks, so I'm pushing all this up for now to allow @dakotarwilliams to work on it.

dakotarwilliams commented 1 year ago

Closing, changes are included in #288.