datto / dattobd

kernel module for taking block-level snapshots and incremental backups of Linux block devices
GNU General Public License v2.0
561 stars 120 forks source link

When snapshot sda2, does sda2 need to change its mrf if it's already changed? #276

Closed nickchen-cpu closed 1 year ago

nickchen-cpu commented 2 years ago

Hi

Disk conf: sda1, sda2 Steps:

  1. First snapshot sda1;
  2. In __tracer_transition_tracing , sda1->sda->queue->make_request is replaced with tracing_mrf, and so is sda2->sda->queue->make_request;
  3. Snapshot the sda2, here's the point: We still run __tracer_transition_tracing for sda2, which will again: (1) Freeze the sda; (2) Change sda2->sda->queue->make_request to tracing_mrf. but it already is in step 2.

Could we do some precheck to prevent this?

Best Regards, Nick

dakotarwilliams commented 2 years ago

The double assignment isn't a problem. However, when you'd go to destroy those devices we'd certainly have a problem. When we snapshot sda2, we record the original mrf so that we can restore it later; this causes the problem since at this point sda2 original will have been tracing_mrf. If the destroys come in the same sequence as the snapshots, then sda will end up with tracing_mrf as its callback after there are no snap devices associated with it. In summary, the following would happen:

  1. snapshot sda1 - orig_mrf is recorded normally, tracing_mrfis swapped in
  2. snapshot sda2 - orig_mrf is set to tracing_mrf, tracing_mrf is swapped in
  3. destroy sda1 - orig_mrf (which is the actual fn pointer) is swapped back into sda->queue->make_request
  4. destroy sda2 - orig_mrf (which is tracing_mrf is swapped back into sda->queue->make_request

And this small problem gets larger if the module is unloaded!

Good catch! Initial idea for resolving would be a kind of reference count for mrf swaps. We could create a couple of functions gendisk_mrf_get and gendisk_mrf_put and a map that associate gendisk* and orig_mrfs along with how many times they've been accessed.

Finix1979 commented 1 year ago

snapshot sda1 - orig_mrf is recorded normally, tracing_mrfis swapped in snapshot sda2 - orig_mrf is set to tracing_mrf, tracing_mrf is swapped in destroy sda1 - orig_mrf (which is the actual fn pointer) is swapped back into sda->queue->make_request

After sda1 is out of tracing(because make_requset is assigned back to orig_mrf), the tracing function of sda2 is also disabled, because they share the same device queue, right?

Finix1979 commented 1 year ago

Indeed. Thank you @nickchen-cpu