PARTITION_STYLE="msdos" broken on storage with 4k blocks

dpavlin commented 6 years ago

debootstrap on nodes with 4k storage blocks will create invalid partition table which won't boot inside kvm since it expects partition table with 512b blocks.

PARTITION_ALIGNMENT is currently broken and I opted to remove it all together because with 4k blocks it has VERY different behavior if you are using 2048 or 1M as argument. Since this is triggered only when PARTITION_STYLE="msdos" is used, and 1Mb os enough space i decided to hard-code that.

I'm open to suggestions how to fix this better or in cleaner way.

First commit is change which comes from Debian and is required for recent sfdisk versions, I don't know why this repository doesn't have it. It comes from https://sources.debian.org/patches/ganeti-instance-debootstrap/0.16-2.1/

googlebot commented 6 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

dpavlin commented 6 years ago

I signed it!

googlebot commented 6 years ago

So there's good news and bad news.

:thumbsup: The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

:confused: The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

dpavlin commented 6 years ago

Oh, it seems that @apoikos first has to submit his part of change because I can't sign licence agreement for him.

iustin commented 6 years ago

@apoikos - good question, do you have any other pending patches in the Debian packages?

apoikos commented 6 years ago

@iustin: no, that's the only patch left. I'm fine with including it in this PR.

apoikos commented 6 years ago

Of course I mean the only patch left for instance-debootstrap. There are many more for ganeti itself which I'll be upstreaming slowly.

iustin commented 6 years ago

Hmm, I'm not in best position to approve a patch with cla not able to sign it. @apoikos, would you mind much to send your patch separately?

apoikos commented 6 years ago

@iustin, submitted #3.

iustin commented 6 years ago

@apoikos, thanks, #3 is merged now. @dpavlin, can you resubmit with just your changes?

googlebot commented 6 years ago

CLAs look good, thanks!

iustin commented 6 years ago

Replying on the general thread since I guess the individual comments will go away once resolved.

Thanks for the doc you pointed to (https://github.com/ffzg/gnt-info/blob/master/doc/deboostrap-4k-msdos.txt), I understand now what and how you're trying to fix.

If I understand correctly, the root of the problem is that a 4K-sector drive (under Linux) is still exposed as a 512b-sector drive when booted from via KVM. And you plan to do this by repartitioning the first partition to the same absolute byte offset (1M) after it being mounted by losetup since that exposes a 512b sector similar to how kvm expected. In other words, the partition as first created by the initial sfdisk would have 256 sector offset (256*4K=1MB), after redoing over losetup/512b the offset will be 2048 (2048*512b=same 1MB). Correct?

I see a few problems with this approach.

The first one is that this relies on KVM forever using 512b; if at one point it changes and actually picks up the backing device sector size, you'll break existing instances.

Second, you're also relying on losetup matching the same 512b. If it changes behaviour, you'll again break things because it won't necessarily match KVM. You could pass the -b argument to it, but this seems to only be supported in recent kernels… In any case, you should still pass 1MB to the sfdisk-over-losetup invocation, hardcoding 2048 here doesn't make sense.

And last, is that this is forcing everybody to use 512b sectors, even if not using KVM. The fix should at least add a configurable key to say "force 512b-sector partitions".

IMO the proper fix here would be to make KVM pick up the correct block size (see https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04458.html for example), but if you want to make this purely on the instance-debootstrap side, something along the following lines would be much cleaner:

replace 2048 with 1M, or even better, make the value default to empty string, so that sfdisk applies its default offset
add a setting to always use losetup, with the appropriate documentation, so instead of partitioning with 4K, formatting, installing, then repartitioning, it simpy wraps everything under losetup so the partitioning is only done once.

What do you think?

apoikos commented 6 years ago

I'd like to add a couple of notes:

First of all, this issue affects only setups that use a 4K logical block size. Setups with 4K physical block size and 512-byte logical block size work fine (at least ours does).

The root cause for this issue is that the partition table does not store any information about the sector size; the latter is provided by the disk itself and has to be looked up in order to convert LBA addresses to byte offsets. In our case, the issue arises because we partition the disk on the host (which sees the hardware's 4K logical block size), but go on to use it in the guest (which only sees a 512-byte sector); if we were running debootstrap/partitioning from inside the guest, then no problem would arise — other than the guest trying to do suboptimal 512-byte I/O.

The solution proposed by @dpavlin is to re-partition the device using 512-byte sectors through losetup. This works, but has the following drawbacks:

The guest will continue to do suboptimal 512-byte I/O
The guest's disk requires special handling to be accessed from inside the host, as kpartx will not work, possibly breaking instance-debootstrap's workflow — for example, instance renames or disk exports will not work as far as I can tell.

Instead of trying to settle this disagreement between guest and host, we could try to make sure that there is no disagreement at all! qemu has supported variable block sizes since 0.13, via the physical_block_size and logical_block_size blockdev options. In theory, all we would need to do is check the backing store's block size — if it's a raw device — using the respective sysfs attributes and add the relevant options to the device in KVM's command line. However, this also has its downsides:

It might break existing instances that somehow got a working partition table (e.g. if the OS provider does partitioning from within the instance), so we'd need some way of turning it off
I have no idea how this will interact with instance mobility. Imagine for instance a DRBD setup where one side has a 4K backing store and the other one a 512-byte. I'm not sure if DRBD would work in this case, but if it did we might end up using different logical block sizes pre- and post-failover/migration. To really protect against this, we would need to persist the logical block size that was used during instance creation in the instance's config and use it henceforth.

Overall, settling for 512-byte sectors universally is not a bad idea: it's the least common denominator and offers the best compatibility; I guess that's why Qemu does not pass the sector size through by default. To do this however, we must take care to always access the disk using 512-byte sectors, especially from within the host.

saz commented 3 months ago

I'm also running into this issue and things are a bit more complicated.

The fix by @dpavlin works for me, to get instances up and running, but if I need to make any changes on an instance volume, it's broken on hosts with 4k block size:

root@gnt14 /var/log/ganeti/os # kpartx -a /dev/vg0/fc0b9405-bd2a-4f83-8a8e-1c766de2d5a1.disk0
device-mapper: reload ioctl on vg0-fc0b9405--bd2a--4f83--8a8e--1c766de2d5a1.disk0p1 (253:16) failed: Invalid argument
create/reload failed on vg0-fc0b9405--bd2a--4f83--8a8e--1c766de2d5a1.disk0p1

To work around this issue, a loop device can be used:

root@gnt14 /var/log/ganeti/os # losetup --show -f /dev/vg0/fc0b9405-bd2a-4f83-8a8e-1c766de2d5a1.disk0
/dev/loop0
root@gnt14 /var/log/ganeti/os # kpartx -a /dev/loop0
root@gnt14 /var/log/ganeti/os #

Instead of repartition in unmap_disk0, we might just rely on a loop device in all cases, which seems to default to 512 bytes as sector size. This way, we only need to partition the block device once in format_disk0

ganeti / instance-debootstrap