Support 4k storage - Githubissues

ij1 commented 5 years ago

Qubes OS version

R4.0

Affected component(s) or functionality

VMs not working/starting right from a fresh install.

Brief summary

Right after a fresh install, all VMs fail to mount root and therefore fails to start beyond the point where they expect /dev/xvda3 available. This happens on a device that has 4kB logical and physical block sizes (NVMe drive). This was not problem in R3.2 (as it used files by default for VM storage).

To Reproduce

Steps to reproduce the behavior:

Install Qubes to a drive with 4kB sector size (both logical / physical); (I put /boot to a SATA drive with 512B sectors to avoid BIOS/NVMe boot challenges, rest of the system is on the NVMe with 4kB sectors).
Firstboot stuff fails
After clicking "finish" for firstboot, find out that no VM will start successfully (which explains firstboot failures I guess)
Look to the VM logs, and find this from there:

[    0.887548] blkfront: xvda: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    0.902355] blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    0.924386] blkfront: xvdc: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
[    0.940325] blkfront: xvdd: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
Waiting for /dev/xvda* devices...
Qubes: Doing R/W setup for TemplateVM...
[    1.049451] random: sfdisk: uninitialized urandom read (4 bytes read)
[    1.052481]  xvdc: xvdc1
[    1.060250] random: mkswap: uninitialized urandom read (16 bytes read)
Setting up swapspace version 1, size = 8 GiB (8589930496 bytes)
no label, UUID=...
Qubes: done.
mount: wrong fs type, bad option, bad superblock on /dev/xvda,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
Waiting for /dev/xvdd device...
mount: /dev/xvdd is write-protected, mounting read-only
[    1.099814] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem
[    1.106796] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Opts: (null)
mount: /sysroot not mounted or bad option

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
[    1.119049] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1e335a008d5, max_idle_ns: 440795216613 ns
mount: /sysroot not mounted or bad option

       In some cases useful info is found in syslog - try
       dmesg | tail or so.
switch_root: failed to mount moving /sysroot to /: Invalid argument
switch_root: failed. Sorry.
[    1.217841] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
...

Expected behavior

VMs would start. Firstboot stuff would work. Drives with 4kB sector size would work.

Additional context

I've tracked this down to the handling of the partition table. With 512B sectors the location of the GPT differs from that of with 4kB sectors and therefore VMs fail to find the correct partition table from xvda. Obviously also the partition start/end values will be off by the factor of 8 because the templates are built(?) with an assumption of 512B sector size.

I'm not sure if there are other assumptions based on 512B sectors with the other /dev/xvd* drives.

Solutions you've tried

I cloned a template and I tried to manually fix the partition table of the clone (in dom0 through /dev/qubes_dom0/...). There's was plenty of space before the first partition, however, at the end the drive is so tight on space that the GPT secondary table won't fit so the xvda3 partition's tail was truncated slightly and I didn't try to resize its filesystem first (this probably causes some problems, potentially corruption?). With such a fixed partition table, I could start VMs (but there are then some other problems/oddities that might be due to incomplete firstboot or non-fixed fedora template, I only fixed the debian one which I mainly use normally). I could possibly enlarge the relevant LV slightly to avoid the truncate problem at the tail of xvda3 but I've not tried that yet.

I tried to look if I could somehow force pv/vg/lv chain to fake the logical sector size but couldn't find anything from the manpages.

Libvirt might be able to fake the logical_block_size but I've not yet tried that.

Relevant documentation you've consulted

During install, I used the custom install steps to create manual partitioning (but I think it is irrelevant).

Related, non-duplicate issues

None I could find, some other issues included failure to mount root successfully but the causes are different.

Decided solution

Add a partition table conversion to initramfs. Specifically, write a tool that would check if partition table matches current block size. If it matches, do nothing. If not, convert it to the right block size format before mounting anything. And destroy the wrong partition table (if isn't directly overridden by the converted one) to prevent confusion which one is the current one.

References: https://github.com/QubesOS/qubes-issues/issues/4974#issuecomment-482897265 https://github.com/QubesOS/qubes-issues/issues/4974#issuecomment-1677356693

marmarek commented 5 years ago

Sector size is advertised by the block backend in xenstore (xenstore-ls /local/domain/0/backend/vbd/$DOMID/51712), but I don't see any option to force specific value.

This issue is really unfortunate, because a lot of places in Qubes assume you can freely transfer disk image and it will work just fine. This include cloning VMs (including cloning to a different storage pool), backup/restore etc. So, the solution here should be either:

find a way to construct a disk image to work on both 4k and 512 sector size
force VM to see 512 sector size

The second one may come with a performance penalty. The first one would not have this problem, but not sure if it's possible. I'm fine with making partition table 4K aligned, as long as it will also work with 512 sector size. But it isn't clear to me it would be enough.

Partition table and filesystem are built here: https://github.com/QubesOS/qubes-linux-template-builder/blob/master/prepare_image#L63-L83

Another idea would be to revert to a filesystem directly on /dev/xvda (without any partition table). This may not be as simple as it sounds, because we need to fit grub somewhere (with HVM with in-VM kernel case).

But this all may not work for other cases, including other OS. Imagine installing some OS (Linux, Windows, whatever) in a standalone HVM and then moving it to another storage pool (or restoring a backup on another machine). Those cases may require emulating constant sector size.

Sadly, I don't have any hardware with 4k physical sector size to test on. I'll try to find a way to emulate one.

BTW, another issue from 4k sector size is 8GB of swap, instead of 1GB. But this should be easy to fix in this script

marmarek commented 5 years ago

A lot of useful info: https://superuser.com/questions/679725/how-to-correct-512-byte-sector-mbr-on-a-4096-byte-sector-disk There is also a script to parse 512-byte GPT on 4k disk (and map it using loop devices). Using this, one workaround would be to adjust init.sh to rewrite GPT if sector mismatch is detected (in either direction). This require the partitions to be 4k aligned before, but it should be doable. But this is far from a complete solution, given non-template-based Linux use cases.

ij1 commented 5 years ago

There's not much to worry about 4k alignment, it is already there in the template: what I gathered, the partition table tools nowadays will enforce at least 4k alignment and they will warn if that would be violated (some might do even larger alignment). This is why I managed to rewrite the template's partition table in the first place so easily (except the truncate issue).

I don't think forcing 512 sector size itself would come with a large penalty as in practice the filesystems inside will use something larger than 512 (depending on how all relevant block stuff handles the larger continuous units of course but I'd guess that would not cause performance problems). So it would be mostly relevant for booting up correctly. What I'd rather avoid though, is forcing my drive's firmware to use 512 sector size as it would explore less tested corners of the firmware and possibly have significant performance impact too (I know my NVMe drive could do 512 but I don't know if all 4k drives are able so this needs to be handled anyway).

~~Btw, the USB HDDs might expose 4kB when not behind the SATA-to-USB converter, perhaps you have one of them which you can disassemble to get such a device?~~ (losetup seems able to fake it as noted below)

Like I said, libvirt supposedly has a way to configure logical_block_size but I don't know if that is able to fake it for real: https://libvirt.org/formatdomain.html ...or is that only for KVM?

I'll probably try to use the file backend (that's what was used in R3.2 right?) for the main system for now (the NVMe drive should clone fast anyway :-) so the biggest downside I know of is a non-issue). Can the installer do that automatically if I simply reinstall (that is, how it chooses which type of storage pool to use by default) or do I have to manually setup everything afterwards skipping the firstboot stuff to avoid it failing? I can then look into the 4k stuff while other stuff keeps working fine with 512. That would also allow me to easily test cross 512 and 4k copying but that looks rather scary to begin with, so far about nothing seems portable from one sector size to another from what I've read.

If the partition table would be removed from xvda, the grub might have a similar 4k vs 512 issue anyway so that might not solve anything (sector was mentioned somewhere when I tried to look into what kind of information format it uses which sound bad) but this needs a deeper investigation.

ij1 commented 5 years ago

Losetup seems able to fake logical block size:

https://github.com/karelzak/util-linux/commit/a1a41597bfd55e709024bd91aaf024159362679c

marmarek commented 5 years ago

...or is that only for KVM?

Yes, I think it's KVM only.

Can the installer do that automatically if I simply reinstall

If you choose btrfs during installation, Qubes will use that instead of LVM.

ij1 commented 5 years ago

Could not the faking be done other way around? I'd feel by intuition that in block code log4096 -> phy512 is far simpler than log512 -> phy4096.

Or is there some particular reason why 512 is still needed for the VM disk format that is almost internal to Qubes. VMs will obviously see the end result but they should have little reason to change how the sector size is defined by the "internal" format. Or is there some other OS that only works with 512?

That would leave just a few things to address:

block support for log4096 -> phy512 (or any 2^n to be more generic).
Template building code to create 4k GPT (sfdisk doesn't seem to support forcing sector size unlike some other partition tools but losetup could be used to fake it).
Fix the 512 assumptions (such as the one with swap partition sizing), hopefully not that many
One-time migration (at new release?) which would be at most as complex as a GPT rewrite workaround needed for supporting both sizes, probably somewhat less.

marmarek commented 5 years ago

Or is there some particular reason why 512 is still needed for the VM disk format that is almost internal to Qubes

I'm not sure about disks emulated by QEMU. And then windows PV drivers. Recently I've seen some patches flying around fixing 512 sector size assumption somewhere there, so there still may be more issues like this. Given various elements involved, I think 512 is simply safer in terms of compatibility.

brendanhoar commented 5 years ago

Can I throw in another alignment data point to consider: the LVM chunk_size, which can range from 64KB to 1MB.

Policy-wise, Qubes may want to consider ensuring that any physical partitions (or partitions inside lvm LVs), that are created by qubes tools and/or installer, are 1MB aligned, primarily for performance reasons. Probably not as critical as the baseline fixes to ensure 4K logical sector drives work, but since that requires changes, consider enforcing a much more strict alignment going forward (see the volatile volume issue #5151).

Brendan

arno01 commented 4 years ago

If anyone needs 4Kn templates right now, can use my patch from https://gist.github.com/arno01/ae31e1e9098591dadde3d1fc8c707000

I have also found that partprobe will fail to spawn the partitions off loop devices created with the custom sector size (losetup: -b / --sector-size) not corresponding to the sector size of the backing disk on Linux < 4.18-rc4.

And there is some interesting discussion about the 4Kn sector disks. IIUC, the point Alan Cox makes there is that this kind of problem should be solved at the partitioning level, not at the xenbus / LVM / Linux kernel.

rustybird commented 1 year ago

This will become a bigger problem with R4.2, where cryptsetup >= 2.4.0 (Fedora >= 35) will automatically create dm-crypt with 4K logical sectors even on 512e drives.

Ideas:

Invoke cryptsetup luksFormat with an explicit --sector-size=512 argument for the LVM Thin installation layout (fix for 512e drives)
Or attach thin volumes to the VM via ~loop devices~ dm-ebs (fix for 4Kn and 512e drives)
- With an optimization to skip the ~loop device~ dm-ebs setup (passing through the thin volume) when it already has the right logical block size, and a Volume.logical_block_size property (defaulting to 512) it could be a way to gradually opt into 4K storage volumes in general.

brendanhoar commented 1 year ago

I wonder if Qubes pools should specify the sector size of their underlying storage technology, and whether importing volumes should involve a conversion step?

B

rustybird commented 1 year ago

Conversion during import would mean parsing VM data in dom0 😬

Or a DisposableVM I guess.

rustybird commented 1 year ago

Ok someone should definitely write a DisposableVM-powered converter for common volume contents.

But automatic conversion won't be possible in all cases (like standalone HVMs where a volume could contain anything, e.g. bs dependent filesystems like XFS that might not be straightforward to upgrade), so even with a very good converter there's still a need for

per-volume metadata recording the appropriate bs for its current content
a mechanism to present the volume to the VM with that bs, even if the storage pool's ideal bs is different, e.g. after restoring from a backup

HW42 commented 1 year ago

This will become a bigger problem with R4.2, where cryptsetup >= 2.4.0 (Fedora >= 35) will automatically create dm-crypt with 4K logical sectors even on 512e drives.

Interesting. Do you know how they implement this? Because I thought this direction is the tricky one, because a block device should guarantee atomic writes per sector (in other words you should always see either the version before the write or a fully updated sector, but not a mix). So a proper implementation likely needs a journal.

a mechanism to present the volume to the VM with that bs, even if the storage pool's ideal bs is different, e.g. after restoring from a backup

At lest on dm level support seems to exists: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/dm-ebs.html

rustybird commented 1 year ago

cryptsetup >= 2.4.0 (Fedora >= 35) will automatically create dm-crypt with 4K logical sectors even on 512e drives.

Do you know how they implement this? Because I thought this direction is the tricky one, because a block device should guarantee atomic writes per sector

I'm kinda curious too about how writes really work for kernel -> 512e drive communication.

Pure speculation: Since both the kernel and the drive know that the drive's physical block size is 4K, maybe the kernel just always writes batches of 8 * 512B logical blocks - and when the drive sees logical blocks coming in fast enough, one immediately following another, it figures out that read-modify-write can be avoided? Or there could be some explicit way for the kernel to signal to the drive that it's aware of 512e and that it guarantees to send 4K blocks merely encoded as batches of 512B blocks.

https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/dm-ebs.html

Huh. Thanks! Wonder if that's better than a loop device.

DemiMarie commented 1 year ago

I can think of at least two solutions:

Place the partition table for a 512e device between the protective MBR and the 4Kn GPT. There are 7 512-byte sectors in this space, which allows for up to 6 partitions. This is enough to fit all three partitions used by Qubes OS, plus one extra covering the 4Kn partition table. The only problem with this approach is that if the block size is 4K, the protective MBR will appear to extend past the end of the device. I suspect this is harmless.
Use a partition table that is not part of the image, but instead is overlayed on it at runtime. This can be done by using dm-linear.

rustybird commented 1 year ago

@DemiMarie I don't get (1). Would there be some script in the VM's initrd to rewrite the partition table ("activating" the stashed away 512B or 4K version) depending on xvda's current logical block size?

Dynamically switching back and forth between 512B and 4K partitioning in general seems like it could make resizing the volume (resize-rootfs-if-needed.sh and resize-rootfs) a little scary...

marmarek commented 1 year ago

Generally, I'd try to avoid any kind of conversion at startup and go for emulation when necessary. That means:

recording expected block size by the volume
emulating it, if the underlying pool has different block size

And then, either build templates with two flavors, or convert at the install time (as part of qvm-template-postprocess), if reasonably easy.

Can we get away without emulating 4k bs on 512 bs devices?

rustybird commented 1 year ago

Once there's a way to attach a volume as 4K, why even bother building (or converting to) 512B templates.

Can we get away without emulating 4k bs on 512 bs devices?

Forcing --sector-size=4096 for luksFormat in the installer (or reencrypt in the upgrade script) even on drives reporting 512B physical sectors would have the same effect.

I'd guess almost all of those drives (that make sense to install a Qubes storage pool on) actually have 4K physical sectors anyway, but it's misreported by shoddy firmware or an adapter.

rustybird commented 1 year ago

Ah wait, I haven't tested if reencrypt --sector-size=4096 would work to bump up the logical block size of an existing LVM Thin pool. I'll try it today or tomorrow.

HW42 commented 1 year ago

This will become a bigger problem with R4.2, where cryptsetup >= 2.4.0 (Fedora >= 35) will automatically create dm-crypt with 4K logical sectors even on 512e drives.

Interesting. Do you know how they implement this? Because I thought this direction is the tricky one, because a block device should guarantee atomic writes per sector (in other words you should always see either the version before the write or a fully updated sector, but not a mix). So a proper implementation likely needs a journal.

To answer my own question, they just ignore the problem. From the manpage:

Note that if sector size is higher than underlying device hardware sector and there is not integrity protection that uses data journal, using this option can increase risk on incomplete sector writes during a power fail.

I'm kinda curious too about how writes really work for kernel -> 512e drive communication.

I'm not totally sure, but I assume the drive really emulates 512 B sectors (so the kernel does nothing special, but you of course have usually aligned and bigger writes from the filesystems) and the internal handling is up to it's firmware.

I'd guess almost all of those drives (that make sense to install a Qubes storage pool on) actually have 4K physical sectors anyway, but it's misreported by shoddy firmware or an adapter.

I think the logical sector size can't be missreported since this would break addressing. "Physical" sector size is of course up to what the firmware says. Most drives will be flash anyway and AFAIU they usually aren't 4096 sectors nowadays but something bigger. Given all the handling you need due to big erase blocks, wear leveling, caching, etc. I don't think it matters much. FWIW: I just checked two Samsung SSDs and they both report 512 B logical and physical sector size. One of them is a NVMe drive. A while back I checked and you couldn't reformat it to 4096 B sectors via nvmectl. I guess for Samsung's consumer series it's just easier for them to just support one config. Might be different for "data center" stuff.

rustybird commented 1 year ago

Might be different for "data center" stuff.

Thanks :roll_eyes: Samsung! I just checked /sys/block/DEVICE/queue/*_block_size for an older consumer 850 Pro and indeed it's 512 logical / 512 physical where my daily driver, a datacentery PM883 says 512/4096. Both SATA.

But a consumer NVMe by Western Digital also said 512/4096.

a block device should guarantee atomic writes per sector

Come to think of it, was that ever something to be relied on? Like, even in a pure 512B world of ancient spinning rust - partially written or otherwise corrupted blocks are legal (and not unusual) in case of a power outage, no? And block device users like filesystems must cope with this through checksummed journals or other suitable data structures.

brendanhoar commented 1 year ago

Almost all storage primary devices produced today are (at least internally) logically 4096 and present externally as either 512e (emulated logical 512) or 4Kn (native).

Physically it's even weirder since now both flash and spinning platters on primary storage can have larger physical minimal write sizes under some/all circumstances (and much larger physical erase/zeroing sizes for flash, such as 1MB or 4MB). Also the SMR disks are super-super-weird.

But we mostly have to worry about whether the pool is on a 512n, 512e or 4Kn device.

I think the safest thing to do is force everything (partitions, crypto overlays, pools, templates Volumes, filesystems) to aligned, 4K block sizes up and down the stack.

For better erase-block-faster-after-discard support, keeping partitions/overlays/volumes 1MB or 4MB aligned might be even better.

Possibly keep around a 512 pool (or allow creation of an ephemeral 512 pool before conversion) for restoring/importing 512-based VMs but with the knowledge that those will run non-optimally until converted. And direct-io would be disabled for such a pool on a 4Kn device.

I think the future is 4Kn. Near future anyway.

B

DemiMarie commented 1 year ago

@DemiMarie I don't get (1). Would there be some script in the VM's initrd to rewrite the partition table ("activating" the stashed away 512B or 4K version) depending on xvda's current logical block size?

No, both partition tables would be present at all times. The logical block size determines which one the kernel will parse.

HW42 commented 1 year ago

No, both partition tables would be present at all times. The logical block size determines which one the kernel will parse.

Neat idea, but not sure if that's a good path to take. Sounds like this will blow up at some point because something doesn't handle it like expected.

Thanks roll_eyes Samsung!

Given that it's flash and 4096 B sectors aren't really accurate either I don't think it matters. Anyway this is getting a bit off topic, we have to expect hardware that reports all variants of common logical/"physical" sector sizes.

[...] keeping partitions/overlays/volumes 1MB or 4MB aligned might be even better.

fdisk/parted/etc. default since quite a while to aligning to 1 MiB. So this should be the case anyway (although I didn't double check all templates). The question here is about fixing addressing when logical sectors size isn't 512 B.

a block device should guarantee atomic writes per sector

Come to think of it, was that ever something to be relied on? Like, even in a pure 512B world of ancient spinning rust - partially written or otherwise corrupted blocks are legal (and not unusual) in case of a power outage, no?

I assume in that case either the drive has it in some non-volative cache, has enough energy to finish write it to something non-volatile or the internal error correcting codes will not checkout and you get a read error.

And block device users like filesystems must cope with this through checksummed journals or other suitable data structures.

So I have read a bit what people write about that topic and it seems that while sector (or bigger) write atomicity is sometimes provided in general you can't really assume it unless you really checked your stack from application down to the drive firmware (and most of the time the answer will be "don't know").

Given that the cryptsetup developers decided that it's an acceptable trade-off moving to always emulate 4 KiB sectors in a future release sounds more attractive to me now. We would still need some emulation for booting restored backups or similar and if feasible a conversion tool might be handy.

DemiMarie commented 1 year ago

No, both partition tables would be present at all times. The logical block size determines which one the kernel will parse.

Neat idea, but not sure if that's a good path to take. Sounds like this will blow up at some point because something doesn't handle it like expected.

I recommend checking that at least Linux and systemd-gpt-generator handle it properly. We will of course need to implement our own code for writing these hybrid partition tables, but existing tools should have no problems reading them, at least if I understood the GPT specification correctly.

And block device users like filesystems must cope with this through checksummed journals or other suitable data structures.

So I have read a bit what people write about that topic and it seems that while sector (or bigger) write atomicity is sometimes provided in general you can't really assume it unless you really checked your stack from application down to the drive firmware (and most of the time the answer will be "don't know").

Some more information:

ext4 blindly assumes that 4K overwrites are atomic, no matter what the logical sector size is.
XFS is aware of the logical vs physical sector size distinction. It will try to use physical-sector writes where possible, but only assumes atomicity of logical-sector writes.
I am not sure about BTRFS, ZFS, LVM, or dm-thin. With BTRFS and ZFS at a minimum one should get an error instead of wrong data.

DemiMarie commented 1 year ago

I'd guess almost all of those drives (that make sense to install a Qubes storage pool on) actually have 4K physical sectors anyway, but it's misreported by shoddy firmware or an adapter.

I think the logical sector size can't be missreported since this would break addressing. "Physical" sector size is of course up to what the firmware says. Most drives will be flash anyway and AFAIU they usually aren't 4096 sectors nowadays but something bigger. Given all the handling you need due to big erase blocks, wear leveling, caching, etc. I don't think it matters much. FWIW: I just checked two Samsung SSDs and they both report 512 B logical and physical sector size. One of them is a NVMe drive. A while back I checked and you couldn't reformat it to 4096 B sectors via nvmectl. I guess for Samsung's consumer series it's just easier for them to just support one config. Might be different for "data center" stuff.

There was a paper (I will try to find a link later) that found that on some NVMe devices, 512B writes were atomic while 4096B writes were not.

DemiMarie commented 1 year ago

Overall, I think these kind of problems are a good reason to switch to BTRFS-by-default, at least if BTRFS actually solves them. Since BTRFS is copy-on-write, it can (at least in theory) provide atomicity guarantees for arbitrary-sized writes, and I hope that it provides them for 4K writes at a minimum.

HW42 commented 1 year ago

Overall, I think these kind of problems are a good reason to switch to BTRFS-by-default [...]

Uhm, I think we are getting a bit off-track here. This issue is about things not working on devices that report 4096 B logical sectors. Switching the default storage pool wouldn't even solve this, beside the other questions it raises (this doesn't mean it's a bad idea, but not for this issue).

We have VM volumes that has been created with 512 B sectors, so there will need to be some way to boot them (universal converter seems not practical, a tool to convert standard volumes might make sense).

I think there are 3 general options:

passthrough what the underlying device reports and translate if the (to be added) metadata of the VM volume says the VM expects something else.
always expose 512 B sectors to VMs
always expose 4096 B sectors to VMs

1. sounds complicated but it's an options. And since templates are read-mostly, maybe we wouldn't even need to provide 2 variants of them.

2. should be safe but might have some performance impact.

I was initial skeptical of 3. because of the atomic write topic, but as I wrote above, after reading more it doesn't seem to be a big deal to me.

rustybird commented 1 year ago

I haven't tested if reencrypt --sector-size=4096 would work to bump up the logical block size of an existing LVM Thin pool.

Moving an existing LVM Thin pool from a logical 512B block device to a logical 4K block device appears to work fine and does bump up the logical block size

Edit: TODO: test online LUKS2 reencryption with an active pool hosting a mounted dom0 filesystem

Edit 2: That worked too. It takes a reboot for the new logical block size to become active. (Reencryption speed was 104 MiB/s on a T420, btw.)

rustybird commented 1 year ago

Here's a written down concrete proposal of the sketched out transition to 4K storage, so my head doesn't explode.

It would fix 4Kn drive lvm_thin incompatiblity (this issue), prevent the upcoming 512e drive lvm_thin incompatibilty, and accomodate the corner case of unencrypted (or otherwise unable to be reencrypted to 4K) lvm_thin pools on 512B drives.

Let me know what I've missed.

1. Add `Volume.logical_block_size` property

possible values 0, 512, 4096
default value 0 in Volume.__init__()
volume cloning preserves source value
expose setter via Admin API
- qvm-volume config
qvm-template(-postprocess) support
- assign value for 'root' from template package metadata (512 if missing)
backup/restore support
- preserve value for 'root' and for 'private'
- if feasible, preserve nonzero value for 'volatile' object (see "rare case" in section 4)
- Wyng?
forum: notify existing users of custom built 4K templates that they should assign 4096 in preparation
- also after they restore already backed up (before property implementation) custom built 4K templates in the future

2. Storage driver support for `Volume.logical_block_size`

treat value 0 as 512
file-reflink, 'linux-kernel': losetup --sector-size=... in /etc/xen/scripts/block
- old losetup triggers a race in old kernels with that argument
'file': ?
lvm_thin: dm-ebs (can emulate in either direction)
ephemeral_volatile
- call cryptsetup --type=plain with --sector-size=4096 (fc32 manpage is outdated, it works)
- optional optimization for 512 as 4K: skip dm-ebs layer, 4K dm-crypt is enough
- 'volatile' size must be a multiple of 4K (neither dm-ebs nor dm-crypt allow unaligned size)
- lvm_thin: looks like LVM Thin already aligns size to 4 MiB
- file(-reflink): loop device always aligns size to 512

3. Optimize Qubes OS installation

installer: hardcode cryptsetup luksFormat --sector-size=4096 (if encryption is used)
- raw device size must be a multiple of 4K or cryptsetup will fail
- ensure good partitioning
qubes-dist-upgrade: optional final cryptsetup reencrypt --sector-size=4096 phase where feasible
- not if XFS is dom0 filesystem
- tricky if raw device size isn't a multiple of 4K
- shrink the LUKS2 header?
- offline reencryption instructions for old LUKS1 users coming from R4.0?

4. Move to 4K storage volumes

new and existing 'kernel' volumes: in LinuxModules.__init__(), override value 0 to 4096
new and existing 'volatile' volumes: in Volume.__init__(), override value 0 to 4096 on name match
- dependency on new enough kernel-(latest-)qubes-vm / qubes-kernel-vm-support
- rare case: standalone HVMs etc. doing custom partitioning of 'volatile' on startup
- it's just 'volatile' - might be okay to break, and require user to assign 512 or fix their custom script?
new volumes: create with value 4096 (cloning/restoring might then assign something else)
build 4K template packages
everything else: handle easy cases, provide user guidance and converters
- existing 'private'
- could be custom incompatible filesystem (XFS) etc.
  - VM init script advertising volume-private-supports-logical-block-size=4096 feature if standard filesystem detected?
- mass assignment command for people who know they're only using normal 'private' volumes
- existing 'root' in TemplateVMs and StandaloneVMs
- converter scripts

brendanhoar commented 1 year ago

hardcode cryptsetup luksFormat --sector-size=4096 in installer

Any other steps required for non-encrypted installs?

B

rustybird commented 1 year ago

No - I've edited that point to clarify.

tlaurion commented 1 year ago

cryptsetup --device-size=... appears to be broken for this purpose

report bug

That is partly covered here: https://gitlab.com/cryptsetup/cryptsetup/-/issues/585

Which requires manual alignment and lead to open https://forum.qubes-os.org/t/ssd-maximal-performance-native-sector-size-partition-alignment/10189/25 for misalignment issues and needs for manual calculation since sfdisk and other tools were not properly dealing with partition table and partition alignments.

brendanhoar commented 1 year ago

That is partly covered here: https://gitlab.com/cryptsetup/cryptsetup/-/issues/585

Heh, as an aside, I'm fairly certain this cross-manufacturer usb-SATA bridge behavior (stealing the last sector) exists in some chipsets to support saving state (likely not used by most variations of the bridges, even, but e.g. encryption/password) without having to include eeprom or NOR flash on the chipset. Possibly to ensure portability of that state with the drive (within the chipset anyway).

The amount of chaos it was to cause, however, was likely not forecast by the original engineers...

If not done already, it would be wise to ensure the remnant non-alignment friendly tails at the end of the storage devices are always excluded from usage via partitioning excluding them.

B

rustybird commented 1 year ago

https://gitlab.com/cryptsetup/cryptsetup/-/issues/585

Thanks! Latest updates to the proposal:

workaround for cryptsetup is to ensure good partitioning
clarified default value mechanism for the property, it was too vague
legacy 'file' driver doesn't use /etc/xen/scripts/block, but 'linux-kernel' driver does

marmarek commented 1 year ago

Sadly, I don't have any hardware with 4k physical sector size to test on. I'll try to find a way to emulate one.

Here is openQA run with 4kn emulated disk: https://openqa.qubes-os.org/tests/52712 fdisk there reports:

fdisk -l /dev/nvme0n1
fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 80 GiB, 85899345920 bytes, 20971520 sectors
Disk model: QEMU NVMe Ctrl                          
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

And every VM fails to start, as expected.

Note to self: set HDDMODEL=nvme,physical_block_size=4096,logical_block_size=4096

marmarek commented 1 year ago

Given the failure mode on 4.2 is worse than on 4.1, I think we should have it in 4.2. The plan outlined by @rustybird in https://github.com/QubesOS/qubes-issues/issues/4974#issuecomment-1290891792 looks good. @rustybird are you up for implementing this?

DemiMarie commented 1 year ago

@marmarek: what about write tearing? 4K sector writes on a 512e driver are not guaranteed to be atomic, and IIRC are not atomic on some low-end SSDs in the event of power failure. XFS takes precautions against this.

rustybird commented 1 year ago

Maybe we should get the R4.2 regression affecting 512e disks out of the way first, by hacking the installer to use cryptsetup luksFormat --sector-size=512 whenever the LVM Thin layout has been selected and the destination disk has a 512-byte logical block size.

But yes I'm also still interested in starting on the proposal's phase 1 and 2 at least, to fix the existing lvm_thin (and zfs driver too?) incompatibility with 4Kn drives. Not sure how long it will take though.

Phase 3 and 4 is where things would get spicy with the whole atomicity question. If a Qubes storage volume is exposed to the VM as a 4K block device even if the disk hardware might not provide atomic 4K writes, will this cause the filesystem on the volume to falsely rely on 4K writes being atomic when it otherwise wouldn't have - either for its own purposes in attempting to preserve its data structures' integrity, or as some sort of ineffective guarantee to an application writing data into a file? Tbd... There's an interesting writeup: https://stackoverflow.com/a/61832882

marmarek commented 1 year ago

Maybe we should get the R4.2 regression affecting 512e disks out of the way first, by hacking the installer to use cryptsetup luksFormat --sector-size=512 whenever the LVM Thin layout has been selected and the destination disk has a 512-byte logical block size.

I tried in https://github.com/QubesOS/qubes-anaconda/pull/28 But it didn't worked: https://openqa.qubes-os.org/tests/80043:

[2023-08-14 05:27:22] Waiting for /dev/xvda* devices...

[2023-08-14 05:27:22] Qubes: Doing R/W setup for TemplateVM...

[2023-08-14 05:27:23] [    2.836815]  xvdc: xvdc1 xvdc3

[2023-08-14 05:27:23] Setting up swapspace version 1, size = 1073737728 bytes

[2023-08-14 05:27:24] [    3.052640] random: crng init done

[2023-08-14 05:27:24] UUID=50685a9b-25ff-4cb6-b107-7170602a08e6

[2023-08-14 05:27:24] Qubes: done.

[2023-08-14 05:27:24] mount: mounting /dev/mapper/dmroot on /sysroot failed: Invalid argument

[2023-08-14 05:27:24] Waiting for /dev/xvdd device...

[2023-08-14 05:27:24] [    3.086464] /dev/xvdd: Can't open blockdev

[2023-08-14 05:27:24] [    3.086937] EXT4-fs (xvdd): mounting ext3 file system using the ext4 subsystem

[2023-08-14 05:27:24] [    3.089277] EXT4-fs (xvdd): mounted filesystem with ordered data mode. Quota mode: none.

[2023-08-14 05:27:24] mount: mounting none on /sysroot/lib/modules failed: No such file or directory

[2023-08-14 05:27:24] [    3.172895] EXT4-fs (xvdd): unmounting filesystem.

[2023-08-14 05:27:24] mount: can't read '/proc/mounts': No such file or directory

[2023-08-14 05:27:24] BusyBox v1.36.0 (2023-01-10 00:00:00 UTC) multi-call binary.

[2023-08-14 05:27:24] 
[2023-08-14 05:27:24] Usage: switch_root [-c CONSOLE_DEV] NEW_ROOT NEW_INIT [ARGS]

[2023-08-14 05:27:24] 
[2023-08-14 05:27:24] Free initramfs and switch to another root fs:

[2023-08-14 05:27:24] chroot to NEW_ROOT, delete all in /, move NEW_ROOT to /,

[2023-08-14 05:27:24] execute NEW_INIT. PID must be 1. NEW_ROOT must be a mountpoint.

[2023-08-14 05:27:24] 
[2023-08-14 05:27:24]   -c DEV  Reopen stdio to DEV after switch

[2023-08-14 05:27:24] [    3.193656] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100

I think I did set it correctly:

root# cryptsetup luksDump /dev/nvme0n1p3; echo efDdj-$?-
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]
UUID:           c51bf27f-c7e1-443b-917d-b95db5145101
Label:          (no label)
Subsystem:      (no subsystem)
Flags:          (no flags)

Data segments:
  0: crypt
    offset: 16777216 [bytes]
    length: (whole device)
    cipher: aes-xts-plain64
    sector: 512 [bytes]
...

Any ideas?

rustybird commented 1 year ago

@marmarek:

But it didn't worked: https://openqa.qubes-os.org/tests/80043:

The test had HDDMODEL configured with physical_block_size=4096,logical_block_size=4096, so it's emulating a 4Kn drive. To test 512e (in the sense that's causing the R4.2 regression due to new cryptsetup) it should be physical_block_size=4096,logical_block_size=512

Hardcoding --sector-size=512 doesn't help with 4Kn drives, because dm-crypt's sector_size cannot shrink the logical block size (compared to the underlying block device), we can only avoid enlarging it.

DemiMarie commented 1 year ago

@marmarek: I think overlaying two different GPTs is the only reasonable approach here. At some point everyone will be using bcachefs or another filesystem that does not care about sector size but we are not there yet.

marmarek commented 1 year ago

@marmarek: I think overlaying two different GPTs is the only reasonable approach here.

Is it only partition table issue? What about the filesystem? But also, the remark about dynamic resize is a valid one. Only one partition table will be updated, so if you migrate to a disk with a different logical block size, you'll get truncated partition. But if that's just about partition table, maybe we can do the conversion in initramfs before mounting anything? Assuming everything is 4k-aligned, it should be technically possible, right?

DemiMarie commented 1 year ago

@marmarek: I think overlaying two different GPTs is the only reasonable approach here.

Is it only partition table issue? What about the filesystem? But also, the remark about dynamic resize is a valid one. Only one partition table will be updated, so if you migrate to a disk with a different logical block size, you'll get truncated partition. But if that's just about partition table, maybe we can do the conversion in initramfs before mounting anything? Assuming everything is 4k-aligned, it should be technically possible, right?

Yup, it’s just about partition table, and we can avoid the dynamic resize problem by changing the partition table with our own tools that understand the different layout.

marmarek commented 1 year ago

@rustybird what do you think about adjusting partition table in initramfs?

rustybird commented 1 year ago

I like the simplicity of it, especially compared to my 4-phase slog of a proposal.

The adjustment script should probably bail out early if the root volume looks too nonstandard? E.g. if xvda3 is not an ext4 filesystem (or another filesystem type that's whitelisted as known to be logical-block-size agnostic).

DemiMarie commented 1 year ago

Which filesystem types should be on the allowlist?

rustybird commented 1 year ago

IIRC ext3 and btrfs are fine too, xfs definitely isn't

maybebyte commented 10 months ago

Overall, I think these kind of problems are a good reason to switch to BTRFS-by-default [...]

Uhm, I think we are getting a bit off-track here. This issue is about things not working on devices that report 4096 B logical sectors. Switching the default storage pool wouldn't even solve this, beside the other questions it raises (this doesn't mean it's a bad idea, but not for this issue).

Setting aside the question of what a good default would be for Qubes OS, using btrfs instead of LVM + ext4 actually does work around this issue on my 1 TB NVMe SSD (the model is a Sandisk Corp WD Blue SN570). It reports 4096 for both logical and physical sector size due to me following the instructions in the ArchWiki entry for Advanced Format a while back.[^1]

Anyway, I ran into the same issue that OP mentioned while installing R4.2 on my desktop using the default disk configuration scheme, but everything works OK so far on a reinstall using btrfs. I wouldn't have even thought to do this if not for this issue and this comment on a duplicate issue.

It makes me wonder whether using other filesystems that have similar properties to btrfs would work, such as ZFS (I'm not suggesting Qubes OS needs to adopt ZFS; I'm just thinking out loud here). Seems like adjusting the drive to emulate a sector size of 512 again may not be a viable workaround based on others' comments here, but I haven't tested it.

[^1]: I'm aware that Advanced Format is a hard drive specific thing, but that's what the wiki entry is called.

QubesOS / qubes-issues

Support 4k storage #4974

1. Add `Volume.logical_block_size` property

2. Storage driver support for `Volume.logical_block_size`

3. Optimize Qubes OS installation

4. Move to 4K storage volumes

QubesOS / qubes-issues

Support 4k storage #4974

1. Add Volume.logical_block_size property

2. Storage driver support for Volume.logical_block_size

3. Optimize Qubes OS installation

4. Move to 4K storage volumes

1. Add `Volume.logical_block_size` property

2. Storage driver support for `Volume.logical_block_size`