coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Partition Layout #18

Closed ajeddeloh closed 6 years ago

ajeddeloh commented 6 years ago

In conversations we had recently, we think that FCOS should have a default partition layout, similar to how CL has a standard fs layout, since it provides consistency across bare metal and clouds as well as making the image "dd-able" directly to a drive (which makes installation trivial). Any further disk modification should be done via Ignition.

What should that partition layout look like?

My (quick and not fully thought out) proposal:

1 - EFI-SYSTEM (needed for all EFI systems)
2 - BIOS-BOOT  (holds grub for systems that need bios booting)
3 - ROOT       (i.e. everything else)

Ideally we'd be able to move ROOT around using Ignition and re-deploy the OSTree to wherever we moved it, between the disks and files stages. If you're on an EFI system you could even wipe away BIOS-BOOT to make more room (not that it's terribly large). There are some tricky cases with that which we're still exploring, but it should be possible at the very least in simple cases.

dustymabe commented 6 years ago
1 - EFI-SYSTEM (needed for all EFI systems)
2 - BIOS-BOOT  (holds grub for systems that need bios booting)
3 - ROOT       (i.e. everything else)

sounds like a reasonable default.. just to be clear people would be able to add a separate mounted /var/ or /home/, etc, via ignition on boot but everyone would start from the same place ?

ajeddeloh commented 6 years ago

Yeah. Isn't /home actually just part of /var on ostree systems?

dustymabe commented 6 years ago

Isn't /home actually just part of /var on ostree systems?

it is but I think we've enabled (because people asked for it) to have /home/ (or rather /var/home/?) be a separate block device mount.

cgwalters commented 6 years ago

Do you see us adopting the GPT generator and avoiding the need to have the OS mounts in /etc/fstab too?

cgwalters commented 6 years ago

it is but I think we've enabled (because people asked for it) to have /home/ (or rather /var/home/?) be a separate block device mount.

Yeah, we support that; it was the primary rationale for the systemd fstab symlink patch.

ajeddeloh commented 6 years ago

We could say that all the standard top level directories on / except /var and /boot need to be on the same partition (kinda like how we say you can't move USR-{A,B} on Container Linux). Since /var is empty by default and can be populated by systemd-tmpfiles, users can add a mount unit for it and partition it however they want. You want a separate partition for /var, /var/home and /var/srv? Go for it.
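To make the "add a mount unit for it" idea concrete, here is a minimal sketch that renders systemd mount units for /var, /var/home and /var/srv. It is illustrative only: the partition labels (`VAR`, `HOME`, `SRV`), the `xfs` filesystem type, and both helper functions are assumptions that do not come from this thread, and the unit-name helper implements only a simplified subset of systemd's path escaping.

```python
import re

# Only plain path components are handled; anything else would need
# systemd's full \x escaping, which this sketch deliberately skips.
_SAFE_COMPONENT = re.compile(r"^[A-Za-z0-9_]+$")


def mount_unit_name(mount_point: str) -> str:
    """Return the unit file name systemd expects for a mount point.

    systemd requires a mount unit to be named after its path, with
    '/' turned into '-' (so /var/home becomes var-home.mount).
    """
    parts = [p for p in mount_point.split("/") if p]
    if not parts:
        return "-.mount"  # the root file system
    for part in parts:
        if not _SAFE_COMPONENT.match(part):
            raise ValueError(f"component {part!r} needs full systemd escaping")
    return "-".join(parts) + ".mount"


def render_mount_unit(mount_point: str, partlabel: str, fstype: str = "xfs") -> str:
    """Render a mount unit that finds its device by GPT partition label."""
    lines = [
        "[Unit]",
        f"Description=Mount {mount_point} from partition labelled {partlabel}",
        "",
        "[Mount]",
        f"What=/dev/disk/by-partlabel/{partlabel}",
        f"Where={mount_point}",
        f"Type={fstype}",
        "",
        "[Install]",
        "WantedBy=local-fs.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical labels; a real config would pick its own.
    for path, label in (("/var", "VAR"), ("/var/home", "HOME"), ("/var/srv", "SRV")):
        print(f"# {mount_unit_name(path)}")
        print(render_mount_unit(path, label))
```

Because the units refer to partitions by label, the same files could be shipped to every machine unchanged.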

I'm not a huge fan of the GPT generator. It's pretty limited since it only supports a few partitions and reeks of magic that users may not be aware of. Much better to have explicit mount units imho.

cgwalters commented 6 years ago

I'm not a huge fan of the GPT generator. It's pretty limited since it only supports a few partitions and reeks of magic that users may not be aware of. Much better to have explicit mount units imho.

Yeah; agree with the magic aspect. Though the specific thing I don't like about it is that if one plugs in a backup drive into a system, you might end up having the old /home partition mounted into your current OS. The scheme actually says explicitly:

Since the GPT partition table scheme cannot express which sets of partitions belong to a single OS installation, and to avoid the risk of accidentally mixing and matching incorrect combinations of these partitions, we decided to not define auto-discovery of these partition types within this specification.

Having a "full installer" that auto-generates UUIDs for each partition and binds those in /etc/fstab avoids all issues like this (except when LVM is in use).

The only way I can think of to address this in a "dd install" is to generate the machine id at install time, walk the target disk post-install and change the GUID/mount units. Ug. I guess CL has gone a long time with the current approach, and while the "foreign partition problem" does occur in the real world, it's definitely possible to work around it.

ajeddeloh commented 6 years ago

CL makes use of labels instead of uuids for partitions*. Ignition can bake in a specified uuid to your Ignition config but that means your machines will all have the same one. Since everyone is running off the same dd'd image anyway there's already duplicate uuids across machines (hasn't been a problem as far as I've seen). Labels actually work pretty well for this case. It means that while different machines will have partitions with different uuids, all of your config files and such (e.g. /etc/fstab, mount units, etc) are the same across machines and contain no machine-specific bits.
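The "no machine-specific bits" point can be sketched in a few lines. This is a toy comparison, not how CL or Ignition generate anything: the `VAR` label, the `xfs` type and the helper names are hypothetical, and the per-machine UUIDs are simply random values standing in for what an installer would generate.

```python
import uuid


def fstab_line(source: str, mount_point: str, fstype: str = "xfs") -> str:
    """Format a single /etc/fstab entry."""
    return f"{source} {mount_point} {fstype} defaults 0 0"


def config_by_label() -> str:
    """Entry keyed on the GPT partition label: identical on every machine."""
    return fstab_line("PARTLABEL=VAR", "/var")


def config_by_uuid(part_uuid: uuid.UUID) -> str:
    """Entry keyed on a partition UUID: differs whenever the UUID does."""
    return fstab_line(f"PARTUUID={part_uuid}", "/var")


if __name__ == "__main__":
    # Two machines whose partitions got distinct, installer-generated UUIDs.
    machine_a, machine_b = uuid.uuid4(), uuid.uuid4()

    # Label-based config is byte-for-byte the same on both machines.
    print(config_by_label())

    # UUID-based config has to be rendered per machine.
    print(config_by_uuid(machine_a))
    print(config_by_uuid(machine_b))
```

With labels, one config file can be baked into the image or the Ignition config; with UUIDs, something would have to template the value per machine.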

We could use uuids if Ignition supported templating, but that's a rabbit hole I really don't want to go down, especially since I see the "no machine specific bits" to be a feature.

*for the root-on-raid case we actually have special type guids for raid devices containing the root. This allows us to know which devices to start in the initramfs. FWIW I'd like to avoid this for FCOS. Hell, stick a config file in /boot that some tool in the initramfs knows how to read, for all I care.

JasonGiedymin commented 6 years ago

I have to chime in on this portion. Soon I will be open sourcing some lvm encryption work done on Atomic LVM managed volumes. We store keys remotely basically. During that exercise the part which stopped the system startup and decryption from being more elegant and robust was UUIDs (we're on bare metal with a diverse set of disks). If I had a more generic interface like labels to work with, life would have been easier.

cgwalters commented 6 years ago

One thing that's strongly implied by this is that Anaconda is probably not the primary path for bare metal installs; a whole goal of this is that "install" is just dd. By default we boot to a live image just like CL?

Side note: It'd be a nice twist to have the installer come as a privileged container.

However, a whole interesting question is how we generate that "base disk image" - I'd probably at least initially start with using Anaconda for it personally, but there's also the libguestfs-style disk image creation.

One question though; do we still support people using Anaconda (say that I want dm-crypt for everything, or XFS reflink=1, or...)? And there's a tension between Ignition for disk provisioning vs kickstart here.

bgilbert commented 6 years ago

@cgwalters Off-topic for this issue, but I don't see any reason to support installing via Anaconda, even optionally. If you want dm-crypt or XFS or whatever, we should be supporting that through Ignition.

@ajeddeloh Why don't you like the type GUID approach? It seems to work pretty well for CL.

ajeddeloh commented 6 years ago

Re installer in a container: we could package it as such, but it should also be so simple you don't need to. The current one is just a few hundred lines of bash (most of which is an ASCII-armored pubkey).

Re using anaconda bare metal installation: I really really really don't like this. Your Ignition config is the declaration of what your machine looks like; there shouldn't be anything else controlling that. It's also yet another thing to support. If we want to support dmcrypt or other filesystem/partition weirdness we should implement that in Ignition.

@bgilbert It's a hack from when we were thinking we shouldn't use the boot partition. We're going to need to store some config on the boot partition anyway for supporting encryption. It'd be much cleaner to just store a "map" for mounting the root partition in a config file. That would also eliminate the need to do GPT on RAID on GPT on Disks (and instead do GPT on RAID on Disks).

cgwalters commented 6 years ago

@cgwalters Off-topic for this issue, but I don't see any reason to support Anaconda, even optionally. If you want dm-crypt or XFS or whatever, we should be supporting that through Ignition.

I'd say it's a distinct thread but the installer path is pretty intertwined with our default partitioning. OK, with "full" Ignition-disks where we support re-locating the OS to the new root - yes. I know I keep getting hung up on this; the concept is foreign to me. I have concerns about the amount of overlap with Anaconda this is going to imply going forward, but...I also definitely see the value in the simplification here.

cgwalters commented 6 years ago

Soon I will be open sourcing some lvm encryption work done on Atomic LVM managed volumes. We store keys remotely basically.

Sounds like https://github.com/latchset/clevis ?

JasonGiedymin commented 6 years ago

I’ve modified https://github.com/HouzuoGuo/cryptctl to support LVM and stronger auditing with additional logging and events actions. It is deployed on all of our bare metal atomic nodes.

dustymabe commented 6 years ago

Discussed in the meeting today. This is what we came up with:

Potential issues:

With 4k sectors we need to consider the GPT partition layout as well as the filesystem on top. related CL issue. We will address this issue when we hit it and call it out as a risk for now.

I believe @bgilbert also had another item in open floor that was relevant to this ticket that was regarding a user filling up the root partition and not being able to receive updates any longer.

cgwalters commented 6 years ago

How about

1 - EFI-SYSTEM (needed for all EFI systems)
2 - BIOS-BOOT  (holds grub for systems that need bios booting)
3 - ROOT        3GB
4 - VAR  (rest of disk)

And we also make /etc a read-only bind mount by default (after Ignition has run). This would be a pretty aggressive stance; having a separate /var avoids downloading too many container images blocking OS updates. And IMO a read-only-by-default-/etc matches nicely with Ignition's configure-once model.

A big topic that crosses this though is whether or not we use LVM by default (and if we don't whether it can be configured via ignition). If neither, then that 3GB (or whatever) rootfs is going to feel less flexible.
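As a rough sketch of what the four-partition proposal means on a concrete disk, the following computes partition boundaries in MiB. Only the 3GB ROOT and "VAR takes the rest" come from the proposal; the EFI-SYSTEM and BIOS-BOOT sizes, the 1 MiB reserved at each end of the disk for GPT structures, and the function name are assumptions made for illustration.

```python
MIB = 1024 * 1024


def plan_layout(disk_bytes: int, efi_mib: int = 128, bios_mib: int = 1,
                root_mib: int = 3 * 1024) -> list[tuple[str, int, int]]:
    """Return (label, start_mib, end_mib) tuples; end is exclusive.

    The first MiB is left for the protective MBR and primary GPT, and
    the last MiB for the backup GPT (both assumptions of this sketch).
    """
    disk_mib = disk_bytes // MIB
    usable_end = disk_mib - 1
    cursor = 1

    layout = []
    for label, size in (("EFI-SYSTEM", efi_mib),
                        ("BIOS-BOOT", bios_mib),
                        ("ROOT", root_mib)):
        layout.append((label, cursor, cursor + size))
        cursor += size

    if cursor >= usable_end:
        raise ValueError("disk too small to leave any space for VAR")

    # VAR is whatever remains, so it grows with the disk while ROOT stays fixed.
    layout.append(("VAR", cursor, usable_end))
    return layout


if __name__ == "__main__":
    for label, start, end in plan_layout(10 * 1024 * MIB):
        print(f"{label:<10} {start:>6} MiB -> {end:>6} MiB ({end - start} MiB)")
```

The fixed ROOT size is the part that cannot adapt to the disk, which is why the choice of that number matters so much.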

bgilbert commented 6 years ago

My question in the meeting was: in CL, we can in principle continue to update a machine whose root filesystem is full, because we're only touching /boot and /usr and those are on separate partitions. Do we want to take steps to ensure the same for FCOS?

@cgwalters I like that model. In that case, the flexibility of the small root is only an issue for distro maintainers, since the user shouldn't be putting any data there. It's a potential concern (c.f. Fedora increasing the size of /boot some years ago) but seems like it should be manageable. I don't think LVM would help us anyway, since there's no guarantee that we can resize /var downward on existing systems if we later need a larger /.

JasonGiedymin commented 6 years ago

As long as I can dmcrypt /var later, and add lvm to new disks which later I can manage to dmcrypt.

I believe that for atomic now all users are mapped into that writable space in var. I think right now var and sysroot are also on a shared partition with atomic.

vtolstov commented 6 years ago

I use ostree on a netbook with eMMC, and a 3GB root is too small: sometimes I'm not able to upgrade. I've tested on servers and the netbook; the minimal rootfs size is 5-6GB, and I vote for LVM.

dustymabe commented 6 years ago

I use ostree on a netbook with eMMC, and a 3GB root is too small

is /var/ on separate filesystem for you?

vtolstov commented 6 years ago

/var on rootfs /var/home on separate lv (10Gb)

dustymabe commented 6 years ago

/var on rootfs

yeah. this is something colin is proposing we change to not be on rootfs by default, which would mean we would require less space for root.

vtolstov commented 6 years ago

@dustymabe no, I don't think this layout changes things. In my case:

    sudo du -hsx /var/
    468M /var/

All the space is used by ostree files:

    sudo du -hsx /ostree/
    2.8G /ostree/

ajeddeloh commented 6 years ago

Regarding LVM and Ignition. I want that to happen. Much like the partitioning work, it's going to be tricky to implement, but imo it's 100% worth doing. That being said I don't think we should have our standard partition layout be LVM based. Keep it simple.

Regarding moving /var to a separate partition: I'm in favor of this. Not only does it help with (although not completely avoid) the issue of / filling up, it's similar to how you can blow away the root fs on CL, which would be nice for existing CL users (as well as being nice in general).

Re: 4k sectors and GPT. I think we can ship a disk that supports both. The gpt spec lets you move the partition table around and only the actual contents of the GPT header are included in the CRCs (not the entirety of the sector). The only fixed things are that the GPT header must be at LBA1 and the backup must be at the last LBA. Since the header is <= 512 bytes, we can have both where they want to be for the primary and backup.

Here's what it would look like:

  Sector Size    |
512    |  4096   | Contents
----------------------------
 0     |   0     | MBR (protective gpt)
 1     |   0     | GPT Header for 512 byte sectors
 2-7   |   0     | Unused
 8     |   1     | GPT Header for 4k byte sector
 9-15  |   1     | Unused
16-n   | 2-n/8   | Partition table
       |         |
...    | ...     | Padding, partitions, etc
       |         |
N-(n+8)|N-(1+n/8)| Backup Partition table
N-7    |   N     | Backup GPT for 4k sectors
N      |   N     | Backup GPT for 512 byte sectors

There are a few problems/risks:

1) I don't know of any partitioning tool that supports creating this.
2) On a 4k disk only the 4k header will be updated. Likewise, on a 512 byte disk only the 512 byte header will be updated. This means as soon as you change a partition or even the disk guid the other one becomes invalid. This is fine imo.
3) Some partitioning tools might not implement the spec correctly and might expect the partition table / backup table to be at LBA2 / LBAn-1. We should audit the common ones.
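The sector arithmetic behind the hybrid table can be checked with a short sketch. It only computes byte offsets for the layout described above; it does not write a GPT, and the function and key names are made up for this example.

```python
SMALL = 512    # bytes per sector on a 512-byte disk
LARGE = 4096   # bytes per sector on a 4k-native disk


def hybrid_gpt_offsets(disk_bytes: int) -> dict[str, int]:
    """Byte offsets of the structures in the proposed hybrid layout."""
    if disk_bytes % LARGE:
        raise ValueError("disk size must be a multiple of 4096 bytes")

    last_lba_512 = disk_bytes // SMALL - 1
    last_lba_4k = disk_bytes // LARGE - 1
    return {
        # The primary header sits at LBA 1 under each sector size.
        "primary_header_512": 1 * SMALL,
        "primary_header_4k": 1 * LARGE,
        # Shared partition table: 4k LBA 2, which is 512-byte LBA 16.
        "partition_table": 2 * LARGE,
        # The backup header sits at the last LBA under each sector size.
        "backup_header_4k": last_lba_4k * LARGE,
        "backup_header_512": last_lba_512 * SMALL,
    }


if __name__ == "__main__":
    disk = 1 << 30  # a 1 GiB disk, chosen arbitrarily
    n = disk // SMALL - 1  # "N" in the table, in 512-byte sectors
    for name, offset in hybrid_gpt_offsets(disk).items():
        print(f"{name:<20} byte {offset:>12}  "
              f"512b LBA {offset // SMALL:>9}  4k LBA {offset // LARGE:>8}")
    print(f"N - 7 = {n - 7}")
```

Both backup headers land in the same final 4k sector: the 4k one at its start (512-byte LBA N-7) and the 512-byte one in its last 512 bytes (LBA N), which only works because the header itself is smaller than 512 bytes.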

cc @lucab for the GPT stuff.

dustymabe commented 6 years ago

closing this ticket as we've decided that a static partition layout is suitable for FCOS. Implementation details can be worked out later I believe.

ajeddeloh commented 6 years ago

Writing down a couple things mentioned elsewhere about the 4k/512 hybrid plan for the record.

My proposal technically breaks the GPT spec since the space in a sector after the GPT bits is defined as being all zeros. Whether anything cares is another question. It would also use just about every "feature" GPT has, and thus be at risk of not working on machines with poorly implemented EFIs (or very well implemented EFIs that check things are zeroed accordingly).

cgwalters commented 5 years ago

Having a split /var is blocked on https://github.com/dustymabe/ignition-dracut/issues/18

cgwalters commented 5 years ago

The fire alarm went off at the Westford office today and I happened to be standing near Vivek Goyal and Mike Snitzer (kernel filesystem/block people). Mike in particular said that supporting both 4k and 512b in one disk image couldn't be done because the filesystems rely on sector writes being atomic.

It seems like the simplest plan is to just make two disk images?

ajeddeloh commented 5 years ago

Yeah probably. SGTM.

snitm commented 5 years ago

The fire alarm went off at the Westford office today and I happened to be standing near Vivek Goyal and Mike Snitzer (kernel filesystem/block people). Mike in particular said that supporting both 4k and 512b in one disk image couldn't be done because the filesystems rely on sector writes being atomic.

It seems like the simplest plan is to just make two disk images?

Right, sadly you cannot issue 512b IO to a native 4K device. A filesystem (e.g. XFS) that is formatted to use 512b assumes 512b is the atomic unit of IO. It'll fail to mount if the underlying device is actually a native 4K block device.

You might think to go the other way and try to format the filesystem with a 4K blocksize and use that single FS image for both 512b and 4K devices. BUT, there is increased potential for a partial 4K write to a 512b device to leave the device with 512b IOs having been written (yet the larger 4K being incomplete) -- this is also known as "torn writes".
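A toy simulation of a torn write may help here. It models a 512b device as a byte array and interrupts a 4K write after a given number of sectors have landed; the function name and the fill patterns are invented for this sketch, and real devices and filesystems are of course far more involved.

```python
SECTOR = 512
BLOCK = 4096


def torn_write(device: bytearray, block_index: int, data: bytes,
               sectors_completed: int) -> None:
    """Apply only the first `sectors_completed` sectors of a 4K write.

    Each 512-byte sector is written atomically, but the 4K block as a
    whole is not: a crash part-way through leaves a mix of old and new.
    """
    if len(data) != BLOCK:
        raise ValueError("data must be exactly one 4K block")
    if not 0 <= sectors_completed <= BLOCK // SECTOR:
        raise ValueError("sectors_completed out of range")
    base = block_index * BLOCK
    done = sectors_completed * SECTOR
    device[base:base + done] = data[:done]


if __name__ == "__main__":
    device = bytearray(b"A" * BLOCK)      # old contents of one 4K block
    torn_write(device, 0, b"B" * BLOCK, 3)  # power lost after 3 of 8 sectors

    new_part = device.count(b"B")
    old_part = device.count(b"A")
    print(f"{new_part} bytes of new data, {old_part} bytes of old data")
```

A filesystem formatted with a 4K block size assumes it will see either all of the old block or all of the new one, so this mixed state is exactly what it is not prepared for.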

cgwalters commented 5 years ago

I've been thinking about the dm-crypt aspect again. Some prior discussion is in https://github.com/coreos/ignition/issues/577

One thing I'm wavering on a bit is how clunky it feels to rewrite all of the operating system files on boot. If we're in a cloud scenario, we don't have a lot of choice unless we provide people a tool for creating new snapshots (big implications there).

On bare metal though, I think we could instead do a "minimal re-partitioner" (not quite an installer) that created dm-crypt on the target system, then took the raw disk image and mounted it, and did a filesystem-level copy.

(Aside: I believe Android images encrypt on boot when you initialize them the first time, and this is probably a lot nicer since they switched to using ext4 encryption. Although I'm not sure the OS is ever encrypted, it's dm-verity.)

cgwalters commented 5 years ago

(The reason I'm thinking about dm-verity is that there are definitely server-side uses for it, but I'd like Silverblue to inherit as much technology as possible from CoreOS, and dm-crypt is really quite important on client-side devices)

ajeddeloh commented 5 years ago

I hope we're using LUKS not just dm-crypt unless there's a good reason not to.

We want to ensure that Ignition remains the only "source of truth" for configuration. The Ignition config may not be known at install time, so we don't want to do anything special at install time.

I instead wonder if we could add an "optimization" to Ignition/the initramfs to detect if we're recreating the root and save the repo to a tmpfs or something similar, so if we blow away the root we can repopulate it from a local source instead.

Finally, what are the use cases for encrypting more than just /var?

cgwalters commented 5 years ago

I hope we're using LUKS not just dm-crypt unless there's a good reason not to.

Hmm; I had to look up the distinction between the layers here, I had always been using them interchangeably.

Finally, what are the use cases for encrypting more than just /var?

One tricky thing is a lot of use cases want /etc encrypted too. Having /usr encrypted provides integrity at least, and some backstop confidentiality in case there are any secrets.

jlebon commented 5 years ago

How about

1 - EFI-SYSTEM (needed for all EFI systems)
2 - BIOS-BOOT  (holds grub for systems that need bios booting)
3 - ROOT        3GB
4 - VAR  (rest of disk)

Hmm, we'll have to think long and hard before choosing a size for ROOT. We don't want to realize down the line that e.g. the f31 -> f32 update won't fit. One issue too is that there might be more than just two commits in there. E.g. layered pkgs & pinned deployments (and I think there were discussions making the number of rollback deployments configurable?). So settling on the right size will be tricky no matter what.

ajeddeloh commented 5 years ago

Agreed. We could publish guidelines saying "resize to X if you plan on doing a bunch of pinning or other things that take up a lot of space", but that's not ideal. It shouldn't be nearly as bad as it was with CL since ostree dedups across deployments. My guess is ~2x the size of a single deployment should be ok.

cgwalters commented 5 years ago

(Happened to stumble across https://bugzilla.redhat.com/show_bug.cgi?id=1061478 )

cgwalters commented 5 years ago

Let's call what we have now with the FCOS preview release "phase 1". Phase 2 work: https://github.com/coreos/fedora-coreos-tracker/issues/94