bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

LVM PV to fill the primary disk #3152

Open tommie opened 1 year ago

tommie commented 1 year ago

Caveat I'm new to OpenEBS, and I apologize if anything here is misleading or wrong.

What I'd like

BROS currently fills the primary disk with the data partition, which is formatted as ext4. This is useful for container images, logs, and similar ephemeral data.

In my attempts to build a production-grade Kubernetes cluster that can scale from 1 to many nodes, I'm looking for a way to run OpenEBS in a way that can use the spare capacity for replicated block storage.

E.g. Hetzner's cheap servers (Hetzner Robot, not Hetzner Cloud) can have TBs of storage, sometimes with HW RAID. They often have two disks of the same size. Either disk configuration (left separate, mirrored, or striped) makes at least one device wastefully large for the BROS data partition. It would be very useful to install BROS, tag the node with "is block storage", and have the unused capacity show up as available OpenEBS block devices. (Any non-primary disk would be no problem, since BROS doesn't touch them.)

Any alternatives you've considered

OpenEBS supports two modes: local and replicated, each with multiple drivers available.

Local PVs

For local PVs, useful only for StatefulSets, OpenEBS can use a local directory (hostpath), a local block device, or a ZFS/LVM volume on the node.

The local directory is the only currently viable option, pointing it at the data file system. A local block device is not viable right now, since the disk is filled by the data partition. Neither of these drivers provides OpenEBS benefits like snapshotting or backups. Using a local directory would also interfere with the Kubelet's quota isolation versus OpenEBS. Using a local block device is the simplest to implement, but requires statically deciding the size of the data partition, which may not scale well. (Note that we're talking about stateful servers, so "just tear it down and make a new one" may be more painful/costly than in the normal BROS use case.)

The ZFS and LVM drivers are the most versatile, in that OpenEBS can orchestrate snapshotting and backups. (And of the two, ZFS is the more capable; see the links above.) They require appropriate kernel support and user-space utilities, so supporting this would mean shipping the corresponding kernel modules and tools in the BROS image.

Conclusion Combining versatility and simplicity, I propose supporting the LVM driver for ReadWriteOnce volumes, used by StatefulSets. This allows adjusting the data file system easily after the fact, while still making use of OpenEBS's advanced snapshotting capabilities. Using ZFS would be a larger change to BROS.
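As a concrete illustration (not an actual BROS mechanism), node preparation for the LVM local PV driver could look roughly like this, assuming a spare partition and the lvm2 user-space tools; the lvm-localpv StorageClass would then reference the volume group by name:

```bash
# Hypothetical sketch: prepare a volume group for openebs/lvm-localpv.
# /dev/sdb1 is a placeholder for a spare, unformatted partition.
pvcreate /dev/sdb1               # initialize the partition as an LVM physical volume
vgcreate openebs-vg /dev/sdb1    # volume group the CSI driver can carve LVs out of
vgs openebs-vg                   # verify; the StorageClass points at this VG name
```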

Replicated PVs

For replicated PVs, the cStor driver is essentially a forked ZFS, but is managed outside the kernel. In this case, one gives OpenEBS unused block devices, and it will create zpools and sprinkle stardust. (I don't believe ZFS kernel modules are required for this.)

Using this requires having open-iscsi installed on the nodes, and of course leaving an unused block device available. There is no specific benefit to using LVM here, since snapshotting and backup functionality is implemented in the cStor layer. However, using LVM would bring the same benefit as in the local case: the ability to adjust the data file system size easily while staying online.

Conclusion Capping the data partition at a maximum size and adding a dedicated partition to fill the rest of the disk would work, but using LVM makes it far easier to adjust the split between the data file system and the cStor device at installation time.

Together

For both the simple local case and replicated volumes, it seems that using LVM for the data partition has benefits. The default could still be that the data file system spans the entire LVM PV, while a user-data option would give it a maximum size. Any remaining LVM space would be left free for block-storage PVs.
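To make the proposal concrete, here is a rough sketch (device, VG, and LV names are purely illustrative, not an actual BROS implementation) of what the resulting layout could look like with a size-capped data LV:

```bash
# Illustrative only: proposed layout where BOTTLEROCKET-DATA sits on LVM.
# /dev/sda9 stands in for the partition that today holds the ext4 data filesystem.
pvcreate /dev/sda9                     # the whole "data" partition becomes an LVM PV
vgcreate bottlerocket /dev/sda9        # a single VG spanning that PV
lvcreate -n data -L 100G bottlerocket  # data LV capped by a (user-data supplied) max size
mkfs.ext4 /dev/bottlerocket/data       # the ext4 data filesystem lives on the LV
vgs bottlerocket                       # free extents remain available for OpenEBS volumes
```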

tommie commented 1 year ago

I just had a look at what Rook/Ceph requires. It can use raw partitions, but for encryption-at-rest, it looks like it uses LVM.

AFAICT, Rook doesn't require open-iscsi, which is nice from a BROS perspective.

So for partition encryption, and resizing, LVM would be useful in the Ceph case as well.

yeazelm commented 1 year ago

Thanks @tommie for the detailed issue! There is a lot to unpack from all the info you provided. We would need to take a look at what this might entail to get LVM working for the data partition. We have device-mapper configured, but as their docs call out, OpenEBS assumes quite a few things are installed on the host OS depending on your plans:

* Based on the selected data engine, the nodes should be prepared with additional packages like:
  * Installing the ext4, xfs, nfs, lvm, zfs or iscsi, nvme packages.
  * Prepare the devices for use by data engines, e.g. making sure there is no filesystem installed, creating an LVM volume group or ZFS pool, or partitioning the drives if required.

In this case, we probably would need to have the lvm tools in the image, plus the data partition formatted as an LVM PV like you originally called out. There is likely a bit more work than just adding the tools, since changing that data partition comes with its own set of concerns to solve. It is an interesting use case, thanks for reporting it!

tommie commented 1 year ago

Thanks for the response.

I don't think OpenEBS is my main issue right now, because I could just as well go with Ceph. (Still exploring options.) But I think the packages you listed are an exhaustive list for doing everything. I'm guessing that most OpenEBS users are content with running ext4 over iSCSI, with LVM on the server.

If there's a story on how to get hold of an unformatted partition to be used for some block storage solution, that would be a great start. I.e. thinking about the size trade-off between DATA and this "VOLUME" partition, and how to configure it.

yeazelm commented 1 year ago

If there's a story on how to get hold of an unformatted partition to be used for some block storage solution, that would be a great start. I.e. thinking about the size trade-off between DATA and this "VOLUME" partition, and how to configure it.

This is an interesting question with a lot of "it depends" answers, so I'll try to answer as specifically as I can. Bottlerocket specifically treats the "data" partition as something to be found and expanded on boot: repart-local.service. So it is intended to be expanded to fill the rest of the disk. The primary thing to keep in mind is that Bottlerocket itself will try to adjust the size of the partition on the disk where BOTTLEROCKET-DATA lives; otherwise, the OS doesn't really do any auto-formatting for other disks. So if you attach another disk, it will just be left raw, like you called out. This is where CSI drivers take over and can manage additional disks (assuming they don't need anything in the host OS that Bottlerocket doesn't provide).

I say all this to hopefully answer your question around our current mental model for this DATA partition. The BOTTLEROCKET-DATA partition really exists to solve "what do you do when you have a read-only root filesystem", and not necessarily to provide large, general-purpose storage. We do store container images there, along with logs and configuration, but we haven't really considered this space as something to share for cluster storage directly via multiple partitions.

tommie commented 1 year ago

I have had some time to play with Rook Ceph, and it's just too resource-hungry to run on a single small node (in terms of ephemeral disk, RAM, and CPU). It looks like I'll need to build something that can start small and migrate to something bigger.

Re. the size trade-off: so far, I've been emulating actual partitioning by using fallocate to create a file on DATA and attaching it to a loop device. I have contemplated some models:

  1. Use a minimum-data-size and a minimum-volume-size, so that if the disk is smaller than their sum, no volume partition is created. This maximizes the size allocated to PVs. The key assumption here is that DATA doesn't need to grow, because it is created in proportion to other hardware resources, like CPU and RAM. This is what I'm using for now, playing around.
  2. Use a half-half approach and a minimum-useful-size constant. Split the disk in two halves. As above, below some minimum size, they become unusable. Since we're concerned with disks on the order of 1 TiB, and a DATA partition can be useful even at 1% of that, the question is just about avoiding building a DATA or PV partition that is unusably small. So if the machine we're running on only has a 10 GiB disk, then don't create a PV partition, but if it has 1 TiB, it probably doesn't matter if we split the disk 10 + 990 GiB or at 500 + 500 GiB.
  3. Use a maximum-data-size. On a single host, having a 500 GiB DATA is probably wasteful. If the disk is larger than 2 * maximum-data-size, then create a PV partition from everything above maximum-data-size. The drawback is that I think it's easier to identify a minimum usable DATA size than a maximum. However, this could make it easy to differentiate small/medium cloud instances from dedicated servers that focus on disk. The maximum could be 512 GiB or so, as per the discussion under (2).

Of course, we could make it proportional, or based on CPU and RAM, and complicated. But for provisioning, I think what matters is (1) avoiding filling the disk with DATA, and (2) avoiding unusably small partitions. The rest could be done with a bootstrap/host container. Another "issue" here is that bootstrap containers run after repart-local, meaning we can't customize the repart configuration without changes to Bottlerocket, and once the bootstrap container is running, the file system is already in use.
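As a purely hypothetical illustration of model 3 above, a provisioning script could decide the split roughly like this, with MAX_DATA_GIB standing in for the proposed maximum-data-size setting:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of sizing model 3: cap DATA and give the rest to a volume partition.
set -euo pipefail

DISK=/dev/sda        # placeholder for the primary disk
MAX_DATA_GIB=512     # illustrative maximum DATA size

disk_gib=$(( $(blockdev --getsize64 "$DISK") / 1024 / 1024 / 1024 ))

if (( disk_gib > 2 * MAX_DATA_GIB )); then
  data_gib=$MAX_DATA_GIB                  # large disk: cap DATA...
  vol_gib=$(( disk_gib - MAX_DATA_GIB ))  # ...and leave the rest for block-storage PVs
else
  data_gib=$disk_gib                      # small disk: DATA keeps the whole disk
  vol_gib=0
fi

echo "DATA: ${data_gib} GiB, volume space: ${vol_gib} GiB"
```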


Next in line to test is https://github.com/openebs/lvm-localpv, which should be the minimum required for getting dynamic provisioning on a one-host system. Needless to say, this will definitely require LVM. Just adding the kernel modules would be enough to test with files and loop devices, but I still think it would be appropriate if DATA were sitting on top of LVM.

What are the chances of adding LVM kernel modules in Bottlerocket Metal? NVM, this already works. (So LVM isn't a separate thing anymore, but part of DM, it seems.)

tommie commented 1 year ago

We have device-mapper configured,

Confirmed: using the LVM2 tools works in a bootstrap container. I can set up a loop device, run pvcreate and vgcreate, and the openebs/lvm-localpv provisioner picks it up.
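For reference, a minimal sketch of that loop-device setup, with the backing-file path and size being purely illustrative:

```bash
# Sketch: back an LVM PV with a file so lvm-localpv can be tested without repartitioning.
# The backing-file path and size are illustrative.
fallocate -l 100G /local/openebs-backing.img                  # reserve space on the DATA filesystem
LOOPDEV=$(losetup --find --show /local/openebs-backing.img)   # attach it to a loop device
pvcreate "$LOOPDEV"                                           # the loop device becomes an LVM PV
vgcreate openebs-vg "$LOOPDEV"                                # VG that openebs/lvm-localpv provisions from
```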

As for moving DATA on top of it... There are broadly three changes that would be needed:

  1. The repart-local service first needs to create/enlarge the PV partition to fill the disk, i.e. the same logic as now, except using LVM tools (pvcreate/pvresize). I don't think the VG needs any special attention.
  2. An LV must be created for DATA. It needs to be able to grow up to some maximum.
  3. The boot code needs to know the difference between DATA-PV and DATA-LV block devices, regarding what is being resized and where the file system is created.

systemd-repart doesn't seem to support resizing LVM objects, so we would have to call the PV and LV commands directly. If the image doesn't contain a stub PV partition, this requires the pvcreate, vgcreate, and lvcreate commands. If the image ships with a stub PV/VG/LV already created, pvresize and lvresize are needed instead.
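A rough sketch of that boot-time create-or-grow logic (partition, VG, and LV names are placeholders; error handling and the actual size policy omitted):

```bash
# Sketch: what repart-local would need to do for an LVM-backed DATA partition.
PART=/dev/sda9   # placeholder for the enlarged data partition
VG=bottlerocket
LV=data

if ! pvs "$PART" >/dev/null 2>&1; then
  # No stub PV in the image: create everything from scratch.
  pvcreate "$PART"
  vgcreate "$VG" "$PART"
  lvcreate -n "$LV" -L 100G "$VG"     # 100G stands in for the configured maximum
else
  # Stub PV/VG/LV already present: grow them after the partition was enlarged.
  pvresize "$PART"
  lvresize -L 100G "/dev/$VG/$LV"     # grow the data LV up to its configured maximum
fi
resize2fs "/dev/$VG/$LV"              # grow the ext4 filesystem to match the LV
```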

tommie commented 1 year ago

I'm not sure pvscan --cache -aay is required on boot, but it probably should be run: https://gitlab.com/lvmteam/lvm2/-/blob/main/scripts/lvm2-pvscan.service.in
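For completeness, the activation step that service performs could also be run from a boot-time script; a minimal sketch (device path illustrative):

```bash
# Sketch: activate LVM volumes at boot if lvm2-pvscan.service isn't shipped.
pvscan --cache -aay /dev/sda9   # cache PV metadata and auto-activate complete VGs on it
# Alternatively, activate every visible VG:
vgchange -ay
```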