RFD 154 Flexible disk space for bhyve VMs

mgerdts commented 5 years ago

For discussion of:

https://github.com/joyent/rfd/blob/master/rfd/0154/README.md

mgerdts commented 5 years ago

@max123 asked about the following scenario:

Provision with three disks. They become disk0, disk1, and disk2.
Remove disk1
Provision a new disk

Does the new disk get provisioned as disk1 or disk3?

My thoughts:

For MVP, a new disk will be created in the next available slot. That is, disk0 is occupied, so it can't go there. disk1 is not occupied, so that slot is taken.
Longer term, CreateMachineDisk may allow a slot input field that allows the disk to be placed into any unoccupied slot. Presumably this would allow a slot field in the disks list passed via CreateMachine.

The longer term plan suggest the use of slot names ("disk0", "disk1", ...) instead of ZFS guids would be a good idea.

kusor commented 5 years ago

Some comments:

Do we really need to drop image_uuid compatibility with disks? Can't we keep image_uuid as the image for the bootable disk?
I'm not sure if we can pass all these params to vmadm from CloudAPI. I'm wondering we'll need to review and possibly update VMAPI. Not sure about cn-agent tasks, though.
An update to AdminUI could also be required, at least in order to show the "missing" fields for customers VMs. More work required if we want ops to be able to create machines with disks payload from AdminUI, which I'd assume we want.

Regarding size limitations for package/image changes, I think it's a good idea to establish the maximum size a given machine, provisioned with some initial constraints, can be resized to automatically. I'd say we want to keep possible any change, but those we could call unexpected or dangerous due to fraud would require to go through approval.

On a side note, given we do all the CloudAPI constraints through plugins, we should review the way we want to apply disks related package and limits here and, why not, figure out if we want to provide a more "handy" way for ops to change those limits.

mgerdts commented 5 years ago

Do we really need to drop image_uuid compatibility with disks? Can't we keep image_uuid as the image for the bootable disk?

As a simplifying first step, that sounds fine. Let's make it so that each object in the disks array can only specify size. In the future we can extend this to provide an image, which slot a particular disk should occupy, preferred block size, etc.

mgerdts commented 5 years ago

@kusor and I chatted about

I'm not sure if we can pass all these params to vmadm from CloudAPI. I'm wondering we'll need to review and possibly update VMAPI. Not sure about cn-agent tasks, though.

He said:

We'll need to update VMAPI here to validate proper disks input https://github.com/joyent/sdc-vmapi/blob/master/lib/common/validation.js#L1136 And be careful with machine_create cn-agent task depending on what we do regarding image_uuid and disks: https://github.com/joyent/sdc-cn-agent/blob/master/lib/backends/smartos/tasks/machine_create.js#L191-L195

It's not clear that these details are worthy of mentioning in the RFD, but they do point to tasks that will need to be assigned as we break down the work for this effort.

mgerdts commented 5 years ago

RFD 34 says:

DAPI will be used to provision a new (target) instance with the same set of provisioning parameters (you can think of this as an instance reservation) on a different CN (the target CN).

Some iterations of this proposal have allowed the specification of images for data disks and have allowed removal of data disks. That would be a conflict with RFD 34. To not endanger the success of RFD 34, these more advanced cases (data disk image, device removal) need to be delayed until after RFD 34 is stable and we can figure out how to record image changes so migrations are not broken.

askfongjojo commented 5 years ago

Sorry about the late feedback. Here is some feedback which I've shared separately and some more I just came up with:

CreateMachine: I think there is a need to maintain backwards compatibility and for it to work for simple cases. When no disk is passed, maybe the system should fall back to the current behavior.
Resize usability: The proposed design requires user to understand the right sequence of atomic actions. For example, when resizing up, the user needs to first resize up the machine package, then invoke ResizeMachineDisk or CreateMachineDisk to add more disk space or disks. Likewise, when resizing down or deleting disks, user has to do the steps in reverse order, i.e. ResizeMachineDisk or DeleteMachineDisk first, then resize down the machine package. It'll be nice to make it a composite operation from a usability perspective. The drawback is obviously that if one of the atomic operations fails, it'll leave behind a partially resized machine.
Resize for the purpose of redistributing quota: Do we want to support a resize action to re-distribute disk space between two disks without changing the overall disk quota for the machine (e.g. re-allocate 50G from disk 1 to 0)? This kind of change can be very tedious to do based on the current proposal. I don't know if it's an important enough use case for us to better support it.
Flexible to non-flexible disk transition: Can user move from flexible to non-flexible disk package (though I can't think of a reason why one would want to do so)? It may be helpful to prohibit that from an operational and testing perspective (e.g. being able to assume that all instances with the non-flexible disk attribute were created prior to this feature, not having to ensure that switching back-and-forth between the two settings will work in all cases).
The table that "outlines the results when used with various images": Apart from "snowflake images" (e.g. those that have root disks resized up manually by operators), is it possible to have images that exceed 10GB in size? Currently when creating images from VM, it's only retaining data in the root disk which can't exceed the fixed quota of 10G. Also there is a typo in the table heading (right-most column should read "Flexible...".
vmapi scope: Apart from adding validations in vmapi for the disk size math, it also needs to be enhanced to support the disk operations. From the vmapi docs, it doesn't look like it currently supports modification of disks. The same operations also need to be supported by node-sdc-clients.

Finally from the verbal discussions I had with mgerdts, the feature is not intended to support arbitrary resizing of disks that are outside the confine of the package quota. IOW, when resizing a disk, user must first specify a package that has the disk quota that matches the summation of the desired disk quota. This wasn't obvious to me initially. It may be good to call that out in the introduction.

cburroughs commented 5 years ago

(minor) I think it would be cleaner for the cli if we did triton instance disk foo instead of triton instance foo-disk. This is how, for example, nics are treated.

If I have an inflexible smatos instance, does triton instance disks show me just my single old-school "disk", an error, or something else?

mgerdts commented 5 years ago

@askfongjojo said:

CreateMachine: I think there is a need to maintain backwards compatibility and for it to work for simple cases. When no disk is passed, maybe the system should fall back to the current behavior.

That is the intent of what I wrote, but will make it explicit.

That being said, I think we probably want to allow packages to specify their own defaults. This is the intent of the Default disks in package section. This will allow us to create packages that default to leaving space for snapshots.

Resize usability: The proposed design requires user to understand the right sequence of atomic actions. For example, when resizing up, the user needs to first resize up the machine package, then invoke ResizeMachineDisk or CreateMachineDisk to add more disk space or disks. Likewise, when resizing down or deleting disks, user has to do the steps in reverse order, i.e. ResizeMachineDisk or DeleteMachineDisk first, then resize down the machine package. It'll be nice to make it a composite operation from a usability perspective. The drawback is obviously that if one of the atomic operations fails, it'll leave behind a partially resized machine.

Surely we can have some refinement to make this easier, but we also need to have these basic operations. That is, a customer may be wishing to move to a larger package to accommodate a series of snapshots they intend to create over the next few hours. We must not make it so that moving to a larger package unconditionally uses that space (e.g. by growing the data disk).

I propose that we start with these basic operations and get customer feedback on which composite operations are important.

Resize for the purpose of redistributing quota: Do we want to support a resize action to re-distribute disk space between two disks without changing the overall disk quota for the machine (e.g. re-allocate 50G from disk 1 to 0)? This kind of change can be very tedious to do based on the current proposal. I don't know if it's an important enough use case for us to better support it.

Redistributing disk space implies shrinking a disk. Shrinking a disk is only safe if the disk has not had a partition table or file system make use of the space that is being taken away. That is, I expect that shrinking a disk will almost always lead to data loss in the guest. We should not make that attractive or easy to do.

I'm really aiming for shrinking a disk to be only as a recovery from "oops, I grew the wrong disk." I've been half expecting someone to talk me out of even offering support for shrinking disks. Likewise, I'm on the fence about removing disks.

The time when these features seem likely to be used the most is in testing.

Flexible to non-flexible disk transition: Can user move from flexible to non-flexible disk package (though I can't think of a reason why one would want to do so)? It may be helpful to prohibit that from an operational and testing perspective (e.g. being able to assume that all instances with the non-flexible disk attribute were created prior to this feature, not having to ensure that switching back-and-forth between the two settings will work in all cases).

I had thought about adding flexible to non-flexible but couldn't come up with a good reason to have it. "To support testing" seems like a good reason. Regardless, it would be more work to breakvmadm update $uuid flexible_disk_size= (flexible to non-flexible instance conversion, ignoring package) than it is to support it.

The table that "outlines the results when used with various images": Apart from "snowflake images" (e.g. those that have root disks resized up manually by operators), is it possible to have images that exceed 10GB in size? Currently when creating images from VM, it's only retaining data in the root disk which can't exceed the fixed quota of 10G.

An image can be larger than 10 GiB with:

Deploy an instance using a 10 GiB image, setting the boot disk size to something larger.
Create an image from that instance

Also there is a typo in the table heading (right-most column should read "Flexible...".

ok, will fix

vmapi scope: Apart from adding validations in vmapi for the disk size math, it also needs to be enhanced to support the disk operations. From the vmapi docs, it doesn't look like it currently supports modification of disks.

ok, will fix.

The same operations also need to be supported by node-sdc-clients.

ok, will fix.

Finally from the verbal discussions I had with mgerdts, the feature is not intended to support arbitrary resizing of disks that are outside the confine of the package quota. IOW, when resizing a disk, user must first specify a package that has the disk quota that matches the summation of the desired disk quota. This wasn't obvious to me initially. It may be good to call that out in the introduction.

That's not quite right - particularly the "user must first..." part.

Resizing of the disks outside of the package quota is not supported. However, that doesn't mean that resizing a disk necessarily means moving to a new package. It could be that an instance was created in a way that not all of the quota (package.disk) is allocated to disks. That space may have been intentionally left idle (for snapshots, future disk creation, or future disk growth) or may be recently idle (due to snapshot or disk deletion).

I'm not sure if "disk quota that matches the summation" in the statement above is intended to be a strict match or if it was just a loose way of saying "the sum of disk space needs to be less than or equal to the package's quota." If "matches" was intended to mean an exact match, it should be understood that this meaning is not consistent with essential aspects of this proposal.

mgerdts commented 5 years ago

(minor) I think it would be cleaner for the cli if we did triton instance disk foo instead of triton instance foo-disk. This is how, for example, nics are treated.

ok, will fix.

If I have an inflexible smatos instance, does triton instance disks show me just my single old-school "disk", an error, or something else?

If it is a bhyve instance, it will show you all of the disks. More inflexible instances, that will typically mean that it will show you two disks: the boot disk and a data disk.

joshwilsdon commented 5 years ago

I haven't read everything carefully so I might be duplicating comments made elsewhere, but a few notes...

https://jira.joyent.us/browse/OS-7004 should probably be fixed with this too
DAPI will probably need to be updated. https://github.com/joyent/sdc-designation/blob/master/lib/algorithms/calculate-server-unreserved.js#L132-L230 calls disk usage calculation "a bit involved". I think that's understating the situation.
- Related to ^^ we should probably do an overhaul on what parameters we're passing from cn-agent to CNAPI wrt usage on a server. That happens here: https://github.com/joyent/sdc-cn-agent/blob/3ce58a35d8a2603ea11065d7367279299a8d09eb/lib/backends/smartos/index.js#L168-L405
- Not public (sorry public readers) but https://mo.joyent.com/docs/engdoc/master/resize/index.html is related reading here.

Also a more general question... The problem statement says the motivation here was that customers want a single disk rather than having a separate os and data disk, why does this proposal include support for anything other than that? Why not just make it so bhyve VMs always have 1 disk that's always created from the image and then resized to the quota (minus whatever space they want to reserve for snapshots) and simplify things by not supporting multiple disks at all? That seems like it would make the system simpler rather than more complex while still giving customers what they want. It also seems like it would eliminate the problems discussed wrt "gratis space" in that two VMs would get the same total space regardless of image.

jclulow commented 5 years ago

Why not just make it so bhyve VMs always have 1 disk that's always created from the image and then resized to the quota (minus whatever space they want to reserve for snapshots) and simplify things by not supporting multiple disks at all?

Speaking for myself, I'd like to be able to take the quota for a package and slice it up into multiple disks. We'd like to be able to run the ZFS test suite in a VM provisioned via CloudAPI, and it requires something like five disks other than the OS root disk to work correctly.

joshwilsdon commented 5 years ago

We'd like to be able to run the ZFS test suite in a VM provisioned via CloudAPI, and it requires something like five disks other than the OS root disk to work correctly.

Do we think it's going to be common for customers to want to run the ZFS test suite? Maybe we should add that as one of the requirements to the Problem Statement section if that's one of the things customers have been asking for, and that's one of the use-cases we're trying to optimize for here.

jclulow commented 5 years ago

Do we think it's going to be common for customers to want to run the ZFS test suite?

I'm not sure, but I do know this: If we can't do it with Triton, I'll have to find another cloud for this workload -- obviously that'll be pretty disappointing when we could be eating our own dog food.

joshwilsdon commented 5 years ago

@jclulow does the ZFS test suite not support disk partitions? Or loopback devices?

jclulow commented 5 years ago

I believe some of the tests involve things that depend on speaking to what appears to the operating system to be an actual disk.

joshwilsdon commented 5 years ago

Do you have a link to the documentation or the code for the specific ZFS test suite you're talking about here? The only documentation I was able to find for "ZFS test suite" was https://github.com/zfsonlinux/zfs/tree/master/tests which uses loopback devices by default.

timfoster commented 5 years ago

https://github.com/openzfs/openzfs/blob/master/usr/src/test/zfs-tests/doc/README talks about the setup and configuration, requiring a 'DISKS' variable. Throughout the test suite, there's expectations that we're operating on real disks, e.g. https://github.com/openzfs/openzfs/blob/master/usr/src/test/zfs-tests/include/libtest.shlib#L718

mgerdts commented 5 years ago

I haven't read everything carefully so I might be duplicating comments made elsewhere, but a few notes...

https://jira.joyent.us/browse/OS-7004 should probably be fixed with this too

I've updated labels and dependencies so that it doesn't get lost. In any case, that would have come up in tests that get written for this.

DAPI will probably need to be updated. https://github.com/joyent/sdc-designation/blob/master/lib/algorithms/calculate-server-unreserved.js#L132-L230 calls disk usage calculation "a bit involved". I think that's understating the situation.

Related to ^^ we should probably do an overhaul on what parameters we're passing from cn-agent to CNAPI wrt usage on a server. That happens here: https://github.com/joyent/sdc-cn-agent/blob/3ce58a35d8a2603ea11065d7367279299a8d09eb/lib/backends/smartos/index.js#L168-L405

Thanks for the pointers. We will add detail in these areas.

Not public (sorry public readers) but https://mo.joyent.com/docs/engdoc/master/resize/index.html is related reading here.

Thanks! I have given it a quick read (and it deserves another more careful read). If we go with the assumption that resize is an important feature that we wish to support, we need to be aware that DAPI's packing algorithm may be working against this feature. That may speak to a need for future collaboration between the resize feature and the migrate feature (RFD 34) to ensure the migration target is selected with the new size in mind.

Also a more general question... The problem statement says the motivation here was that customers want a single disk rather than having a separate os and data disk, why does this proposal include support for anything other than that? Why not just make it so bhyve VMs always have 1 disk that's always created from the image and then resized to the quota (minus whatever space they want to reserve for snapshots) and simplify things by not supporting multiple disks at all? That seems like it would make the system simpler rather than more complex while still giving customers what they want. It also seems like it would eliminate the problems discussed wrt "gratis space" in that two VMs would get the same total space regardless of image.

At no point have I heard that users/customers want to be restricted to using only one disk. To the contrary, I did hear that the second drive is still needed. Another community member would like to be able to create boot images on data disks ala Amazon EBS Surrogate. The full solution for that would require IMGAPI changes as well.

KVM resize seems to be a rather popular topic (SUP-642, SWSUP-1172, TRITON-350, SOLENG-350, and others) and I expect that trend to continue as customers transition from LX (and KVM) to bhyve. I had a phone conversation with a customer last week that indicated the same. I'll update the problem description.

joshwilsdon commented 5 years ago

At no point have I heard that users/customers want to be restricted to using only one disk.

Ok. The reason I thought this, is that the only things in the Problem Statement talking about the number of disks were:

Some customers see no value in splitting space between the root disk and the data disk. Rather, this causes extra work because it forces application configurations to be customized to use non-standard paths and/or causes confusion for users. These customers tend to request that all space is allocated to the boot disk.

and:

Additionally, customers that may be transitioning from LX to bhyve may have an easier transition without having space fragmented across the root and data file systems.

Which both talk only about reducing the number of disks to one disk. There was nothing in the Problem Statement about using more than one disk and yet the entire solution seems to be focussed on this. If the problem statement is updated to say that some customers also want a dynamic number of disks and the ability to change the number and configuration of their disks after provisioning, then that would bring the problem and solution into alignment.

Throughout the test suite, there's expectations that we're operating on real disks [...]

That does not prevent the use of loopback devices though, right?

In any case it sounds like there are other use-cases that were missed from the original RFD that require multiple disks. So the ZFS case is just one among those. If the ZFS test suite was the only use-case I'd have suggested that it would be better use of time to fix that to be able to use loopback devices (assuming we don't have that support, since zfsonlinux seems to) rather than changing the cloud stack to support that one limited use-case. But it sounds like there are other use-cases that were not mentioned so the point is moot.

mgerdts commented 4 years ago

RFD has been published.

TritonDataCenter / rfd

RFD 154 Flexible disk space for bhyve VMs #115