
RFD 148 Snapper: VM snapshots #109

Open mgerdts opened 6 years ago

mgerdts commented 6 years ago

This is for discussion of

RFD 148 VM Snapshots

https://github.com/joyent/rfd/blob/master/rfd/0148/README.md

mgerdts commented 6 years ago

@twhiteman asked via chat and I answered

  1. What happens when Manta is not available (e.g. COAL, some Triton customers)? Is there some fallback for that case?

TBD.

  2. Manta snapshot storage dir - could that be placed in the Manta account of the customer, e.g. ~/toddw/.snapshots/vms/$UUID/ ? That way they would be charged for the storage of their snapshots without additional billing changes, and they could also manage their snapshots through manta operations.

There is a concern about receiving snapshots from untrusted sources. Unless we have RFD 14 implemented, we can't put the image somewhere that the customer could tamper with it.

  3. After further reading, I'm wondering whether it makes sense for snapshots to become IMGAPI images. What differentiates this from the existing CreateImageFromMachine API?

For 3, cf. https://apidocs.joyent.com/cloudapi/#CreateImageFromMachine

IMGAPI images (I think) only cover the boot disk and require that the guest be rebooted to run a prepare-image script. That being said, it may make a lot of sense to extend IMGAPI to cover this use case.
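For reference, the existing call is roughly shaped like this (endpoint and field names from the CloudAPI docs linked above; the values are placeholders and request details are elided):

    CreateImageFromMachine (POST /:login/images)
        machine:  <uuid of the instance to image>
        name:     my-custom-image
        version:  1.0.0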

jussisallinen commented 6 years ago

@mgerdts The subject has the wrong RFD number; Snapper: VM snapshots is RFD 148.

papertigers commented 6 years ago

What will we do about snapshots that are extremely large? Say a customer has an instance of 1 TB or larger. Will we be able to reliably send such large snapshots to manta? Will the customer also be charged for data usage in manta?

mgerdts commented 6 years ago

On Wed, Jul 18, 2018 at 12:50 PM, Michael Zeller notifications@github.com wrote:

What will we do about snapshots that are extremely large? Say a customer has an instance of 1 TB or larger. Will we be able to reliably send such large snapshots to manta?

We will have to experiment with how reliable that is. The stories I've heard in the past about replication failures with large ZFS streams have generally involved sending data over long distances with questionable networks in the middle. When transfers involve terabytes of data and the network introduces enough bit flips, errors that TCP's checksum cannot detect become common enough to worry about.

Customers that want quick snapshots (plus uploads) of such large streams should probably get into the practice of sending snapshots frequently, so that each delta stays small. We may even want to (in a future project) offer a replication strategy that makes it possible to set such a schedule automatically. This scheme would favor replication to a live pool rather than to Manta.

Receiving into a live pool also has an advantage for the "failure with huge streams" problem. From zfs(1M):

       -s  If the receive is interrupted, save the partially received state,
           rather than deleting it.  Interruption may be due to premature
           termination of the stream (e.g. due to network failure or failure
           of the remote system if the stream is being read over a network
           connection), a checksum error in the stream, termination of the zfs
           receive process, or unclean shutdown of the system.

           The receive can be resumed with a stream generated by zfs send -t
           token, where the token is the value of the receive_resume_token
           property of the filesystem or volume which is received into.

           To use this flag, the storage pool must have the extensible_dataset
           feature enabled.  See zpool-features(5) for details on ZFS feature
           flags.
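As a concrete sketch of that resume path (dataset names here are placeholders, not what snapper would actually use):

    # receive with -s so partial state is kept if the transfer dies
    zfs send zones/<vm-uuid>@snap1 | zfs receive -s backup/<vm-uuid>

    # after an interruption, fetch the resume token on the receiving side
    zfs get -H -o value receive_resume_token backup/<vm-uuid>

    # resume the send from that token
    zfs send -t <token> | zfs receive -s backup/<vm-uuid>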

Will the customer also be charged for data usage in manta?

I would expect that any "per GiB per month" charge for snapshots would essentially be there to cover the cost of Snapper's Manta usage. Since this would likely not be in a customer's account, normal Manta billing would not do the trick.

marsell commented 6 years ago

One minor concern I have is regarding the Manta paths for snapshots. Overall I like the scheme, but there's one (potentially) common use-case it would flake out on: regular database snapshots.

Regularly snapshotting a database is a good idea, and since those snapshots will hopefully never be used, there's a monetary incentive to stick to incremental. This will result in a very deep directory structure.

I don't know what Manta's directory path limit in chars is, but in practice HTTP headers over 8K are asking for trouble.
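To make that concrete, here is a purely hypothetical layout (not necessarily what the RFD proposes) in which each incremental is stored under its source, so the path grows by one component per snapshot:

    /<account>/stor/snapshots/<vm-uuid>/full/incr-0001/incr-0002/.../incr-8760/stream.zfs

A year of hourly incrementals chained that way is thousands of path components, which is where the HTTP header size concern comes in.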

mgerdts commented 6 years ago

@marsell said:

Regularly snapshotting a database is a good idea, and since those snapshots will hopefully never be used, there's a monetary incentive to stick to incremental. This will result in a very deep directory structure.

I'm not so sure the monetary incentive is to always take incrementals from the latest snapshot, as that means you can never remove any snapshot except the latest. If the source of an incremental can be chosen, it would allow a scheme like a monthly full, daily incrementals from the monthly full, and hourly incrementals from the previous daily or hourly.
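For illustration only (dataset and snapshot names are made up), choosing the source per tier would look like:

    # monthly full
    zfs send zones/<vm-uuid>@monthly-2018-07
    # daily incremental, always taken against the monthly full
    zfs send -i @monthly-2018-07 zones/<vm-uuid>@daily-2018-07-18
    # hourly incremental, taken against the previous daily or hourly
    zfs send -i @daily-2018-07-18 zones/<vm-uuid>@hourly-2018-07-18T01

With that layout, an old month's dailies and hourlies can be removed once a newer monthly full exists, without breaking the chain for anything newer.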

I don't know what Manta's directory path limit in chars is, but in practice HTTP headers over 8K are asking for trouble.

Quite a valid point here. Presuming we use IMGAPI, we can probably leverage whatever support it already has for not deleting images that have children. The proposed hierarchy is clearly not the only way to accomplish this.

rmustacc commented 6 years ago

Thanks for starting this. I realize that this is an early draft; however, I think it would be really helpful to get the next round of details fleshed out here, as there are a lot of open problems whose answers will change things dramatically once we have a better understanding.

I think there are a couple of different classes of issues that are worth discussing:

User Visible Aspects

First, while I understand the differentiation and practicality of full versus incremental versus differential snapshots, it's not clear how we're going to clearly articulate this to a customer. It would really help to get a better sense of the UI and the actual API endpoints that are going to be visible. It's not clear whether I can take snapshots of individual disks, datasets, everything, or nothing, or how users will really get a sense of the differences between them.

Next, I have a bunch of questions about when snapshots can be taken. Does the instance have to be powered on or powered off? If it's powered on, how do we make sure that the guests have properly quiesced their disk state so that it makes sense to take a snapshot?

One of the main points of the introduction is that this is supposed to take a snapshot of the VM's metadata. How does that work? Which metadata are we taking a snapshot of, and which not? If we're rolling back an instance, are we also creating and destroying datasets on the host? What about things like NICs, CNS names, and other context? The on-disk state probably only makes sense in the context of everything else. For example, servers will have on-disk configuration related to the network setup that drives services. If I roll back to an older image, what if that IP address is no longer available? All in all, I think this really deserves a lot more thought in the RFD.

Storage of Snapshots

In most cases, writing to Manta will be done over the WAN. I think the RFD is currently way too optimistic about performance over the WAN for long, extended transfers. While MPU may help us with this, if we're realistically talking about 1-2 TB transfers, that's going to take a long time to transit the WAN, even if we can, say, get 100 Mbit/s.
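For rough context, back-of-envelope numbers at a sustained 100 Mbit/s, ignoring retries and protocol overhead:

    1 TB = 8 x 10^12 bits / 10^8 bit/s = 80,000 s, roughly 22 hours
    2 TB: roughly 44 hours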

Conversely, the local storage discussion isn't as straightforward. A simple web server as discussed is probably not going to cut it (what happens when you run out of space) and you're going to have a pretty quick feature creep where on-prem will ask about things like NFS, CIFS, etc.

It's not clear from the section that talks about ZFS reservations how long those refreservations will exist for the snapshot. What happens if everyone wants to take a snapshot at the same time on a CN that doesn't have a lot of available capacity? Does something fail? If so, for whom? How does that come back and impact provisioning and DAPI? Will this temporary space allocation be made clear to DAPI?

Intersection with Image Creation

Folks are also probably going to want to ask whether they can take a snapshot and turn it into a new instance somehow. It might be worth addressing that to some extent, or making clear that it's mostly going to be punted on.

mgerdts commented 6 years ago

First, while I understand the differentiation and practicality of full versus incremental versus differential snapshots, it's not clear how we're going to clearly articulate this to a customer. It would really help to get a better sense of the UI and the actual API endpoints that are going to be visible.

I'll add specifics as to how I think https://apidocs.joyent.com/cloudapi/#CreateMachineSnapshot and related calls will be used.
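For reference, the existing CloudAPI snapshot endpoints (shapes only, taken from those docs; request and response details elided) are:

    CreateMachineSnapshot     POST   /:login/machines/:id/snapshots        {"name": "..."}
    ListMachineSnapshots      GET    /:login/machines/:id/snapshots
    GetMachineSnapshot        GET    /:login/machines/:id/snapshots/:name
    StartMachineFromSnapshot  POST   /:login/machines/:id/snapshots/:name
    DeleteMachineSnapshot     DELETE /:login/machines/:id/snapshots/:name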

It's not clear whether I can take snapshots of individual disks, datasets, everything, or nothing, or how users will really get a sense of the differences between them.

I think the introduction and the "Anatomy of a VM snapshot" section made that clear. In particular:

There are limitations. In particular, the following are not part of the snapshot.

These limitations match the limitations of snapshots currently supported with triton instance snapshot. Being able to snapshot all configuration items and revert back to them will likely have a lot of overlap with RFD 126. That is, we will need a PI-independent representation of the entire config.

Until such a time as we are able to roll back all configuration, do we need to block configuration changes while snapshots exist?

Next, I have a bunch of questions about when snapshots can be taken. Does the instance have to be powered on or powered off? If it's powered on, how do we make sure that the guests have properly quiesced their disk state so that it makes sense to take a snapshot?

It is a crash-consistent image. Use snapshots if and only if your file system and consumers of raw disk can withstand an unexpected power outage.

One of the main points of the introduction is that this is supposed to take a snapshot of the VM's metadata. How does that work? Which metadata are we taking a snapshot of, and which not? If we're rolling back an instance, are we also creating and destroying datasets on the host? What about things like NICs, CNS names, and other context? The on-disk state probably only makes sense in the context of everything else. For example, servers will have on-disk configuration related to the network setup that drives services. If I roll back to an older image, what if that IP address is no longer available? All in all, I think this really deserves a lot more thought in the RFD.

I think I already covered this above. These snapshots will have many of the same issues as the snapshots that we already support.

Storage of Snapshots: In most cases, writing to Manta will be done over the WAN. I think the RFD is currently way too optimistic about performance over the WAN for long, extended transfers. While MPU may help us with this, if we're realistically talking about 1-2 TB transfers, that's going to take a long time to transit the WAN, even if we can, say, get 100 Mbit/s.

Conversely, the local storage discussion isn't as straightforward. A simple web server as discussed is probably not going to cut it (what happens when you run out of space)

I had initially proposed having some infrastructure zones with delegated datasets (snapper zones). There would be a set (minimum two, more over time) per data center. We would leverage the migration code (RFD 34) to send the VM's dataset hierarchy to the delegated datasets of two snapper zones. The stream would be received into each snapper zone's delegated dataset.
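A minimal sketch of that flow, assuming made-up dataset names and using ssh purely as a stand-in for whatever transport the migration (RFD 34) machinery provides:

    # on the CN hosting the VM: send the instance's dataset hierarchy
    # into one snapper zone's delegated dataset, resumably and unmounted
    zfs send -R zones/<vm-uuid>@snap1 | \
        ssh <snapper-cn> zfs receive -s -u zones/<snapper-zone-uuid>/data/vms/<vm-uuid>

The same stream would be sent to a second snapper zone for redundancy.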

@twhiteman suggested that things would be much simpler if we relied on Manta to handle replication and maintenance of redundancy in the face of failures. Further discussion led to the idea that storage in Manta may lead to a lot of overlap with IMGAPI. That would contribute nicely to another customer request - the ability to deploy clones from snapshots.

| | Manta | Snapper |
| --- | --- | --- |
| Size limit of one VM's snapshots | Manta's limit | One snapper's delegated dataset size |
| Snapshot store needs more space | See Manta docs | Resize snappers or add new snappers and rebalance |
| Rebalance | Built in | Exercise for the developer |
| Snapshot host recovery | Automatic | Exercise for the developer |
| Avoid WAN limits | Deploy Manta in each data center (hard) | Deploy more snapper zones in the DC (easy) |
| Maintain redundancy | Built in | Exercise for the developer |
| Remove intermediate snapshots | Not possible | Trivial to support |
| Recover directly to any snapshot without extra data transfer | Not possible | Trivial to support |
| Recover from interrupted transfer | Not possible | Possible |
| Development effort | Minimal | Significant |

If we had some form of elastic storage, Snapper would become much more practical, because the per-snapper limitations would become much more flexible and resilience could be delegated to the elastic storage. Elastic storage is not this project.

We need clarity on the requirements to know which path we should be pursuing.

and you're going to have a pretty quick feature creep where on-prem will ask about things like NFS, CIFS, etc.

If storing to a file not in Manta, the expectation is that the customer's NFS server could be mounted on each CN. This project is in no way about providing NFS, SMB, etc. If using a CN's NFS client is for some reason problematic, then we may be at the point of requiring temporary space at least as large as the largest VM, plus the ability to use scp or similar to copy it off the host.

It's not clear from the section that talks about ZFS reservations how long those refreservations will exist for the snapshot. What happens if everyone wants to take a snapshot at the same time on a CN that doesn't have a lot of available capacity? Does something fail? If so, for whom? How does that come back and impact provisioning and DAPI? Will this temporary space allocation be made clear to DAPI?

Will clarify

Intersection with Image Creation: Folks are also probably going to want to ask whether they can take a snapshot and turn it into a new instance somehow. It might be worth addressing that to some extent, or making clear that it's mostly going to be punted on.

Will clarify

ghost commented 6 years ago

I haven't had a chance to fully read and understand the RFD and all the discussion, as it is quite large and complex. But I'm a massive fan of KISS, and as an end user of Triton and SmartOS in production on our cloud, the core MVP functionality we're after is simply:

Easy

This is basic functionality that is missing, which we already have on our non-Triton-based SmartOS cloud, where it works fine; AFAIK it's trivial to implement. Having this would be exceptionally helpful!

We don't use delegated datasets, but if we did, I imagine a "recursive" option for SmartOS zones that includes the delegated datasets would be handy, and the same for rollbacks. Otherwise, it would just snapshot the zone root.

Medium

Again, the above seems fairly straightforward and easily plugs a big gap in functionality.

Hard

It would be nice if Manta wasn't needed, as we have no intention of spinning up a Manta instance (and AFAIK Joyent doesn't currently have Manta in eu-ams-1, and support told me there were no plans to in the near term).

The KISS principle suggests to me that, for a non-Manta installation, images are pushed to storage on the headnode via a mechanism similar to whatever imgadm uses.

Hope the above is helpful.