Closed libre-man closed 3 months ago
Hello @libre-man, thanks for reaching out.
Sorry to hear that this particular Firecracker change is causing issues with your use-case. We explicitly tried to avoid this by posting our plans to make this change in Github discussions: https://github.com/firecracker-microvm/firecracker/discussions/4137 and a corresponding message on our public Slack workspace (unfortunately, I don't have a link to the latter, due to Slack retention time limitations).
Let me try and give you some background on why we took that decision. The data we were able to gather so far during the developer preview of the snapshot feature pointed to the fact that Firecracker users typically couple Firecracker snapshots with the Firecracker binary that created them. In other words, users tend to re-create their snapshots every time they switch Firecracker versions (both when upgrading and when rolling back).
We have found this practice to be beneficial for various reasons.
Thinking of the above, it makes sense that our customers want to have mechanisms in place to be operationally ready to re-create snapshots and we found out that that's what they typically do.
At the same time, with the previous state of affairs our test matrix was growing extremely complex: we needed to test all combinations of supported Firecracker versions. It was becoming increasingly difficult for us to ensure this matrix was fully covered, and doing so added a lot of code complexity for testing. Moreover, the book-keeping needed to track which feature was supported on which Firecracker version for snapshotting was not straightforward either.
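To give a rough sense of how such a matrix grows, here is a minimal sketch. It assumes, purely for illustration, that every supported version must load snapshots created by itself and by every older supported version; this is not a description of Firecracker's actual test setup. The function name `snapshot_test_pairs` and the placeholder version labels are hypothetical.

```python
from itertools import combinations_with_replacement


def snapshot_test_pairs(versions):
    """Return (created_by, loaded_by) pairs where loaded_by is the same or newer.

    Assumes `versions` is sorted from oldest to newest.
    """
    return list(combinations_with_replacement(versions, 2))


# Placeholder labels, not real release numbers.
supported = ["v1", "v2", "v3"]
pairs = snapshot_test_pairs(supported)
print(len(pairs))  # 6
```

Under this assumption, n supported versions give n(n+1)/2 pairs, and each pair is then multiplied by every feature whose snapshot support differs between versions.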
I hope this clarifies the reasoning of our decision.
Now, I would like to understand your use-case better to see if/how we can help you out better. More specifically,
Hi @bchalios,
Sorry, I was not monitoring the GitHub discussions, only the releases of Firecracker, which I guess is my own fault. We use snapshots for our platform (codegrade.com) for our automatic testing infrastructure. We let teachers set up a machine (=Firecracker VM) and we take a snapshot of that. Then when a student hands in code (and in some other cases too) we resume that snapshot, upload the files the student has submitted, and run the tests configured by the teacher. We want to keep the machine running, so we take a snapshot, to reduce latency for students as much as possible. Our internal target for this is under two seconds between submitting and getting the first output of the tests.
We also want all students to have exactly the same environment as one another, to make sure grading is done fairly.
We store the snapshots using btrfs as diffs to a base snapshot (this is slightly simplified), to make downloading as quick as possible. We furthermore cache these snapshots as btrfs subvolumes on the machines that run Firecracker VMs.
How often do you take new snapshots for your customers?
We never rerun setups if the teacher doesn't request this, primarily because teachers are not properly pinning dependencies, so rerunning might break their test setup (e.g. they install numpy, not numpy==1).
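As an illustration of the pinning problem, the sketch below flags requirement lines that are not pinned to an exact version. It is a simple heuristic, not a full requirements parser, and the function name `unpinned_requirements` and the sample package lines are hypothetical.

```python
import re

# Matches an exact pin such as "name==1.2.3"; wildcard pins like "1.*" do not count.
_PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]*\])?\s*==\s*[^\s*]+$")


def unpinned_requirements(lines):
    """Return requirement lines that are not pinned to an exact version."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        if not _PINNED.match(line):
            unpinned.append(line)
    return unpinned


print(unpinned_requirements(["numpy", "requests==2.31.0", "# tooling", "pytest>=7"]))
# ['numpy', 'pytest>=7']
```

A setup script whose dependencies all pass such a check is more likely to produce the same environment when rerun, although transitive dependencies can still drift.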
How often do you upgrade Firecracker versions? Is it on every Firecracker release? What is it that drives your decision to upgrade the Firecracker version you are using?
We try to upgrade every half year or so, but honestly firecracker has been very stable for us, so the urgency is often low.
Do you have mechanisms in place to roll-back your snapshots and/or Firecracker version if something goes wrong?
We try new versions out, but no; rolling back is a situation that we know is problematic but can't easily solve yet.
When you speak about "a performant manner of upgrading snapshots", what does performant mean for you?
Anything under 500ms is fine for us, but everything eats into this 2s budget, so the less the better. If it takes longer we would need to do some bulk process, but this is quite difficult in combination with btrfs diff send/recv.
We let teachers set up a machine (=Firecracker VM) and we take a snapshot of that
Is the setting up of a machine, a manual process operated by the teachers? Or is it something that you can run as a script? If it is the latter, you could rerun the setup even if the teacher does not request the setup.
Also, what sort of setup might the users do? Would they install anything on the root filesystem?
It is a script, but we are not sure that this script is pure, so running it twice might produce different results. This is especially the case when they install packages through things like pip. They do install stuff in the root filesystem.
We try to upgrade every half year or so, but honestly firecracker has been very stable for us, so the urgency is often low.
Right, so do your snapshots then live for more than 6 months? Would it be impossible to re-create them twice per year?
Yeah, they live for longer than that. Most teachers give their course once a year, so re-creating the snapshots twice a year would mean that often, when they come back for next year's course, the setup for the course is broken.
Ok, I see. Even so, this is a bit dangerous on your side, since there will always be the possibility that we introduce a change that breaks backwards snapshot compatibility (even with the old schema). It is true that with the new snapshot format these events will be more frequent, and I understand that this is frustrating compared to the previous situation, but you should have a mechanism in place for these cases regardless of how frequently snapshots break.
Another reason why we suggest you should have such a mechanism in place is security. We suggest to users to always keep applying security patches on systems running on top of Firecracker. Kernel patches, in particular, will require you to reboot your microVMs, which essentially means "get new snapshots".
All of the above comes from the majority of our users' experience in production. In other words, even when we did have stronger backwards compatibility guarantees, users tended to have systems in place to recreate the snapshots for all the reasons I mentioned. That is what guided us in simplifying the assumptions we make about snapshots and shaping the interface around them.
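A minimal sketch of what such a mechanism could look like, assuming each snapshot is stored together with the version of the Firecracker binary that created it. The `SnapshotRecord` type, the `snapshots_to_recreate` helper, and the snapshot names are hypothetical and not part of Firecracker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotRecord:
    name: str
    created_with: str  # version of the Firecracker binary that took the snapshot


def snapshots_to_recreate(records, running_version):
    """Return snapshots whose creating version differs from the running binary."""
    return [r for r in records if r.created_with != running_version]


records = [
    SnapshotRecord("course-a", "1.6.0"),
    SnapshotRecord("course-b", "1.7.0"),
]
stale = snapshots_to_recreate(records, running_version="1.7.0")
print([r.name for r in stale])  # ['course-a']
```

Comparing for exact equality mirrors the coupling described above: a snapshot is treated as valid only for the binary version that produced it, both when upgrading and when rolling back.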
With all this in mind, it seems that we cannot currently support this use case. I don't think we have a solution that expands the backwards compatibility snapshot policy without bringing back all the (code maintenance and testing) overhead and providing false expectations/promises about the lifespan of Firecracker snapshots.
That is a shame. The security of the kernels running in the VM isn't really that important for us: if students can use a kernel exploit they can also just solve the assignment. Backwards compatibility, on the other hand, is extremely important for us. This basically means we are now stuck on a Firecracker version that will no longer get security updates.
I hope more set-in-stone backwards compatibility guarantees can be given in the future, at which point we can start thinking about migrating.
Feature Request
Since Firecracker 1.7.0, snapshots taken with earlier versions can no longer be loaded. This leaves us stuck on Firecracker 1.6.0, as it is not possible for us to invalidate older snapshots.
Describe the desired solution
We would like a performant way to upgrade Firecracker snapshots to the newest version.
Describe possible alternatives
A possible alternative would be to simply keep supporting the old snapshot state files. The linked PRs don't state why this backwards-incompatible change was needed.
Additional context
For us, snapshots are the thing that makes Firecracker useful for our product, which makes the current approach of being able to break them quite a big issue. The lack of backwards compatibility is difficult to deal with, as simply restarting the VM breaks assumptions our customers make about snapshots. It was previously documented that snapshots are backwards compatible and, if at all possible, forward compatible.
Checks