The following flow chart describes the current SecureDrop Workstation preflight updater behavior:
Updates are performed sequentially.
The Qubes 4.2 GUI updater provides important new facilities that, in principle, give us the option to migrate to the OS-level updater:
It also allows for concurrency and should generally be significantly faster than the old update method.
The following flow chart describes what a "minimum viable product" migration to the Qubes 4.2 updater could look like. The details are subject to various team-level decisions, however.
(Much credit to @eaon for workshopping this with me.)
There are a few important decision-points that should influence any "MVP" migration to the 4.2 updater. Here are some we identified:
- Do we want to continue to enforce updates on login? This is an early design decision of the SecureDrop Workstation. If we want to instead rely more on the OS itself to nudge the user about updates, we may want to revisit this. The MVP sketch above assumes that we do not change this behavior for now.
- Do we want to continue to force an OS reboot on all `dom0` updates, or do we want to rely on Qubes OS itself to enforce reboots when required? The MVP sketch above assumes that we do not change this behavior for now.
- Do we want to preserve the separate `sdw-notify` script that pops up regularly to remind the user to complete updates if they have not done so? Do we want to advocate for update reminders as a Qubes OS level (upstream) feature to eventually replace it? The MVP sketch above does not cover the behavior of `sdw-notify`.
- If updates fail, do we want to continue to prevent launching the SecureDrop Client in all error states? The MVP sketch above does not go into detail on this; we could be a bit more selective on which failures we consider fatal.
- Do we want to preserve the current mechanics for running `dom0` Salt states or `sdw-admin --apply`? The MVP sketch above assumes that we do not, and that we instead use this opportunity to move towards a more well-defined mechanism for running clearly scoped migrations selectively.
- Do we want to invoke the updater GUI, or wrap the `qubes-vm-update` tool? The MVP sketch above assumes that we invoke the GUI, so that we don't have to effectively re-implement parts of it ourselves.
@eaon, please chime in if I'm missing stuff, if I misunderstood anything from our conversation, or otherwise. :) Beyond that, I think we're nearing the end of our current cycle commitment as far as the updater is concerned, and can move most of the in-depth discussion of these questions into the planning phase for the next cycle.
I think you covered everything we talked about, thanks a lot for summarising it in a concise manner!
The only thing I'd want to expand on at this stage (to give others the chance to think about it):
If updates fail, do we want to continue to prevent launching the SecureDrop Client in all error states? The MVP sketch above does not go into detail on this; we could be a bit more selective on which failures we consider fatal.
I think given how SDW works, I would continue to treat update failures as fatal. We can't know if the updates are security updates or not, and I'd err on the side of caution. However: there is a potential to allow the client to be started with limited functionality: offline-only, showing messages as normal but only file export is enabled. (We could go even more granular: offline-only only being applied if the whonix update failed)
To elaborate: opening or printing files (same thing) sent by sources when we know there may be a security update waiting for us is probably a bad idea, even if the chances that the file someone is working with in that session has that exact exploit are low. Simultaneously, leaving our users with no access to their data is less than ideal UX. How much effort it would be to implement this in the client I don't know, but if it's straightforward, I'd love it if we considered it.
Back to the updater though: my understanding of Qubes Updater is that far more often than not we will know that there are updates, or know that there were none at the time of the most recent staleness check (which may be a subtle detail we want to discuss too), rather than learning about them every day when we enforce an update attempt. So by outsourcing our "check for updates and install them" behaviour to Qubes Updater, we will have a dramatically lower failure rate: we just don't need to run a whole "check for updates and install them" command as often, since it will happen automagically in the background.
@eaon, I largely agree with you in https://github.com/freedomofpress/securedrop-workstation/issues/899#issuecomment-1587555406 and in turn with the MVP positions @eloquence suggests in https://github.com/freedomofpress/securedrop-workstation/issues/899#issuecomment-1583560608.
If updates fail, do we want to continue to prevent launching the SecureDrop Client in all error states? The MVP sketch above does not go into detail on this; we could be a bit more selective on which failures we consider fatal.
I think given how SDW works, I would continue to treat update failures as fatal. We can't know if the updates are security updates or not, and I'd err on the side of caution.
I agree with this too.
However: there is a potential to allow the client to be started with limited functionality: offline-only, showing messages as normal but only file export is enabled. (We could go even more granular: offline-only only being applied if the whonix update failed)
To elaborate: opening or printing files (same thing) sent by sources when we know there may be a security update waiting for us is probably a bad idea, even if the chances that the file someone is working with in that session has that exact exploit are low. Simultaneously, leaving our users with no access to their data is less than ideal UX. How much effort it would be to implement this in the client I don't know, but if it's straightforward, I'd love it if we considered it.
This less-than-ideal UX is the current behavior. I believe it's justified by an (admittedly distant) extension of the same logic: even the export/print flows could be compromised by vulnerabilities that pending updates might fix.
If we should decide to break with that cautious approach, I think we could come up with a way (a feature and/or QubesDB value) to expose the system/VM's update state to `sd-app` in order to put the Client into a particular mode. But I'd want to map that out very carefully: Is this system-level lockdown mode the same as the application-level offline mode? If not, how are they different?
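As a purely illustrative sketch of that QubesDB idea: the feature name `sdw-updates-ok` is invented here, and this assumes Qubes' behaviour of mirroring `vm-config.*` features into a VM's QubesDB, so nothing below is an existing SDW mechanism.

```python
import subprocess

# dom0 side: record the outcome of the last update run as a qvm-feature.
# Features prefixed with "vm-config." are mirrored into the VM's QubesDB
# under /vm-config/, so sd-app could read the value locally.
# "sdw-updates-ok" is a made-up name, used only for illustration.
def publish_update_state(ok: bool) -> None:
    subprocess.run(
        ["qvm-features", "sd-app", "vm-config.sdw-updates-ok", "1" if ok else "0"],
        check=True,
    )

# sd-app side: the Client could read the flag and fall back to an
# offline-only / limited-functionality mode when it isn't "1".
def updates_ok() -> bool:
    result = subprocess.run(
        ["qubesdb-read", "/vm-config/sdw-updates-ok"],
        capture_output=True, text=True, check=False,
    )
    return result.returncode == 0 and result.stdout.strip() == "1"
```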
export/print flows could be compromised by vulnerabilities that pending updates might fix
I think SecEng input would be useful here (poke @lsd-cat), but I don't think I agree with the export bit. Printing, yes, because in many (if not most) cases printing involves opening and converting the file, but exporting it as implemented now is really just copying the file to another disk. My understanding is that the exploit surface there is very low? At least for the SDW device, and for downstream devices we can't do anything either way. If we'd do some malware checks beforehand then that'd change the calculation, but I'm not sure I'm into that idea personally.
With regards to implementation though, yep, no question about needing to be careful.
Depending on whether exporting also involves archiving/unarchiving files with arbitrary or controllable names, I tend to agree that the attack surface of a copy operation (given that the copy operation is the only thing that happens and paths are safely handled) is practically none. For the printing flow, it depends a lot on what gets printed and how, but I do not have enough knowledge of our processes to assess that now.
Here are my findings, building upon previous discussions.
I'll start off with some good news. As expected, the 4.2 updater has the potential to replace most of the SDW Launcher functionality with some minor tweaks. The only thing it can't do is multi-VM stuff like migrations and system state validation.
The 4.2 Qubes Updater has a more pleasant UI; importantly, it doesn't rely on Salt and it shows progress updates. Here are some other relevant aspects:
More importantly for us:
- `--restart` forces restarts of app qubes when their template has been updated
- `--update-if-stale` replaces the logic for checking whether a qube hasn't been updated in a while. Important: the minimum value is 1 day, while SDW currently checks every 8h. We may need to reevaluate this.
- `--targets` allows us to specify which VMs we want. It has some limitations: it is incompatible with `--update-if-stale` and doesn't allow us to target by tag (like policies do).

So in summary, the updater could replace all the non-migration-related upgrades, including restarting app qubes and ensuring no app qube goes without updates for more than N days. On top of this, it provides a user interface with progress bars!
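To make that concrete, here is a minimal sketch (not part of the discussion above) of how a launcher could drive the CLI tool with these flags; the target names are placeholders and the comma-separated `--targets` syntax is an assumption.

```python
import subprocess

# Placeholder target names; the real SDW template/VM set differs.
SDW_TARGETS = ["sd-small-bookworm-template", "sd-large-bookworm-template", "sys-usb"]

def update_sdw_vms(targets: list[str]) -> bool:
    """Run the 4.2 CLI updater against specific qubes and restart their
    app qubes afterwards; success is judged by the exit code."""
    result = subprocess.run(
        ["qubes-vm-update", "--restart", "--targets", ",".join(targets)],
        check=False,
    )
    return result.returncode == 0
```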
I have built upon the previous updater flow and removed some parts that the updater can now do for us:
Changes since the previous design:
**Move the system integrity check before the Qubes updater call**
If the system's integrity has been tampered with, it's not worth updating. This hasn't been implemented yet AFAIK (see #951). This is tracked in https://github.com/freedomofpress/securedrop-workstation/issues/293.
**(Re-)add the apply migration step**
The updater can't do this (yet), so I added it back. I made this a mutually exclusive step with the system integrity check, because a migration already enforces state, thus checking it.
**Add entry point: SDW Notify**
Neither the original design nor the updated one includes this, but looking at the code I came across this pre-flight checker entry point, which triggers when the system has been running for too long and should therefore be updated. This is important to include since it is what creates the requirement to force restarts of workstation-related app qubes.
**Remove the "Restart VMs if needed" step**
This can be enforced by the updater instead.
`--target` improvements:
- compatibility with the `--update-if-stale` parameter (i.e. updating the stale ones as well within the provided targets)
- targeting by keyword or tag (`@adminvm`, `@tag:sd-workstation`, etc.)
- the `--restart` option does not force VM restarts (VMs to restart show as unselected)

The updater has settings. In these settings, various values can be changed (e.g. how to handle "maybe" updates). We should check that, when overriding these values via CLI parameters, user preferences do not override them.
Hey @deeplow, this is great, thank you. Super great writeup and I like the flowchart.
A couple of follow-up questions I'm curious about your opinion on. The things we want to accomplish are: a) know if updates succeeded; b) know for which VMs updates failed, if any; and c) prevent the user from starting SDW VMs unless they are updated (ideally without depending on user interaction, i.e. without needing the user to explicitly click "Restart VMs" after the updater run is completed). Do you think we can do the above programmatically?
re the restart possible bug - When I try `qubes-vm-update --restart` with an appvm running whose template needs updates, IME the appvm does indeed restart at the end of the update process. However.... it restarts whether or not the template upgrade was successful :( So it's a different problem that I'm noticing. (I'm going to file an upstream ticket for that)
re the exit status: on the cli, if template updates fail I receive a non-zero exit status (2 in my case).
a) know if updates succeeded;
Via exit code, I'd say is the easiest way.
b) know for which VMs updates failed, if any
This is a bit more tricky since the updater doesn't report anything back to the terminal. We could check via qvm-prefs which ones have an update timestamp after the checks started. But I'd like to understand what creates this requirement. Is it for logging purposes? Or for telling the user something? (If it's the latter, I'd argue for that to be the updater's responsibility, and for the SDW launcher to only be concerned with whether or not all updates succeeded.)
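For (a) the exit code is enough; for (b), one rough approximation (not something the updater reports itself) is to check which target qubes still carry the `updates-available` feature after the run, assuming a successful update clears that flag.

```python
import subprocess

def vms_still_flagged(targets: list[str]) -> list[str]:
    """After an update run, return the target qubes whose 'updates-available'
    feature is still set. This assumes a successful template update clears
    the flag; it's a proxy, not an authoritative per-VM failure report."""
    flagged = []
    for vm in targets:
        result = subprocess.run(
            ["qvm-features", vm, "updates-available"],
            capture_output=True, text=True, check=False,
        )
        if result.stdout.strip():
            flagged.append(vm)
    return flagged
```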
c) prevent the user from starting SDW VMs unless they are updated (ideally without depending on user interaction, i.e. without needing the user to explicitly click "Restart VMs" after the updater run is completed). Do you think we can do the above programmatically?
I think doing this goes beyond what the updater can handle. We're talking here about preventing the user from going to the menu and starting app qubes on their own, so I'd say this is out of scope for the updater. I think it's entirely possible, though. Maybe with a modal blocking the screen until updates are performed (I did something a bit like this in the past in my onboarding tutorial). Or perhaps another approach is to have a program in the VMs that shuts them down immediately after starting if updates were not applied recently (with UX considerations, of course); see the rough sketch below. Anyway, it's entirely possible.
Shall I create an issue for that?
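A very rough sketch of that second idea, with every name hypothetical: an in-VM guard that powers the qube off unless dom0 has published a recent update timestamp, for instance via the `vm-config.*` mechanism sketched earlier. No such key exists today.

```python
import subprocess
import time

MAX_AGE_SECONDS = 8 * 3600  # mirror the current 8-hour staleness window

def shutdown_if_stale() -> None:
    """Hypothetical in-VM guard: power off the qube unless dom0 has
    published a recent update timestamp (epoch seconds) in QubesDB.
    The /vm-config/sdw-last-update key is made up for this sketch."""
    result = subprocess.run(
        ["qubesdb-read", "/vm-config/sdw-last-update"],
        capture_output=True, text=True, check=False,
    )
    try:
        last_update = int(result.stdout.strip())
    except ValueError:
        last_update = 0
    if time.time() - last_update > MAX_AGE_SECONDS:
        subprocess.run(["sudo", "poweroff"], check=False)
```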
re the exit status: on the cli, if template updates fail I receive a non-zero exit status (2 in my case).
Interesting. I might have to check the code paths because that's not what I am getting. I just updated one template and killed it immediately, triggering the updater to mark that one with a :x: and the exit code was zero.
re the restart possible bug - When I try qubes-vm-update --restart with an appvm running whose template needs updates, IME the appvm does indeed restart at the end of the update process. However.... it restarts whether or not the template upgrade was successful :( So it's a different problem that I'm noticing. (I'm going to file an upstream ticket for that)
Thanks for filing it. Linking it here (https://github.com/QubesOS/qubes-issues/issues/9031) just so we don't lose track of it.
I see what's happening here. I'm using the GUI (`qubes-update-gui`) and you're using the CLI (`qubes-vm-update`). So it turns out that with the GUI, the `--restart` option doesn't work. I filed an upstream issue on this as well: https://github.com/QubesOS/qubes-issues/issues/9032.
We also have a choice to make, which I had not presented before: we can choose to use the GUI updater program or the CLI. If we go with the GUI, the UI is already done and it is transparent to users what's happening.
On the other hand, if we go with the CLI, we may get some feedback on what the updater is doing, but the user will be clueless as to what's happening in the background (akin to how the SDW launcher currently works).
We could also have a middle-ground solution of using the CLI updater and surfacing some of its progress. But that would be reinventing the wheel a bit.
Yeah, I was specifically testing out the cli because you'd mentioned you'd been testing the GUI more heavily, but aside from how the user interacts, I would expect ~ behaviour parity between the two.
Right now, we do basically restrict the user from launching the client unless VMs are updated, and that's the main purpose of our updater. (The desktop shortcut launches the client directly if the update timestamp is recent enough, or the updater otherwise, and we don't include a menu entry for launching the client. Of course, a very determined user could start an sd-app terminal and type `securedrop-client` to bypass this, but that is clearly not the happy path, and our main goal is to make the easy/happy path have as little room for user error/choice around updates as possible.)
So my hope would be to invoke the Qubes updater in the same manner. (User tries to launch the client -> updates begin automatically if required, client launches iff updates successful.) I agree that it's harder to use the GUI to implement conditional logic (and it makes sense to me that a gui app would always exit with 0 unless the gui portion itself crashed), but I'm hopeful we can either use the cli the same way, and/or read from prefs as you suggest and re-launch the GUI updater if we decide that's better, but either way recreate the flow that for now, non-updated vms means the client won't launch.
(fwiw: the gui doesn't seem to miss the "restart" option for me; maybe we can briefly touch base about what you're seeing?)
Yeah, I was specifically testing out the cli because you'd mentioned you'd been testing the GUI more heavily,
Good thinking. Thanks for that!
Right now, we do basically restrict the user from launching the client unless VMs are updated, and that's the main purpose of our updater.
Great! Then this should be solved already.
So my hope would be to invoke the Qubes updater in the same manner. (User tries to launch the client -> updates begin automatically if required, client launches iff updates successful.)
That's also what I hope we can accomplish! :slightly_smiling_face:
Had a pairing chat with @deeplow today, just putting our notes/work plan here. He plans to take point on this issue, and I will be available for rubber-ducking/review/assistance throughout. We discussed that the next step will be a pseudocode implementation that preserves the underlying logic (below), and if there's any big changes or hiccups we'll catch it early.
update logic:
- [on startup] remove reboot flag if set
- [user clicks sd desktop icon]
- read last update status from disk. ("Success", "Fail", "Reboot req'd")
- check reboot flag if so prompt for reboot
- if failure then launch updater
- check network access (no network: updater should fail immediately)
- network access: try updating (meaning: update dom0's packages; update vms's packages)
- updates succeed and no reboot needed:
- success timestamp written to disk, success status written to disk
- launch client
- updates succeed and reboot needed (dom0 update):
- reboot flag written to disk, update success timestamp written to disk
- success screen prompts user to reboot
- updates fail: error
- error status written to disk
- log error (launcher, launcher-detail)
- report to user
- if success (good update and update timestamp present and done inside our time window) then launch the client
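As a sanity check of that logic, here is a rough Python sketch; the file paths, status strings, and helpers are illustrative rather than the launcher's actual implementation, and the staleness-window check is elided.

```python
import subprocess
from pathlib import Path

# Illustrative state files; the real launcher keeps its own.
STATE_DIR = Path.home() / ".securedrop_launcher"
STATUS_FILE = STATE_DIR / "last_update_status"   # "Success" / "Fail"
REBOOT_FLAG = STATE_DIR / "reboot_required"

def run_updater() -> None:
    """Placeholder for the update flow sketched in the notes above."""
    raise NotImplementedError

def prompt_for_reboot() -> None:
    """Placeholder for the 'please reboot' screen."""
    raise NotImplementedError

def on_desktop_icon_click() -> None:
    status = STATUS_FILE.read_text().strip() if STATUS_FILE.exists() else "Fail"
    if REBOOT_FLAG.exists():
        prompt_for_reboot()
        return
    if status != "Success":
        run_updater()  # rewrites STATUS_FILE / REBOOT_FLAG as it goes
        status = STATUS_FILE.read_text().strip() if STATUS_FILE.exists() else "Fail"
    if status == "Success" and not REBOOT_FLAG.exists():
        # Launch the client in sd-app only when updates are known-good;
        # the time-window check from the notes above is omitted here.
        subprocess.run(["qvm-run", "sd-app", "securedrop-client"], check=False)
```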
Open questions/discussed and agreed on points:
Explicitly out of scope for this feature:
(instead of one big `provision all` script we can have sub-components)

Thanks for the summary @rocodes. Per your suggestion, I'll implement the CLI until the GUI bugs are fixed. I can't foresee many complications with the original plan. So this is what I think the changes in this first iteration will look like:
- [on startup] remove reboot flag if set
- [user clicks sd desktop icon]
- read last update status from disk. ("Success", "Fail", "Reboot req'd")
- check reboot flag if so prompt for reboot
- if failure then launch updater
- check network access (no network: updater should fail immediately)
- - network access: try updating (meaning: update dom0's packages; update vms's packages)
+ - if network access:
+ - check for dom0 updates, then if they're available, try updating dom0 (via a native qubes update mechanism)
+ - if fails: report error and bail, don't continue with the rest of the update attempt
+ - elif succeeds:
+ - write dom0 needs_reboot status to disk
+ - if needs_migration:
+ - do migration/orchestration stuff (native qubes updater is not responsible for this)
+ - continue
+ - check for vm updates via native qubes update mechanism.
+ - targets: sd-templates, sys-usb, sys-net, sys-firewall and dom0
+ - write status to disk
+ - --restart
- updates succeed and no reboot needed:
- success timestamp written to disk, success status written to disk
- launch client
- updates succeed and reboot needed (dom0 update):
- reboot flag written to disk, update success timestamp written to disk
- success screen prompts user to reboot
- updates fail: error
- error status written to disk
- log error (launcher, launcher-detail)
- report to user
- if success (good update and update timestamp present and done inside our time window) then launch the client
One side-effect of this is that it will update the templates no matter what. We're not telling the updater to selectively update only the ones which are stale (`--update-if-stale`). This is due to two reasons:
- `--update-if-stale` is incompatible with `--targets`, as I had noted above

Handling migration (switching base templates, things covered by qvm-prefs, etc, aka "orchestration") - we won't aim to solve that with this right now
Actually, we may need to include part of this in scope. When there is a need for migration, does it happen before the update or after the update? And how do we get notified of a migration need? If it's through dom0 then we may hit the case where we need to (1) update dom0 (2) detect migration (3) migrate and (4) apply updates to VMs. Otherwise, newly migrated VMs may not have updates applied, which is especially important to ensure since we're considering cloning the base templates.
Hi @deeplow, I came back to add the `restart` thing per our convo but looks like you did it already so ty :)
I intended the writeup to be implementation-agnostic, so that we could always refer back to it to understand the core logic (eg if we ever switch from cli to gui), but yes I agree with your additions, with one nitpick: dom0 updates and check should come before vm updates, sorry my semicolon list was working a little too hard in the above notes I think. (To your other point, this is where the migration/orchestration stuff will enter into the picture).
So if you're breaking that part down I think it's more like
[...]
- check network access (if no network access: updater should fail immediately)
- if network access:
- check for dom0 updates, then if they're available, try updating dom0 (via a native qubes update mechanism)
- if fails: report error and bail, don't continue with the rest of the update attempt
- elif succeeds:
- write dom0 needs_reboot status to disk
- if needs_migration:
- do migration/orchestration stuff (native qubes updater is not responsible for this)
- continue
- check for vm updates via native qubes update mechanism. targets: sd-templates, sys-usb, sys-net, sys-firewall
- write status to disk
- --restart
I think in the best case, we wouldn't restart the vms/templates if no updates were applied, but that's an eventual optimization that is not essential/important compared to the overall logic + consistency (and maybe it's something we can eventually get from upstream).
The `needs_migration` part currently goes like this: if we need to do any 'orchestration' in dom0 and force a full run of our provisioning logic, we drop a (true) flag file during dom0 updates (installed basically in the postinst of the new dom0 rpm), which is then picked up by the updater to decide if the whole Salt update will run.
If we can accomplish this all in one updater run, that's great, but the way I look at it, it's two separate runs, because there may be an 'orchestration' stage in between.
For the sake of scoping this feature, I think it makes sense to leave that migration stuff as-is for now and just focus on replacing our update logic with (a) call(s) to the Qubes updater.
We will be able to improve on the migration part as we make more progress on other areas of our architecture overhaul (for example, if more things are in packages, fewer things require a migration; and if orchestration is a bit more modular then we don't have to use the blunt instrument of "run all provisioning" vs "run no provisioning," but we can be a bit more granular about it).
It makes sense to do the dom0 update part first. It's actually in line with my comment above:
If it's through dom0 then we may hit the case where we need to (1) update dom0 (2) detect migration (3) migrate and (4) apply updates to VMs.
So in the end we'll probably need to do two updater calls.
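Sketching the two-call shape under the assumptions discussed here: the migration flag path and target names are invented for illustration, while `qubes-dom0-update`, `sdw-admin --apply`, and `qubes-vm-update` are the existing tools mentioned above.

```python
import subprocess
from pathlib import Path

# Hypothetical flag dropped by the dom0 RPM's postinst when a migration is needed.
MIGRATION_FLAG = Path("/var/run/sdw-migration-required")

# Illustrative targets; the real list would be the SDW templates and system VMs.
SDW_TARGETS = ["sd-small-bookworm-template", "sys-usb", "sys-net", "sys-firewall"]

def update_dom0() -> bool:
    """First updater call: dom0 packages via the native tooling."""
    return subprocess.run(["sudo", "qubes-dom0-update"], check=False).returncode == 0

def update_workstation() -> bool:
    if not update_dom0():
        return False  # report error and bail; don't continue with VM updates
    if MIGRATION_FLAG.exists():
        # Orchestration step between the two updater runs.
        subprocess.run(["sudo", "sdw-admin", "--apply"], check=False)
        MIGRATION_FLAG.unlink(missing_ok=True)
    # Second updater call: workstation templates and system VMs, restarting
    # the affected app qubes afterwards.
    vm_run = subprocess.run(
        ["qubes-vm-update", "--restart", "--targets", ",".join(SDW_TARGETS)],
        check=False,
    )
    return vm_run.returncode == 0
```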
I have updated my proposed changes above with your notes.
I have updated the diagram according to these new changes (plus I added the "check if we should update" part).
Thinking a bit more about the dom0-updates-first part: this has some UX implications which require adapting the current user interface. In particular, I think we'll have to:
Do we want the updater to be fully unattended? Just click start (and grab a cup of coffee, etc.)?
I have made some mock ups:
I'm thinking that the dom0 updates could be handled internally without exposing anything to the user. This means that what used to be a continuous progress bar should probably go away.
Looks good! It would be good to add error states in there as well.
Qubes 4.2 ships with a new GUI and CLI updater that no longer relies on Salt to update VMs. This means that our existing update method no longer works, but also presents an opportunity to more heavily rely on Qubes' built-in update facilities. The purpose of this issue is to:
(Issue placeholder fleshed out by @eloquence, 2023-06-08)