freedomofpress / securedrop-workstation

Qubes-based SecureDrop Journalist Workstation environment for submission handling
GNU Affero General Public License v3.0

Investigate using Qubes updater in 4.2 #899

Closed: zenmonkeykstop closed this issue 5 months ago

zenmonkeykstop commented 1 year ago

Qubes 4.2 ships with a new GUI and CLI updater that no longer relies on Salt to update VMs. This means that our existing update method no longer works, but it also presents an opportunity to rely more heavily on Qubes' built-in update facilities. The purpose of this issue is to investigate that opportunity.

(Issue placeholder fleshed out by @eloquence, 2023-06-08)

eloquence commented 1 year ago

The following flow chart describes the current SecureDrop Workstation preflight updater behavior:

[Flowchart: current SecureDrop Workstation preflight updater flow]

  1. The updater starts upon user login
  2. If and only if updates have been applied within the last 8 hours, it skips ahead to launching the SecureDrop Client (this gate is sketched at the end of this comment).
  3. Otherwise, it forces an updater run that will perform operations in this order:
    • [Update] updates dom0
    • [State enforcement, used for some migrations] applies dom0 Salt states (always!)
    • [Full migration] if a migration flag is set, also runs sdw-admin --apply
    • [Update] Updates the following VMs via Salt: Fedora base template, SecureDrop Workstation large, SecureDrop Workstation small, Whonix gateway base template
    • Performs VM reboots
    • Stores various update metadata on-disk

Updates are performed sequentially.

  1. If dom0 updates were applied, enforces a reboot
  2. If reboot has been performed or no reboot required:
    • all good: launches the SecureDrop Client
    • any error: prints error message, does not launch SecureDrop Client
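For reference, the 8-hour gate in step 2 amounts to something like the following (a minimal sketch; the stamp file location and timestamp format are assumptions, not the actual launcher code):

```python
from datetime import datetime, timedelta
from pathlib import Path

# Assumed location of the launcher's last-update stamp (illustrative).
LAST_UPDATED_FILE = Path.home() / ".securedrop_launcher" / "sdw-last-updated"
UPDATE_INTERVAL = timedelta(hours=8)

def updater_run_needed() -> bool:
    """Force an updater run unless one succeeded in the last 8 hours."""
    try:
        stamp = LAST_UPDATED_FILE.read_text().strip()
        last_updated = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    except (OSError, ValueError):
        return True  # missing or unreadable stamp: run the updater
    return datetime.now() - last_updated > UPDATE_INTERVAL
```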
eloquence commented 1 year ago

The Qubes 4.2 GUI updater provides important new facilities that, in principle, give us the option to migrate to the OS-level updater:

It also allows for concurrency and should generally be significantly faster than the old update method.

The following flow chart describes what a "minimum viable product" migration to the Qubes 4.2 updater could look like. The details are subject to various team-level decisions, however.

[Flowchart: proposed "minimum viable product" migration to the Qubes 4.2 updater]

  1. Our now much slimmer preflight code runs upon user login and checks the update state of the VMs we keep track of (based on Qubes itself tracking this state; see the sketch at the end of this comment)
  2. If updates haven't been recently applied, we display a brief message to the user asking them to complete a system update.
  3. Once the user proceeds, we launch the 4.2 updater. Migrations can run as postinst code if the dom0 package is updated, using the mechanics described in https://github.com/freedomofpress/securedrop-updater/tree/main/migrations
  4. We check that the system is in the expected state (i.e. that expected VM updates have been completed, and that migrations have run successfully).
  5. We proceed much like in the current updater flow, except that we check if the expected VMs have in fact been restarted. If not, we do so before launching the SecureDrop Client.

(Much credit to @eaon for workshopping this with me.)
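For step 1, the slimmer preflight could read the per-qube update state that Qubes already tracks. A minimal sketch, assuming the qubesadmin Python API and the updates-available qvm-feature; the template names are placeholders:

```python
import qubesadmin

# Placeholder names for the templates the preflight would track.
TRACKED = {"fedora-template", "sd-small-template",
           "sd-large-template", "whonix-gw-template"}

def vms_with_pending_updates() -> list[str]:
    """List tracked qubes that Qubes itself has flagged as updatable."""
    app = qubesadmin.Qubes()
    return [
        vm.name
        for vm in app.domains
        # 'updates-available' is set by Qubes' own update-check machinery;
        # an empty/missing value means no known pending updates.
        if vm.name in TRACKED and vm.features.get("updates-available", "")
    ]
```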

eloquence commented 1 year ago

There are a few important decision points that should influence any "MVP" migration to the 4.2 updater. Here are some we identified:

@eaon, please chime in if I'm missing stuff, if I misunderstood anything from our conversation, or otherwise. :) Beyond that, I think we're nearing the end of our current cycle commitment as far as the updater is concerned, and can move most of the in-depth discussion of these questions into the planning phase for the next cycle.

eaon commented 1 year ago

I think you covered everything we talked about, thanks a lot for summarising it in a concise manner!

The only thing I'd want to expand on at this stage (to give others the chance to think about it):

If updates fail, do we want to continue to prevent launching the SecureDrop Client in all error states? The MVP sketch above does not go into detail on this; we could be a bit more selective on which failures we consider fatal.

I think given how SDW works, I would continue to treat update failures as fatal. We can't know if the updates are security updates or not, and I'd err on the side of caution. However, there is the potential to allow the client to be started with limited functionality: offline-only, showing messages as normal, but with only file export enabled. (We could go even more granular: offline-only being applied only if the Whonix update failed.)

To elaborate: opening or printing files (same thing) sent by sources when we know there may be a security update waiting for us is probably a bad idea, even if the chances are low that the file someone is working with that session contains that exact exploit. At the same time, leaving our users with no access to their data is less than ideal UX. How much effort it would be to implement this in the client, I don't know, but if it's straightforward, I'd love for us to consider it.

Back to the updater though: my understanding of the Qubes Updater is that, far more often than not, we will know that there are updates, or know that there were none at the time of the most recent staleness check (which may be a subtle detail we want to discuss too), rather than learning about them every day when we enforce an update attempt. So by outsourcing our "check for updates and install them" behaviour to the Qubes Updater, we will have a dramatically lower failure rate, since we won't need to run a whole "check for updates and install them" command as often; it will happen automagically in the background.
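To make the staleness check concrete, a minimal sketch (assuming the qubesadmin Python API and the last-updates-check feature that the updater's staleness logic is based on; the timestamp format is a guess):

```python
from datetime import datetime, timedelta
import qubesadmin

STALE_AFTER = timedelta(days=7)  # the threshold is ours to pick

def is_stale(vm) -> bool:
    """True if this qube hasn't checked for updates recently enough."""
    stamp = vm.features.get("last-updates-check", "")
    if not stamp:
        return True  # never checked: treat as stale
    checked = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")  # format assumed
    return datetime.now() - checked > STALE_AFTER

stale = [vm.name for vm in qubesadmin.Qubes().domains if is_stale(vm)]
```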

cfm commented 1 year ago

@eaon, I largely agree with you in https://github.com/freedomofpress/securedrop-workstation/issues/899#issuecomment-1587555406 and in turn with the MVP positions @eloquence suggests in https://github.com/freedomofpress/securedrop-workstation/issues/899#issuecomment-1583560608.

If updates fail, do we want to continue to prevent launching the SecureDrop Client in all error states? The MVP sketch above does not go into detail on this; we could be a bit more selective on which failures we consider fatal.

I think given how SDW works, I would continue to treat update failures as fatal. We can't know if the updates are security updates or not, and I'd err on the side of caution.

I agree with this too.

However, there is the potential to allow the client to be started with limited functionality: offline-only, showing messages as normal, but with only file export enabled. (We could go even more granular: offline-only being applied only if the Whonix update failed.)

To elaborate: opening or printing files (same thing) sent by sources when we know there may be a security update waiting for us is probably a bad idea, even if the chances are low that the file someone is working with that session contains that exact exploit. At the same time, leaving our users with no access to their data is less than ideal UX. How much effort it would be to implement this in the client, I don't know, but if it's straightforward, I'd love for us to consider it.

This less-than-ideal UX is the current behavior. I believe it's justified by (an admittedly distant) extension of the same logic: even the export/print flows could be compromised by vulnerabilities that pending updates might fix.

If we should decide to break with that cautious approach, I think we could come up with a way (a feature and/or QubesDB value) to expose the system/VM's update state to sd-app in order to put the Client into a particular mode. But I'd want to map that out very carefully: Is this system-level lockdown mode the same as the application-level offline mode? If not, how are they different?
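To sketch what that exposure could look like (the feature name is hypothetical; my understanding is that Qubes 4.2 mirrors features prefixed with vm-config. into the qube's QubesDB):

```python
# dom0 side: record the system's update state where sd-app can see it.
# 'sdw-update-state' is a hypothetical name, for illustration only.
import qubesadmin

app = qubesadmin.Qubes()
app.domains["sd-app"].features["vm-config.sdw-update-state"] = "updates-pending"
```

```python
# sd-app side: the Client could read the flag and drop into a limited mode.
import subprocess

state = subprocess.run(
    ["qubesdb-read", "/vm-config/sdw-update-state"],
    capture_output=True, text=True,
).stdout.strip()
offline_only = state == "updates-pending"
```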

eaon commented 1 year ago

export/print flows could be compromised by vulnerabilities that pending updates might fix

I think SecEng input would be useful here (poke @lsd-cat), but I don't think I agree with the export bit. Printing, yes, because in many (if not most) cases printing involves opening and converting the file; but exporting as implemented now is really just copying the file to another disk. My understanding is that the exploit surface there is very low? At least for the SDW device, and for downstream devices we can't do anything either way. If we did some malware checks beforehand, that would change the calculation, but I'm not sure I'm into that idea personally.

With regards to implementation though, yep, no question about needing to be careful.

lsd-cat commented 1 year ago

Provided that exporting doesn't also involve archiving/unarchiving files with arbitrary or controllable names, I tend to agree that the attack surface of a copy operation is practically none, given that the copy operation is the only thing that happens and paths are handled safely. For the printing flow, it depends a lot on what gets printed and how, but I do not have enough knowledge of our processes to assess that now.

deeplow commented 8 months ago

Here are my findings, building upon previous discussions.

I'll start off with some good news. As expected, the 4.2 updater has the potential to replace most of the SDW Launcher functionality with some minor tweaks. The only thing it can't do is the multi-VM work, like migrations and system-state validation.

Qubes 4.2 Updater Summary

The 4.2 Qubes Updater has a more pleasant UI; importantly, it doesn't rely on Salt, and it shows progress updates. Here are some other relevant aspects:

More importantly for us:

So in summary, the updater could replace all the non-migration-related upgrades, including restarting app qubes and ensuring no app qube goes without updates for more than N days. On top of this, it provides a user interface with progress bars!
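Concretely, the N-day policy plus restarts could collapse into a single CLI call, along these lines (a sketch; the 7-day value is just an example):

```python
import subprocess

def enforce_update_policy(stale_days: int = 7) -> bool:
    """Update any qube that hasn't checked for updates in `stale_days`
    days, then restart the qubes affected by those updates."""
    result = subprocess.run(
        ["qubes-vm-update", "--update-if-stale", str(stale_days), "--restart"]
    )
    return result.returncode == 0
```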

Updated 4.2 Flow Proposal

I have built upon the previous updater flow and removed some parts that the updater can now do for us:

[Flowchart: updated 4.2 updater flow proposal (Mermaid Live Editor)]

Changes since the previous design:

Qubes Updater Improvement Suggestions

Settings overriding possibility

The updater has settings, in which various values can be changed:

We should check that, when we override these values via CLI parameters, user preferences do not take precedence over them.

rocodes commented 8 months ago

Hey @deeplow, this is great, thank you. Super great writeup and I like the flowchart.

A couple of follow-up questions I'm curious for your opinion on. The things we want to accomplish are: a) know if updates succeeded; b) know for which VMs updates failed, if any; and c) prevent the user from starting SDW VMs unless they are updated (ideally without depending on user interaction, i.e. without needing the user to explicitly click "Restart VMs" after the updater run is completed). Do you think we can do the above programmatically?

re the possible restart bug - When I try qubes-vm-update --restart with an appvm running whose template needs updates, IME the appvm does indeed restart at the end of the update process. However... it restarts whether or not the template upgrade was successful :( So it's a different problem that I'm noticing. (I'm going to file an upstream ticket for that.)

re the exit status: on the cli, if template updates fail I receive a non-zero exit status (2 in my case).

deeplow commented 8 months ago

a) know if updates succeeded;

Via exit code: I'd say that's the easiest way, as in the snippet below.
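Something like this (a sketch; the target name is a placeholder):

```python
import subprocess

# Non-zero return code signals an update failure, per rocodes' test.
proc = subprocess.run(["qubes-vm-update", "--targets", "sd-example-template"])
status = "Success" if proc.returncode == 0 else "Fail"  # our on-disk statuses
```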

b) know for which VMs updates failed, if any

This is a bit trickier, since the updater doesn't report anything back to the terminal. We could check via qvm-prefs which ones have an update timestamp after the checks started. But I'd like to understand what creates this requirement. Is it for logging purposes? Or for telling the user something? (If it's the latter, I'd argue that should be the updater's responsibility, and the SDW launcher should only be concerned with whether or not all updates succeeded.)

c) prevent the user from starting SDW VMs unless they are updated (ideally without depending on user interaction, i.e. without needing the user to explicitly click "Restart VMs" after the updater run is completed). Do you think we can do the above programmatically?

I think doing this goes beyond what the updater can handle. We're talking here about preventing the user from going to the menu and starting app qubes on their own, so I'd say this is out of scope for the updater, but entirely possible. Maybe with a modal blocking the screen until updates are performed (I did something a bit like this in the past in my onboarding tutorial). Or perhaps another approach is to have a program in the VMs that shuts them down immediately after starting if updates were not applied recently (with UX considerations, of course; a rough sketch follows below).

Shall I create an issue for that?
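For the shutdown-guard idea, a rough sketch (everything here is hypothetical: the marker path, how updates would touch it, and the threshold):

```python
#!/usr/bin/env python3
# Hypothetical boot-time guard inside an SDW app qube: power off if the
# system hasn't been updated recently. Nothing provisions the marker file
# today; the update process would have to touch it.
import os
import subprocess
import time

MARKER = "/var/lib/sdw/last-update"  # hypothetical path
MAX_AGE = 8 * 60 * 60  # reuse the launcher's 8-hour window

def main() -> None:
    try:
        age = time.time() - os.stat(MARKER).st_mtime
    except FileNotFoundError:
        age = float("inf")
    if age > MAX_AGE:
        # A real version would notify the user before doing this.
        subprocess.run(["systemctl", "poweroff"])

if __name__ == "__main__":
    main()
```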

re the exit status: on the cli, if template updates fail I receive a non-zero exit status (2 in my case).

Interesting. I might have to check the code paths, because that's not what I'm getting. I just updated one template and killed it immediately, triggering the updater to mark that one with a :x:, and the exit code was zero.

re the possible restart bug - When I try qubes-vm-update --restart with an appvm running whose template needs updates, IME the appvm does indeed restart at the end of the update process. However... it restarts whether or not the template upgrade was successful :( So it's a different problem that I'm noticing. (I'm going to file an upstream ticket for that.)

Thanks for filing it. Linking it here (https://github.com/QubesOS/qubes-issues/issues/9031) just so we don't lose track of it.

I see what's happening here. I'm using the GUI (qubes-update-gui) and you're using the CLI (qubes-vm-update). It turns out that with the GUI, the --restart option doesn't work. I filed an upstream issue on this as well: https://github.com/QubesOS/qubes-issues/issues/9032.

Updater GUI or Updater CLI?

We also have a choice to make, which I had not presented before: we can use either the GUI updater program or the CLI. If we go with the GUI, the UI is already done, and it is transparent to users what's happening.

On the other hand, if we go with the CLI, we may get some feedback about what the updater is doing, but the user will be clueless as to what's happening in the background (akin to how the SDW launcher currently works).

We could also have a middle-ground solution of using the CLI updater and surfacing some of its progress, but that would be reinventing the wheel a bit.

rocodes commented 8 months ago

Yeah, I was specifically testing out the cli because you'd mentioned you'd been testing the GUI more heavily, but aside from how the user interacts, I would expect ~ behaviour parity between the two.

Right now, we do basically restrict the user from launching the client unless VMs are updated, and that's the main purpose of our updater. (The desktop shortcut launches the client directly if the update timestamp is recent enough, or the updater otherwise, and we don't include a menu entry for launching the client. Of course, a very determined user could start an sd-app terminal and type securedrop-client to bypass this, but that is clearly not the happy path, and our main goal is to make the easy/happy path leave as little room for user error/choice around updates as possible.)

So my hope would be to invoke the Qubes updater in the same manner. (User tries to launch the client -> updates begin automatically if required, client launches iff updates successful.) I agree that it's harder to use the GUI to implement conditional logic (and it makes sense to me that a GUI app would always exit with 0 unless the GUI portion itself crashed), but I'm hopeful we can either use the CLI the same way, and/or read from prefs as you suggest and re-launch the GUI updater if we decide that's better; either way, we recreate the flow where, for now, non-updated VMs mean the client won't launch.

(FWIW: the GUI doesn't seem to be missing the "restart" option for me; maybe we can briefly touch base about what you're seeing?)

deeplow commented 8 months ago

Yeah, I was specifically testing out the cli because you'd mentioned you'd been testing the GUI more heavily,

Good thinking. Thanks for that!

Right now, we do basically restrict the user from launching the client unless VMs are updated, and that's the main purpose of our updater.

Great! Then this should be solved already.

So my hope would be to invoke the Qubes updater in the same manner. (User tries to launch the client -> updates begin automatically if required, client launches iff updates successful.)

That's also what I hope we can accomplish! :slightly_smiling_face:

rocodes commented 7 months ago

Had a pairing chat with @deeplow today; just putting our notes/work plan here. He plans to take point on this issue, and I will be available for rubber-ducking/review/assistance throughout. We discussed that the next step will be a pseudocode implementation that preserves the underlying logic (below), so if there are any big changes or hiccups, we'll catch them early.

update logic:
    - [on startup] remove reboot flag if set
    - [user clicks sd desktop icon]
         - read last update status from disk. ("Success", "Fail", "Reboot req'd")
           - check reboot flag if so prompt for reboot
           - if failure then launch updater
               - check network access (no network: updater should fail immediately)
               - network access: try updating (meaning: update dom0's packages; update vms's packages)
                   - updates succeed and no reboot needed: 
                     - success timestamp written to disk, success status written to disk
                     - launch client
                    - updates succeed and reboot needed (dom0 update):
                     - reboot flag written to disk, update success timestamp written to disk
                     - success screen prompts user to reboot
               - updates fail: error
                 - error status written to disk 
                 - log error (launcher, launcher-detail)
                 - report to user
           - if success (good update and update timestamp present and done inside our time window) then launch the client
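As a self-contained Python skeleton of the same logic (file locations, status strings, and commands are placeholders taken from the notes above, not final decisions):

```python
import subprocess
from datetime import datetime, timedelta
from pathlib import Path

STATE_DIR = Path.home() / ".securedrop_launcher"  # placeholder location
STATUS_FILE = STATE_DIR / "sdw-update-status"     # "Success" | "Fail" | "Reboot req'd"
STAMP_FILE = STATE_DIR / "sdw-last-updated"
WINDOW = timedelta(hours=8)

def read(path: Path) -> str:
    try:
        return path.read_text().strip()
    except OSError:
        return ""

def recently_updated() -> bool:
    try:
        stamp = datetime.strptime(read(STAMP_FILE), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return False
    return datetime.now() - stamp < WINDOW

def run_updater() -> None:
    # Stand-in for the full updater run (network check, dom0 + VM updates,
    # status/timestamp writes, error reporting).
    subprocess.run(["qubes-vm-update", "--restart"])

def on_desktop_icon_click() -> None:
    status = read(STATUS_FILE)
    if status == "Reboot req'd":
        print("A reboot is required before continuing.")
    elif status == "Success" and recently_updated():
        subprocess.run(["qvm-run", "sd-app", "securedrop-client"])
    else:
        run_updater()
```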

Open questions/discussed and agreed on points:

Explicitly out of scope for this feature:

deeplow commented 7 months ago

Thanks for the summary @rocodes. Per your suggestion, I'll implement against the CLI until the GUI bugs are fixed. I can't foresee many complications relative to the original plan. So this is what I think the changes in this first iteration will look like:

    - [on startup] remove reboot flag if set
    - [user clicks sd desktop icon]
         - read last update status from disk. ("Success", "Fail", "Reboot req'd")
           - check reboot flag if so prompt for reboot
           - if failure then launch updater
               - check network access (no network: updater should fail immediately)
-               - network access: try updating (meaning: update dom0's packages; update vms's packages)
+               - if network access:
+                   - check for dom0 updates, then if they're available, try updating dom0 (via a native qubes update mechanism)
+                       - if fails: report error and bail, don't continue with the rest of the update attempt
+                       - elif succeeds:
+                           - write dom0 needs_reboot status to disk 
+                           - if needs_migration:
+                               - do migration/orchestration stuff (native qubes updater is not responsible for this)
+                               - continue    
+                   - check for vm updates via native qubes update mechanism.
+                     - targets: sd-templates, sys-usb, sys-net, sys-firewall and dom0
+                     - write status to disk 
+                     - --restart
                   - updates succeed and no reboot needed: 
                     - success timestamp written to disk, success status written to disk
                     - launch client
                   - updates succeed and reboot needed (dom0 update):
                     - reboot flag written to disk, update success timestamp written to disk
                     - success screen prompts user to reboot
               - updates fail: error
                 - error status written to disk 
                 - log error (launcher, launcher-detail)
                 - report to user
           - if success (good update and update timestamp present and done inside our time window) then launch the client

One side effect of this is that it will update the templates no matter what: we're not telling the updater to selectively update only the ones that are stale (--update-if-stale). This is for two reasons:

Handling migration (switching base templates, things covered by qvm-prefs, etc., aka "orchestration") - we won't aim to solve that with this right now

Actually, we may need to include part of this in scope. When there is a need for migration, does it happen before or after the update? And how do we get notified of a migration need? If it's through dom0, then we may hit the case where we need to (1) update dom0, (2) detect the migration, (3) migrate, and (4) apply updates to VMs. Otherwise, newly migrated VMs may not have updates applied, which is especially important to ensure since we're considering cloning the base templates.

rocodes commented 7 months ago

Hi @deeplow, I came back to add the restart thing per our convo but looks like you did it already so ty :)

I intended the writeup to be implementation-agnostic, so that we could always refer back to it to understand the core logic (e.g. if we ever switch from CLI to GUI), but yes, I agree with your additions, with one nitpick: the dom0 check and updates should come before the VM updates; sorry, my semicolon list was working a little too hard in the notes above. (To your other point, this is where the migration/orchestration stuff will enter the picture.)

So if you're breaking that part down I think it's more like

[...]
               - check network access (if no network access: updater should fail immediately)
               - if network access:
                   - check for dom0 updates, then if they're available, try updating dom0 (via a native qubes update mechanism)
                       - if fails: report error and bail, don't continue with the rest of the update attempt
                       - elif succeeds:
                           - write dom0 needs_reboot status to disk 
                           - if needs_migration:
                               - do migration/orchestration stuff (native qubes updater is not responsible for this)
                               - continue    
                   - check for vm updates via native qubes update mechanism. targets: sd-templates, sys-usb, sys-net, sys-firewall
                       - write status to disk 
                       - --restart

I think in the best case we wouldn't restart the VMs/templates if no updates were applied, but that's an eventual optimization that is not essential compared to the overall logic and consistency (and maybe it's something we can eventually get from upstream).

The needs_migration part currently goes like this: if we need to do any 'orchestration' in dom0 and force a full run of our provisioning logic, we drop a (true) flag file during dom0 updates (installed basically in the postinst of the new dom0 RPM), which the updater then picks up on to decide whether the whole Salt update will run.
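In code, that pickup step is essentially this (a sketch; the flag path is made up, and only sdw-admin --apply is from the current tooling):

```python
import subprocess
from pathlib import Path

MIGRATION_FLAG = Path("/var/lib/sdw/migration-required")  # hypothetical path

def maybe_migrate() -> None:
    """After dom0 updates, run full provisioning iff the RPM's postinst
    dropped the flag file."""
    if MIGRATION_FLAG.exists():
        subprocess.run(["sdw-admin", "--apply"], check=True)
        MIGRATION_FLAG.unlink()
```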

If we can accomplish this all in one updater run, that's great, but the way I look at it, it's two separate runs, because there may be an 'orchestration' stage in between them:

For the sake of scoping this feature, I think it makes sense to leave the migration stuff as-is for now and just focus on replacing our update logic with (a) call(s) to the Qubes updater.

We will be able to improve on the migration part as we make more progress on other areas of our architecture overhaul (for example, if more things are in packages, fewer things require a migration; and if orchestration is a bit more modular, then we don't have to use the blunt instrument of "run all provisioning" vs. "run no provisioning", and can be more granular about it).

deeplow commented 7 months ago

It makes sense to do the dom0 update part first. It's actually in line with my comment above:

If it's through dom0 then we may hit the case where we need to (1) update dom0 (2) detect migration (3) migrate and (4) apply updates to VMs.

So in the end we'll probably need to do two updater calls.

I have updated my proposed changes above with your notes.
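For the record, a sketch of what the two-call structure could look like (dom0 first, so a migration flag can arrive with the new dom0 package, then the VMs; the flag path and exact flags are assumptions):

```python
import subprocess
from pathlib import Path

MIGRATION_FLAG = Path("/var/lib/sdw/migration-required")  # hypothetical

def two_phase_update() -> bool:
    # Phase 1: dom0, via the native dom0 update mechanism.
    if subprocess.run(["sudo", "qubes-dom0-update", "-y"]).returncode != 0:
        return False  # bail; don't continue with the rest of the attempt
    if MIGRATION_FLAG.exists():
        subprocess.run(["sdw-admin", "--apply"], check=True)  # orchestration
        MIGRATION_FLAG.unlink()
    # Phase 2: templates and service VMs, restarting qubes afterwards.
    targets = "sys-usb,sys-net,sys-firewall"  # plus the SD templates
    result = subprocess.run(
        ["qubes-vm-update", "--targets", targets, "--restart"]
    )
    return result.returncode == 0
```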

deeplow commented 7 months ago

I have updated the diagram according to these new changes (plus I added the "check if we should update" part).

[Flowchart: revised updater flow with dom0-first ordering and the "check if we should update" step (Mermaid Live Editor)]

deeplow commented 7 months ago

Unfolding the Updater's UI

Thinking a bit more about the dom0-updates-first part: this has some UX implications that require adapting the current user interface. In particular, I think we'll have to:

  1. Remove the progress bar
  2. Update dom0 in background (without the updater GUI)
  3. Have a button to launch the Qubes updater. If we start it automatically after updating dom0, the user may be confused. On the other hand, it requires one more click, which can be annoying for something that happens every day.

Do we want the updater to be fully unattended? Just click start (grab a cup of coffee, etc.)?

I have made some mock-ups:

I'm thinking that the dom0 updates could be handled internally without exposing anything to the user. This means that what used to be a continuous progress bar should probably go away.

[Mockup images: "Board" and "Board(1)"]

zenmonkeykstop commented 7 months ago

Looks good! Would be good to add error states in there as well.