freedomofpress / securedrop

GitHub repository for the SecureDrop whistleblower platform. Do not submit tips here!
https://securedrop.org/

Figure out noble upgrade cadence plan #7333

Open · legoktm opened this issue 6 days ago

legoktm commented 6 days ago

Description

Instead of upgrading every single instance at the exact same time (once we push a deb), I think it would be better to do some sort of staged rollout.

My proposal would be that on package upgrade, each instance generates a random number (1-5) and stores it somewhere. In theory we've now split all the securedrop servers into 5 groups.

Then, in another file we ship with the package (possibly the upgrade script itself) we have a number we control. If we set it to 1, we'll upgrade ~20% of servers. Then we can do another deb package release to bump it to 2 to upgrade ~40% of all servers. And so on.

I also think this mechanism should be tracked separately for app and mon: we should upgrade all mon servers to 100% and then do all the app servers.
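A rough sketch of what that gating could look like (file locations and helper names here are purely illustrative, nothing is settled):

```python
# Sketch of the proposed cohort gating; paths and names are illustrative.
import random
from pathlib import Path

COHORT_FILE = Path("/var/lib/securedrop/upgrade-cohort")          # generated once per instance
THRESHOLD_FILE = Path("/usr/share/securedrop/upgrade-threshold")  # shipped (and bumped) in the deb


def get_cohort() -> int:
    """Assign this instance to one of five cohorts on first run and persist it."""
    if COHORT_FILE.exists():
        return int(COHORT_FILE.read_text().strip())
    cohort = random.randint(1, 5)
    COHORT_FILE.write_text(f"{cohort}\n")
    return cohort


def upgrade_enabled() -> bool:
    """Proceed only if our cohort is within the shipped threshold (1 = ~20%, ..., 5 = 100%)."""
    threshold = int(THRESHOLD_FILE.read_text().strip())
    return get_cohort() <= threshold
```

Bumping the shipped threshold from 1 through 5 over successive deb releases would then pull in roughly 20% more instances each time, and app and mon could read separate threshold files so their rollouts stay independent.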

legoktm commented 6 days ago

Also, for instances with hands-on administrators, we can give them a heads-up and let them manually run the migration script before our auto/forced migration.

nathandyer commented 6 days ago

(Early thoughts, not fully formed)

I like the idea of admins being in control of the migration, unless there's a situation where there's not a hands-on admin and we run up against the deadline.

What about an alternate approach that might look like:

0) We select a hard deadline date for the auto upgrade (for argument's sake, March 1st). Prior to March 1st, admins can manually kick off the upgrade from an Admin Workstation. The mechanism for that might be:

1) We publish the package with the noble upgrade script on a new (temporary?) Apt server
2) We update securedrop-admin on the Admin Workstations with a securedrop-admin noble-upgrade command, which essentially adds the new Apt server to the sources list on both app and mon (see the sketch below)
3) Admins can manually update before the deadline date
4) When the deadline happens, we promote the packages to the normal apt prod servers and "force" the update
5) We retire the temporary Apt server
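As a rough illustration of step 2 (the hostname and filename here are made up, not an actual FPF server):

```
# Hypothetical /etc/apt/sources.list.d/securedrop-noble-upgrade.list entry
# added on app and mon by `securedrop-admin noble-upgrade`; the repository
# hostname is a placeholder.
deb https://apt-noble-upgrade.example.org focal main
```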

zenmonkeykstop commented 6 days ago

Is the idea behind phasing it to have some level of feedback as to how it's going, and to not have a massive volume of support requests if things go wrong? If so (building on @nathandyer's proposal), we kind of get that already if we give folks the option to migrate ahead of time and get the feedback of the first ones off the ice.

zenmonkeykstop commented 6 days ago

We don't need temporary apt servers or anything, though: we can ship the changes packaged as normal and just have EOL checks.

legoktm commented 6 days ago

Is the idea behind phasing it to have some level of feedback as to how it's going, and to not have a massive volume of support requests if things go wrong?

Yes, and (if things go poorly) we shouldn't take down every single SecureDrop all at once.

legoktm commented 1 day ago

To merge Nathan's proposal with mine:

  1. We ship debs and admin workstation code that installs the upgrade scripts but doesn't do anything automatically.
  2. Admins can do ./securedrop-admin noble-upgrade until some set deadline. We recommend this, but don't require it.
  3. After the deadline passes, we push debs to enable the upgrade process to run automatically on mon servers. Depending on how many instances are left, we split this up into batches.
  4. Once we've finished mon servers, we repeat for app servers.

cfm commented 1 day ago

It looks like APT has built-in support for phased updates. Could that work for us, so we don't have to implement it ourselves? (This might be worth looking into generally.)

legoktm commented 1 day ago

Thanks for flagging that. Unfortunately focal's apt doesn't support phasing, so it's not an option for us here, but it will become an option once we do upgrade to noble; let me file a separate task for that.
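For reference when that task gets picked up: phasing is expressed as a field in the package's stanza in the repo's Packages index, which a new enough apt (noble's included) honors client-side. Illustrative only; our repo doesn't emit this today:

```
Package: securedrop-app-code
Version: x.y.z
Phased-Update-Percentage: 20
```

apt also exposes options like APT::Get::Always-Include-Phased-Updates so individual machines can opt in or out of phasing.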

legoktm commented 1 day ago

One point made in today's team meeting is that the admin-instigated upgrade period will give us a good sense of how robust the upgrade process is and inform how important spreading out the upgrades is.

Another thing I clarified is that the point of having mon go before app is so that we have a consistent state to test against. I don't want both servers upgrading at the same time, in a weird, undefined, hard-to-test state. So one should go first, and then we upgrade the second. No strong opinion on whether it's app or mon, just that it's a defined order we can replicate during testing.

cfm commented 3 hours ago

On Wed, Nov 13, 2024 at 12:31:42PM -0800, Kunal Mehta wrote:

Another thing I clarified is that the point of having mon go before app is so that we have a consistent state to test against. I don't want both servers upgrading at the same time, in a weird, undefined, hard-to-test state. So one should go first, and then we upgrade the second. No strong opinion on whether it's app or mon, just that it's a defined order we can replicate during testing.

I agree, @legoktm. In any cloud deployment we would be able to stagger these, but our Ansible playbooks effectively run against the Application and Monitor Servers in parallel at each step.

In the automatic scenario, how will we (and an administrator) know that a Monitor Server has been upgraded successfully? No "/metadata" endpoint to monitor there.

legoktm commented 2 hours ago

In any cloud deployment we would be able to stagger these, but our Ansible playbooks effectively run against the Application and Monitor Servers in parallel at each step.

To clarify, even for the manual administrator-initiated upgrade, I would still want to do them in series (mon first, then app).

In the automatic scenario, how will we (and an administrator) know that a Monitor Server has been upgraded successfully? No "/metadata" endpoint to monitor there.

We/FPF will have no visibility into mon upgrades (maybe we can peek at apt-prod web request logs I guess).

For admins we'll send some sort of message via OSSEC alerts (e.g. logger.error("mon server has been upgraded")).
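Something along these lines from the upgrade script, for example (the message text, priority, and whether OSSEC's default rules would actually forward it are all TBD):

```python
# Illustrative only: emit a syslog message on mon for OSSEC to pick up and
# email to admins; the ident, priority, and message are placeholders.
import syslog

syslog.openlog(ident="securedrop-noble-upgrade", facility=syslog.LOG_USER)
syslog.syslog(syslog.LOG_ERR, "SecureDrop noble upgrade finished on this server")
```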

zenmonkeykstop commented 33 minutes ago

Doing it for the manual updates is probably easier: you can just modify Ansible's inventory. Though there might be some refactoring necessary if roles depend on info shared between app and mon.
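One way that sequencing could look in playbook form (the group names are assumed to match the existing inventory, the role name is hypothetical, and this is untested):

```yaml
# Run the whole upgrade against mon first, then app; plays in a playbook
# execute in order, so the second play won't start until the first finishes.
- name: Upgrade Monitor Server
  hosts: securedrop_monitor_server
  roles:
    - noble-upgrade   # hypothetical role name

- name: Upgrade Application Server
  hosts: securedrop_application_server
  roles:
    - noble-upgrade
```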

But it still largely feels over-engineered for the automated case to me:

If we have effectively a single script for admins to run manually, and we push for those we're in contact with to do so, we'll have a lot of data and chances to observe the script's behaviour before the automated run anyway.

As an aside, I am very leery of trying to infer stuff from apt repo stats:

legoktm commented 18 minutes ago

If admins are just letting the updates run automatically and it fails on mon, they won't get an OSSEC alert (because mon is down), so they won't know to investigate, and app will likely get updated automatically too (and might fail in the same way if, say, there's a hardware compatibility issue).

That's a really good point and seems like a good rationale to do app before mon. If app fails, we can send OSSEC alerts via mon, and either it's down so we notice, or we can display something in the JI to further flag it for journalists/admins.

If we have effectively a single script for admins to run manually, and we push for those we're in contact with to do so, we'll have a lot of data and chances to observe the script's behaviour before the automated run anyway. <snip> But it still largely feels over-engineered for the automated case to me:

Which part do you think is over-engineered? Or: what would you want to do differently?

I think we have a different perspective/disagreement on how much we should be leaning into automatic vs. manually driven upgrades? My current perspective is that we should be making the auto upgrade more robust/feasible/safe/etc., even at the risk of overdoing it, since that places the cost on us rather than on administrators.