legoktm opened this issue 6 days ago
Also, for instances with hands-on administrators, we can give them a heads up and let them manually run the migration script before our auto/forced migration.
(Early thoughts, not fully formed)
I like the idea of admins being in control of the migration, unless there's a situation where there's not a hands-on admin and we run up against the deadline.
What about an alternate approach that might look like:
0) We select a hard deadline date for the auto upgrade (for argument's sake, March 1st). Prior to March 1st, Admins can manually kick off the upgrade from an Admin Workstation. The mechanism might be:
1) We publish the package with the noble upgrade script on a new (temporary?) Apt server
2) We update securedrop-admin on the Admin Workstations with a securedrop-admin noble-upgrade command, which essentially adds the new Apt server to the sources list on both app and mon (see the sketch after this list).
3) Admins can manually update before the deadline date
4) When the deadline happens, we promote the packages to the normal apt prod servers and "force" the update
5) We retire the temporary Apt server
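For illustration only, here is a minimal sketch of what step 2 could boil down to on each server, assuming a hypothetical temporary repository URL and source file name (in practice securedrop-admin would push this change from the Admin Workstation via the existing Ansible playbooks):

```python
# Hypothetical sketch of the server-side effect of "securedrop-admin noble-upgrade":
# dropping a temporary apt source onto app and mon so the noble upgrade package
# becomes installable. The hostname, suite, and file name are placeholders, not
# the real infrastructure.
from pathlib import Path

TEMP_SOURCE = Path("/etc/apt/sources.list.d/securedrop-noble-upgrade.list")  # placeholder
APT_LINE = "deb https://apt-noble-upgrade.example.org focal main\n"          # placeholder URL

def enable_temporary_source() -> None:
    """Add the temporary apt source (step 2/3: opt-in upgrades before the deadline)."""
    if not TEMP_SOURCE.exists():
        TEMP_SOURCE.write_text(APT_LINE)

def disable_temporary_source() -> None:
    """Remove the temporary source once packages are promoted to the prod apt server (step 5)."""
    TEMP_SOURCE.unlink(missing_ok=True)
```

Step 5 would then just be the inverse of step 2: removing the temporary source once the packages are promoted to the normal prod apt servers.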
Is the idea behind phasing it to have some level of feedback as to how it's going and to not have a massive volume of support requests if things go wrong? If so (building on @nathandyer's proposal), we kind of get that already if we give folks the option to migrate ahead of time and get the feedback of the first ones off the ice.
We don't need temporary apt servers or anything, though - we can ship the changes packaged as normal and just have EOL checks.
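As a rough illustration of what such an EOL check in the upgrade tooling might look like (the cutoff date below just mirrors the hypothetical March 1st deadline from this thread, and the codename handling is a sketch, not a decided design):

```python
# Rough sketch of an EOL gate: only proceed with the automatic upgrade once the
# running release is still focal and a cutoff date has passed. The date is a
# placeholder mirroring the hypothetical deadline discussed above.
from datetime import date
from typing import Optional

FOCAL_AUTO_UPGRADE_AFTER = date(2025, 3, 1)  # placeholder cutoff, not a decided deadline

def current_codename() -> str:
    """Read VERSION_CODENAME from /etc/os-release (e.g. "focal" or "noble")."""
    with open("/etc/os-release") as f:
        for line in f:
            if line.startswith("VERSION_CODENAME="):
                return line.split("=", 1)[1].strip().strip('"')
    return "unknown"

def should_auto_upgrade(today: Optional[date] = None) -> bool:
    today = today or date.today()
    return current_codename() == "focal" and today >= FOCAL_AUTO_UPGRADE_AFTER
```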
Is the idea behind phasing it to have some level of feedback as to how it's going and to not have a massive volume of support requests if things go wrong
Yes, and (if things go poorly) we shouldn't take down every single SecureDrop all at once.
To merge Nathan's proposal with mine: admins can run ./securedrop-admin noble-upgrade until some set deadline. We recommend this, but don't require it.
It looks like APT has built-in support for phased updates. Could that work for us, so we don't have to implement it ourselves? (This might be worth looking into generally.)
Thanks for flagging that. Unfortunately, focal's apt doesn't support phasing, so it's not an option for us here, but it will become one once we do upgrade to noble; let me file a separate task for that.
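For context, here's a simplified model of the idea behind apt's phased updates: each machine derives a stable percentile and compares it against a published Phased-Update-Percentage. This illustrates the concept only; it is not apt's exact algorithm:

```python
# Simplified illustration of apt-style phasing: derive a stable percentile from
# the machine-id plus the package name/version, and only install the update if
# that percentile falls under the published Phased-Update-Percentage.
# This mimics the idea, not apt's exact implementation.
import hashlib

def phasing_percentile(machine_id: str, source: str, version: str) -> int:
    digest = hashlib.sha256(f"{source}-{version}-{machine_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def include_in_phase(machine_id: str, source: str, version: str, phased_pct: int) -> bool:
    return phasing_percentile(machine_id, source, version) < phased_pct
```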
One point made in today's team meeting is that the admin-instigated upgrade period will give us a good sense of how robust the upgrade process is and inform how important spreading out the upgrades is.
Another thing I clarified is that the point of having mon go before app is so that we have a consistent state to test against. I don't want both servers upgrading at the same time, in a weird undefined/hard to test state. So one should go first, and then we upgrade the second. No strong opinion on whether it's app or mon, but that it's a defined order we can replicate during testing.
On Wed, Nov 13, 2024 at 12:31:42PM -0800, Kunal Mehta wrote:
Another thing I clarified is that the point of having mon go before app is so that we have a consistent state to test against. I don't want both servers upgrading at the same time, in a weird undefined/hard to test state. So one should go first, and then we upgrade the second. No strong opinion on whether it's app or mon, but that it's a defined order we can replicate during testing.
I agree, @legoktm. In any cloud deployment we would be able to stagger these, but our Ansible playbooks effectively run against the Application and Monitor Servers in parallel at each step.
In the automatic scenario, how will we (and an administrator) know that a Monitor Server has been upgraded successfully? No "/metadata" endpoint to monitor there.
In any cloud deployment we would be able to stagger these, but our Ansible playbooks effectively run against the Application and Monitor Servers in parallel at each step.
To clarify, even for the manual administrator-initiated upgrade, I would still want to do them in series (mon first, then app).
In the automatic scenario, how will we (and an administrator) know that a Monitor Server has been upgraded successfully? No "/metadata" endpoint to monitor there.
We/FPF will have no visibility into mon upgrades (maybe we can peek at apt-prod web request logs I guess).
For admins we'll send some sort of message via OSSEC alerts (i.e. logger.error("mon server has been upgraded")).
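A minimal sketch of how the upgrade script could emit such a message to syslog so the existing OSSEC log monitoring has a chance to forward it; whether this exact message generates an email alert depends on the OSSEC rule levels in place:

```python
# Minimal sketch: write an upgrade-status message to syslog at a high priority so
# the existing OSSEC log monitoring has a chance to forward it to the admin.
# Whether this particular message produces an email depends on OSSEC rule levels.
import syslog

def announce_upgrade_result(hostname: str, success: bool) -> None:
    syslog.openlog(ident="securedrop-noble-upgrade", facility=syslog.LOG_USER)
    if success:
        syslog.syslog(syslog.LOG_ERR, f"{hostname} has been upgraded to noble")
    else:
        syslog.syslog(syslog.LOG_CRIT, f"noble upgrade FAILED on {hostname}, manual intervention required")
```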
Doing it for the manual updates is probably easier; you can just modify Ansible's inventory, though there might be some refactoring necessary if roles depend on info shared between app and mon.
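As a sketch of how the manual run could be staggered without reworking the playbooks, one option is invoking them once per host via --limit; the playbook and inventory names here are placeholders, not the real securedrop-admin layout:

```python
# Sketch of staggering the manual upgrade by running the playbook once per host,
# one server at a time in a fixed order. Playbook and inventory paths are
# placeholders; the actual order (mon-first vs app-first) is whatever the team settles on.
import subprocess

def run_staged_upgrade(playbook: str = "securedrop-noble-upgrade.yml",
                       inventory: str = "inventory") -> None:
    for host in ("mon", "app"):  # fixed, replicable order
        subprocess.run(
            ["ansible-playbook", "-i", inventory, playbook, "--limit", host],
            check=True,  # stop before touching the second server if the first run fails
        )
```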
But it still largely feels over-engineered for the automated case to me:
- while upgrading mon, admins should shut down app anyway, so downtime would be unavoidable
- if admins are just letting the updates run automatically, and it fails on mon, they won't get an OSSEC alert (coz mon is down) so they won't know to investigate, so app will likely get updated automatically too (and might fail in the same way if, say, there's a hardware compatibility issue)
- if we have effectively a single script for admins to run manually, and we push for those we're in contact with to do so, we'll have a lot of data and chances to observe the script behaviour before the automated run anyway
As an aside, I am very leery of trying to infer stuff from apt repo stats.
if admins are just letting the updates run automatically, and it fails on mon, they won't get an OSSEC alert (coz mon is down) so they won't know to investigate, so app will likely get updated automatically too (and might fail in the same way if, say, there's a hardware compatibility issue)
That's a really good point and seems like a good rationale to do app before mon. If app fails, we can send OSSEC alerts via mon, and either app is down entirely (so we notice), or we can display something in the JI (Journalist Interface) to further flag it for journalists/admins.
If we have effectively a single script for admins to run manually, and we push for those we're in contact with to do so, we'll have a lot of data and chances to observe the script behaviour before the automated run anyway.
<snip>
But it still largely feels over-engineered for the automated case to me:
Which part do you think is over-engineered? Or: what would you want to do differently?
I think we have a different perspective/disagreement on how much we should be leaning into automatic vs. manually driven upgrades. My current perspective is that we should be making the auto upgrade more robust/feasible/safe/etc., even at the risk of overdoing it, because that places the cost on us rather than on administrators.
Description
Instead of upgrading every single instance at the exact same time (once we push a deb), I think it would be better to do some sort of staged rollout.
My proposal would be that on package upgrade, each instance generates a random number (1-5) and stores it somewhere. In theory we've now split all the securedrop servers into 5 groups.
Then, in another file we ship with the package (possibly the upgrade script itself) we have a number we control. If we set it to 1, we'll upgrade ~20% of servers. Then we can do another deb package release to bump it to 2 to upgrade ~40% of all servers. And so on.
I also think this mechanism should be split for both app and mon. We should upgrade all mon servers to 100% and then do all the app servers.
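A minimal sketch of this bucketing mechanism, assuming hypothetical file locations and a per-role threshold shipped with the package (none of these names are decided):

```python
# Minimal sketch of the proposed staged rollout: each instance draws a persistent
# random group (1-5) once, and the package ships a rollout threshold per server
# role; the upgrade only proceeds when the instance's group is <= the threshold.
# File paths and the threshold format are placeholders, not decided names.
import json
import random
from pathlib import Path

GROUP_FILE = Path("/var/lib/securedrop/upgrade-group")            # placeholder
THRESHOLD_FILE = Path("/usr/share/securedrop/upgrade-threshold")  # shipped in the deb (placeholder)

def assigned_group() -> int:
    """Generate the 1-5 group on first run and persist it for later rollout phases."""
    if GROUP_FILE.exists():
        return int(GROUP_FILE.read_text().strip())
    group = random.randint(1, 5)
    GROUP_FILE.write_text(str(group))
    return group

def should_upgrade(role: str) -> bool:
    """role is "app" or "mon"; mon would reach threshold 5 (100%) before app starts."""
    thresholds = json.loads(THRESHOLD_FILE.read_text())  # e.g. {"mon": 5, "app": 0}
    return assigned_group() <= thresholds.get(role, 0)
```

Bumping the shipped threshold from 1 through 5 in successive deb releases is what moves the rollout from ~20% to 100% of instances, and keeping separate app/mon thresholds lets all mon servers finish before any app server starts.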