PhilanthropyDataCommons / deploy

Deployment scripts for Philanthropy Data Commons service
GNU Affero General Public License v3.0

Migrate to DigitalOcean Apps #118

Closed: slifty closed this 2 weeks ago

slifty commented 1 month ago

We have a deployment system with a lot of good qualities, but it's also bespoke / custom. This means there are long-term costs inherent in any custom system, both in the form of opportunity cost and of increased complexity / maintenance.

The idea of switching to DigitalOcean Apps has been bubbling around for a little while now. Here are some reasons why this is a direction we want to pursue.

Please note that I'm pulling some of these from other threads / emails / meeting notes over time and did not write them all myself! I do agree with these points, but don't want to take credit for @jasonaowen's thoughts!

Benefits of moving

  1. What we have today is well built but is also ad-hoc, which means we are responsible for maintaining, improving, and monitoring it. One example of where this manifests in reliability and team attention is that there are failure states we need to worry about, such as running out of disk space (which I believe has actually happened).

  2. On-boarding is more difficult in general, as it requires project-specific information rather than using industry standard solutions. We don't on-board often, but I do think this point is a relevant signal of overall "mental load" that bespoke solutions can present to a team in general (even after on-boarding).

  3. There is a fair amount of tacit knowledge involved in running it, which means we rely on specific team members to know / remember how it works. This has been de-risked because there is documentation, and others could figure it out with enough time.

  4. A platform like DOAP provides a lot of peripheral benefits and tooling that would not make sense for a small team like ours to build into a custom system, but which we would still benefit from. For instance, DOAP makes it easy to find out what version was deployed, when, and why; see the status of deployments; view deployment logs; roll back a bad patch; coordinate upgrades to prevent any downtime; etc.

  5. I'm going to pull out the downtime problem explicitly: my understanding is that the current deploy system incurs downtime every time we merge to main; this is a more meaningful problem. I'm sure we could build a solution, but that is engineering time that could be spent on product instead.

  6. We have to stay on top of OS dependency upgrades in the form of Docker container versions, as in PR deploy#115; this means there will be times when we are behind, and once again this is engineering attention that could be spent on product.

  7. Our backups are not as sophisticated as what DigitalOcean's managed PostgreSQL instances offer. In particular, DO offers automatic failover and point-and-click database restores.

Concerns around moving

One concern around using DOAP is vendor lock-in (particularly non-libre lock-in). @jasonaowen has pointed out that it is generally very easy to switch to another platform-as-a-service provider, as they adhere to various industry standards in architecture. We are already following those standards (which is why switching to DOAP is not a heavy lift).

There was also a question about whether there is a free software alternative to DOAP. @jasonaowen has mentioned that such solutions do exist, but we would not want the PDC team to be responsible for setting up and maintaining one (as opposed to the OTS infrastructure team). From a cost / benefit perspective, DOAP makes more sense in the immediate term, until OTS decides there is strategic merit to hosting a libre solution.

jasonaowen commented 1 month ago

Once PhilanthropyDataCommons/service#1236 and PhilanthropyDataCommons/front-end#852 have been merged, I intend to also migrate prod. The steps to do so are roughly:

  1. click on the Deploy to DO button in front-end, and configure it with the appropriate environment variables
  2. click on the Deploy to DO button in service, and configure it with the appropriate environment variables
  3. provision a managed database, and grant access to the api.pdco droplet
  4. backup and restore the database; roughly, pg_dump | psql, as sketched below - see https://github.com/PhilanthropyDataCommons/service/pull/1236#issuecomment-2392413690 for more details
  5. attach the managed database to the new app and adjust the environment variables accordingly
  6. contact @jmergy to update the DNS records to point to the new instances
  7. talk with the OTS infra team about the old droplets
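
For step 4, a minimal sketch of the dump-and-restore pipeline (every host, credential, and database name below is a placeholder, not the real value; see the linked comment for the actual procedure):

```sh
# Hypothetical sketch of the pg_dump | psql migration; all connection
# details here are placeholders. --no-owner and --no-privileges avoid
# role mismatches between the old cluster and the new managed cluster.
pg_dump \
  --host=old-droplet.example.org \
  --username=pdc \
  --dbname=pdc \
  --no-owner \
  --no-privileges \
  | psql 'postgresql://doadmin:PLACEHOLDER@new-managed-db.example.com:25060/pdc?sslmode=require'
```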
bickelj commented 1 month ago

@jasonaowen @kfogel I wonder if we should keep infra notes in the traditional location for this migration as well as on GitHub. What do you think? It doesn't look like we kept DNS changes in the server updates notes for the droplets, so that's one question, but then changes to DO configurations are a separate matter.

jasonaowen commented 4 weeks ago

@kfogel says "yes" to @bickelj's question, and I will tag the infra team on the PRs.

jasonaowen commented 4 weeks ago

@smpsnr and @tris-ots, we're leaning towards having the PDC team own this infrastructure, as its configuration is almost entirely app-specific, and should require relatively little infra team maintenance. However, we want your opinion on this! How involved do you want to be? What do you feel like you need to know?

jasonaowen commented 4 weeks ago

I wonder if we should keep infra notes in the traditional location for this migration as well as on GitHub.

Documented in SVN in r23928.

tris-ots commented 4 weeks ago

However, we want your opinion on this! How involved do you want to be? What do you feel like you need to know?

In general, it seems like the transition to DO Apps should make life easier, with fewer machines for Sam and me to manage. (I'm guessing we're going to decommission the PDC droplets sometime soon?) My first thought is that the main role for us will be in performing whatever management tasks need to be done when the infra team is available and the PDC team isn't, like if something goes awry after east/central business hours - so what we'd need is to have a sense of what signals to respond to and how. @jasonaowen I imagine we'll talk about this at the infra checkin?

jasonaowen commented 3 weeks ago

In general, it seems like the transition to DO Apps should make life easier, with fewer machines for Sam and me to manage. (I'm guessing we're going to decommission the PDC droplets sometime soon?)

Yes, that is the plan!

My first thought is that the main role for us will be in performing whatever management tasks need to be done when the infra team is available and the PDC team isn't, like if something goes awry after east/central business hours - so what we'd need is to have a sense of what signals to respond to and how.

That is a good thought, and I'll try to figure out what management tasks there might be! Right now the only thing I can think of is needing to roll back / recover from a managed database failure.

@jasonaowen I imagine we'll talk about this at the infra checkin?

Yep, see you soon!

jasonaowen commented 3 weeks ago

I've created the new production database, migrated data from the old production database, put the old API into read-only mode, created new app instances for the front end and the API, and requested DNS changes from @jmergy. Once those DNS changes propagate, the only work left to do will be to decommission the old droplets and update the code in this repo to only touch Keycloak.

jmergy commented 3 weeks ago

I've requested the DNS changes per @jasonaowen. I appreciate the benefits of the shift outlined by @slifty and team, but am also concerned, as the team has referenced, about vendor lock-in versus the benefits. We're leaning on OTS on this and have not been prescriptive on such elements of PDC, so let's see how this goes.

jmergy commented 3 weeks ago

I've requested the DNS changes per @jasonaowen. I appreciate the benefits of the shift outlined by @slifty and team, but am also concerned, as the team has referenced, about vendor lock-in versus the benefits. We're leaning on OTS on this and have not been prescriptive on such elements of PDC, so let's see how this goes.

And @kfogel just addressed my concerns separately. All good. Thank you. I'll alert when DNS is updated.

bickelj commented 3 weeks ago

From the updated OP, I think 1, 5, and 7 are valid; 4 is partially valid, but the examples of benefits were already present in the previous solution. The problems in 2, 3, and 6 are still present with the DO solution.

I noticed Downloading and installing node 20.15.1... in the most recent deployment, which is a downgrade from what was built and tested in the service repo. Perhaps this problem can be solved by specifying an exact node version in .node-version, but I am not sure.

slifty commented 3 weeks ago

This is not related to DO Apps, but in reference to the concern you just mentioned: I am still unsure of the merits of specifying minor / patch versions of Node (this is the only project I've ever seen that specifies an exact version of node in tests, as opposed to a minimum version in engines). From what I've seen, a project would only specify a particular version if it relies on a feature from that version, and at that point it would not be a pin but rather a minimum defined in the package.json engines attribute.

For instance: https://github.com/remix-run/remix/blob/main/.nvmrc and https://github.com/remix-run/remix/blob/11c2acc6d901442e07934fd78f1098040f616382/package.json#L138
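
To make the two approaches concrete, a sketch (the version numbers are illustrative, not the project's actual settings):

```sh
# Option A (a pin): record the exact tested Node.js version in
# .node-version so the buildpack installs precisely that release.
echo "20.18.0" > .node-version

# Option B (a floor): declare only a minimum via the "engines"
# field in package.json and accept any compatible release, e.g.:
#
#   "engines": { "node": ">=20" }
```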

jasonaowen commented 3 weeks ago

I noticed Downloading and installing node 20.15.1... in the most recent deployment, which is a downgrade from what was built and tested in the service repo. Perhaps this problem can be solved by specifying an exact node version in .node-version, but I am not sure.

Good eye, @bickelj!

It appears to be because they're using an older version of heroku/heroku-buildpack-nodejs, v260, instead of the most recent version, v266, which includes Node.js 20.18.0.

I opened a support ticket asking about their buildpack version policy, which CC'd infra-notices. I'll relay their answer here.

bickelj commented 3 weeks ago

@jasonaowen I saw the ticket and response from DO, thanks for following up with them.

The only other thing I see at the moment is that it feels a bit sluggish to me. Requests seem to take over 400ms, even after hitting the endpoint once or twice. About 100ms of that is my VPN, but I am not sure about the other (seemingly extra) 250ms. The good news is that it seems to take about the same time regardless of the quantity of data in the response. Performance is not a big concern; I just would not have expected a noticeable decrease. I guess the DB being on a separate machine could account for some of that. In any case, not huge.
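
For what it's worth, one way to break down where that time goes is curl's timing variables (a sketch; the URL is a placeholder, not the actual PDC endpoint):

```sh
# Hypothetical latency breakdown; the URL below is a placeholder.
curl --output /dev/null --silent \
  --write-out 'dns=%{time_namelookup}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  https://api.example.org/health
```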

jasonaowen commented 3 weeks ago

That response, for posterity:

We appreciate you reaching out to us about build pack updates. In App Platform we tend to be behind the LTS releases of build packs for additional testing and checks before releasing to our systems. A bit more information on our entire upgrade policies can be found here: https://docs.digitalocean.com/products/app-platform/concepts/platform-upgrade-policy/

We do update our release notes when new build packs are pushed out as well. https://docs.digitalocean.com/release-notes/app-platform/

jasonaowen commented 3 weeks ago

Performance is not a big concern; I just would not have expected a noticeable decrease.

We're also using the least expensive resources; if and when performance does become a concern, that's definitely a knob worth turning.

bickelj commented 3 weeks ago

I see in the August notes an interesting option, considering we have been producing an OCI-compatible image and publishing it to GHCR all along:

Build an image of your app using Docker and GitHub Container Registry, and then deploy the image to App Platform.

That would easily allow the repo to control which version of each dependency is used, just like before.
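
For illustration, deploying a prebuilt GHCR image might look roughly like this in an app spec (a sketch, assuming App Platform's GHCR support applies here; all names, tags, and ports below are placeholders):

```sh
# Hypothetical app spec using a prebuilt image from GHCR instead of
# building from source; every value below is a placeholder.
cat > app.yaml <<'EOF'
name: pdc-api
services:
  - name: api
    image:
      registry_type: GHCR
      registry: philanthropydatacommons
      repository: service
      tag: latest
    http_port: 3000
EOF

# Create the app from the spec with doctl.
doctl apps create --spec app.yaml
```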

bickelj commented 3 weeks ago

@jasonaowen Is the deployment to production conditional on successful deployment to test? Or is it all rebuilt from scratch again and deployed to production regardless of deployment to test? I assume that to make it conditional we'd need the DNS changes to be complete.

bickelj commented 3 weeks ago

This is not related to DO Apps, but in reference to the concern you just mentioned: I am still unsure of the merits of specifying minor / patch versions of Node (this is the only project I've ever seen that specifies an exact version of node in tests, as opposed to a minimum version in engines). From what I've seen, a project would only specify a particular version if it relies on a feature from that version, and at that point it would not be a pin but rather a minimum defined in the package.json engines attribute.

For instance: https://github.com/remix-run/remix/blob/main/.nvmrc and https://github.com/remix-run/remix/blob/11c2acc6d901442e07934fd78f1098040f616382/package.json#L138

One of the key benefits of the previous system is precise versioning. To a large extent, if we tested with version x.y.z, we deployed with version x.y.z. I think testing exactly what you deploy, as far as you practically can, is good practice in any language. To your point, it usually works OK if we do not have such precision for major versions of language runtimes. If I tested an app with Python 3.9 and we upgrade the runtime to Python 3.11, no big deal, it's probably going to work fine, and so on for node apps. But in general, if we can pin exact versions of dependencies all the way down the dependency tree (yes, even into the operating system), we eliminate a whole lot of potential variation that can cause differences and issues in production.

jasonaowen commented 2 weeks ago

@jasonaowen Is the deployment to production conditional on successful deployment to test? Or is it all rebuilt from scratch again and deployed to production regardless of deployment to test? I assume that to make it conditional we'd need the DNS changes to be complete.

No, they are separate apps. Heroku offers such flexibility via their pipelines feature, but DigitalOcean does not appear to have something similar.

We could build out a PR preview feature, but I think that should be a follow-up task that we prioritize separately.

jasonaowen commented 2 weeks ago

I see in the August notes an interesting option, considering we have been producing an OCI-compatible image and publishing it to GHCR all along:

Build an image of your app using Docker and GitHub Container Registry, and then deploy the image to App Platform.

That would easily allow the repo to control which version of each dependency is used, just like before.

It could, but that puts more maintenance responsibility onto us, which I would prefer to avoid. I do not find the benefit of precisely specifying minor versions of the runtime to be worth the developer time, and am happy to delegate that to the platform.

... I am still unsure of the merits of specifying minor / patch versions of Node ...

One of the key benefits of the previous system is precise versioning. To a large extent, if we tested with version x.y.z, we deployed with version x.y.z. I think testing exactly what you deploy, as far as you practically can, is good practice in any language.

I agree in general, but I don't think that approach is a good fit for this project, for two reasons. First, we're not using any features on the bleeding edge (...of the LTS version), so it is very unlikely that we'll be affected one way or another by minor versions of Node.js; in fact, I don't think I've ever been affected by a non-security update to Node.js. Second, we have very limited dev time, which is what this migration is intended to address; spending it on runtime version maintenance does not seem like a wise investment to me.

tl;dr: I would rather delegate this responsibility to DOAP

To your point, it usually works OK if we do not have such precision for major versions of language runtimes. If I tested an app with Python 3.9 and we upgrade the runtime to Python 3.11, no big deal, it's probably going to work fine, and so on for node apps.

At the risk of well-actuallying you, this is actually a very different case! Python minor versions are equivalent to Node major versions; basically, Python 3 is a different language than Python 2, and so I do not consider the leading 3 to be part of a semver version number. There are major breaking changes between "minor" versions of Python, including removing deprecated modules.

A fairer comparison would be between, say, Python 3.13 and Node 22: both are release lines that have lifecycle policies around long term support and maintenance, and which should have no breaking changes within that release line.

jasonaowen commented 2 weeks ago

I've requested the DNS changes per @jasonaowen

These have all gone through! The production and test instances of both the front end and the back end are now hosted by DigitalOcean App Platform.

The next step is to start decommissioning the old continuous deployment automation.

jasonaowen commented 2 weeks ago

With PhilanthropyDataCommons/service#1265 merged, this repo is no longer receiving notifications about changes to PhilanthropyDataCommons/service, and so its GitHub action should not be automatically triggered. Although there is some cleanup left to do in this repo, this issue is complete.

Follow-up issues: #120 Remove old deploy system; #121 Decommission droplets

bickelj commented 1 week ago

Jason said:

It could, but that puts more maintenance responsibility onto us, which I would prefer to avoid. I do not find the benefit of precisely specifying minor versions of the runtime to be worth the developer time, and am happy to delegate that to the platform.

I find the benefit of testing what we deploy and deploying what we test to be worth it. I could look up the amount of time spent specifying minor versions of the runtime but I predict it will take me longer to look that up than the amount of time spent doing several of those updates.

I agree in general, but I don't think that approach is a good fit for this project, for two reasons. First, we're not using any features on the bleeding edge (...of the LTS version), so it is very unlikely that we'll be affected one way or another by minor versions of Node.js; in fact, I don't think I've ever been affected by a non-security update to Node.js. Second, we have very limited dev time, which is what this migration is intended to address; spending it on runtime version maintenance does not seem like a wise investment to me.

tl;dr: I would rather delegate this responsibility to DOAP

Your argument here is essentially "node is different: we probably won't be affected by changes to the runtime." I do not have the same amount of experience with node in particular, but my experience with Java is similar, and my approach there is still the same. We use a package-lock.json for good reason, and the same reason applies to locking the version of node: any change is a potential breaking change.

To your point, it usually works OK if we do not have such precision for major versions of language runtimes. If I tested an app with Python 3.9 and we upgrade the runtime to Python 3.11, no big deal, it's probably going to work fine, and so on for node apps.

At the risk of well-actuallying you, this is actually a very different case! Python minor versions are equivalent to Node major versions; basically, Python 3 is a different language than Python 2, and so I do not consider the leading 3 to be part of a semver version number. There are major breaking changes between "minor" versions of Python, including removing deprecated modules.

A fairer comparison would be between, say, Python 3.13 and Node 22: both are release lines that have lifecycle policies around long term support and maintenance, and which should have no breaking changes within that release line.

I think you highlight another important point: different products use different versioning schemes. Rather than spending time figuring out the reliability of each and every particular package, runtime, dependency, etc., it is simpler to follow one rule. What is the rule? Test what you deploy and deploy what you test, as far as can be practically done.

It sounds like you agree with this rule in general, with the exception of the Node runtime. DigitalOcean seems to have had some bad experience in the past with changes to the Node version; otherwise they would probably be up to date on Node. In other words, they probably disagree with you on Node.