azavea / pfb-network-connectivity

PFB Bicycle Network Connectivity

Feature/bryan/terraform upgrade #896

Closed · BryanQuigley closed this 1 year ago

BryanQuigley commented 2 years ago

Overview

Upgrading Terraform lets us move to GitHub Actions, use modern AWS providers and examples, and generally makes it easier to make changes.

Notes

I was hoping to improve the Terraform config beyond what was required for the upgrade, but it looks like that will pull in many other dependencies (FARGATE/Docker version, etc.).

Note: every upgrade did recreate the container instances, causing a small bit of downtime :( I had a few theories as to why (not enough options specified in the config, or the use of md5 to shorten names), but none of them panned out as a fix.
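For what it's worth, a quick way to see exactly which attributes are forcing the recreation is to grep the plan output for Terraform's standard annotation (a sketch, assuming the default human-readable plan output):

```bash
# Surface the attribute changes Terraform flags as forcing replacement.
# Relies on the standard "# forces replacement" annotation in plan output.
./scripts/infra plan 2>&1 | grep -B 3 "forces replacement"
```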

Testing Instructions

Checklist

Half of #760

BryanQuigley commented 2 years ago

Just plan and apply is all I've been doing, with the same GIT_COMMIT that's on staging. It wants to recreate them all anyway: the plan is generally trying to set certain AWS defaults back to null, and even when I tried adding them explicitly, it still wanted to recreate them. Example: portMappings = [] -> null
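For reference, this is roughly the loop I've been running (a sketch; the commit hash is illustrative, and it assumes the script picks GIT_COMMIT up from the environment):

```bash
# Plan/apply against staging using the commit that's already deployed there.
export GIT_COMMIT=abc1234   # illustrative; use the commit currently on staging
./scripts/infra plan        # inspect what Terraform wants to recreate
./scripts/infra apply       # apply once the plan looks acceptable
```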

You may have to delete any Terraform folders or files in deployment/terraform if their permissions don't line up. (That's the error I just ran into, but yours is from a new git permissions feature. I'm trying to remember how I got around it.)
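If you do hit the ownership mismatch, something like this clears the locally cached Terraform files so the next run can re-create them with the right ownership (paths are illustrative; double-check before deleting):

```bash
# Remove root-owned Terraform artifacts left behind by the container.
sudo rm -rf deployment/terraform/.terraform
sudo rm -f deployment/terraform/.terraform.lock.hcl
```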

BryanQuigley commented 2 years ago

Maybe it is just a permissions error - afaict I did not change any local or in-project git config. I did run into that error though.

KlaasH commented 2 years ago

> Maybe it is just a permissions error - afaict I did not change any local or in-project git config. I did run into that error though.

How did you get around it? I didn't run plan on intermediate versions, so I don't know if it's only on the 1.1.9 container. But yeah, this is an unmodified container, so I don't see how it can be a local issue.

KlaasH commented 2 years ago

On the "every upgrade did recreate the container instances causing small bit of downtime" issue again: the way the batch job definitions get updated is kind of weird. infra plan runs the update_batch_definitions function, which always creates a new revision; that revision gets handed to terraform plan and set as the value of the PFB_AWS_BATCH_ANALYSIS_JOB_DEFINITION_NAME_REVISION variable in the task definition (see e.g. https://github.com/azavea/pfb-network-connectivity/blob/develop/deployment/terraform/task-definitions/app.json).

The comment in infra says "Only trigger this if the pfb-analysis container has changed", but there's no way to do that besides commenting out the update_batch_definitions call in the plan) section and replacing it with a hard-coded value (for the current task running on staging, that would be BATCH_ANALYSIS_JOB_NAME_REVISION="staging-pfb-analysis-run-job:761").
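Concretely, the temporary edit to the plan step looks something like this (a sketch; the surrounding script structure is assumed, and the revision number is the staging value quoted above):

```bash
# In the plan step of scripts/infra: skip the function that always registers a
# new Batch job revision and pin the revision already running on staging.
# update_batch_definitions
BATCH_ANALYSIS_JOB_NAME_REVISION="staging-pfb-analysis-run-job:761"
```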

But if I do that and run scripts/infra plan (from the host, where I'm not getting the crash mentioned above), it says "No changes. Your infrastructure matches the configuration." Whereas if I leave the update_batch_definitions step in, it says it needs to replace some of the service task definitions.

I have a vague memory that the way we're creating these new task revisions was influenced by there being limited Terraform support for Batch when we were first implementing it. But I didn't see any discussion of that in the issues about implementing it, so I could be off base. I also don't have a clear enough picture of exactly how it fits in to say whether we could bring it into Terraform now and maybe base it on the git commit ID to avoid unnecessary changes.

BryanQuigley commented 2 years ago

Yes, that's it: no changes are required now. So for the production upgrade I would need to:

  1. Comment out update_batch_definitions
  2. Export the production equivalent of BATCH_ANALYSIS_JOB_NAME_REVISION="staging-pfb-analysis-run-job:761"

Then we should be at no (or minimal) downtime IIUC.
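A rough sketch of those two steps (the production job definition name is my assumed production equivalent of the staging name, and the revision needs to be looked up first):

```bash
# Step 1: in scripts/infra, comment out the update_batch_definitions call in the plan step.
# Step 2: look up the current production revision and pin it before planning/applying.
aws batch describe-job-definitions \
    --job-definition-name production-pfb-analysis-run-job \
    --status ACTIVE \
    --query 'max(jobDefinitions[].revision)'
export BATCH_ANALYSIS_JOB_NAME_REVISION="production-pfb-analysis-run-job:<revision from above>"
./scripts/infra plan
./scripts/infra apply
```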

The last change I pushed to this branch fixes the git permissions/safety issue by having the terraform container run as the local user.
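For context, the general pattern is something like the following (a sketch only, not the exact change on the branch; the image tag and mount paths are illustrative):

```bash
# Run Terraform in a container as the host user so files written under the
# checkout (e.g. .terraform) keep the local user's ownership.
docker run --rm \
    --user "$(id -u):$(id -g)" \
    --volume "$(pwd):/usr/local/src" \
    --workdir /usr/local/src/deployment/terraform \
    hashicorp/terraform:1.1.9 plan
```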

BryanQuigley commented 2 years ago

It was echo-locator where I saw the permission/safe-directory issue before. In that project I worked around it by adding git config --global --add safe.directory /usr/local/src to the infra script itself.