Closed BryanQuigley closed 1 year ago
Just `plan` and `apply` is all I've been doing, with the same `GIT_COMMIT` as is on staging. It wants to recreate them all anyway. Generally it's trying to set certain AWS defaults back to null, but even when I tried adding those values explicitly, it still wanted to recreate them.
Example:
- `portMappings = [] -> null`
You may have to delete any terraform folders or files in deployment/terraform if their permissions don't line up. (That's the error I just ran into, but yours is a new git permissions feature. Trying to remember how I got around it.)
Maybe it is just a permissions error - afaict I did not change any local or in-project git config. I did run into that error though.
How did you get around it? I didn't run `plan` on intermediate versions, so I don't know if it's only on the 1.1.9 container. But yeah, this is an unmodified container, so I don't see how it can be a local issue.
On the "every upgrade did recreate the container instances causing a small bit of downtime" issue again: the way the batch job definitions get updated is kind of weird. `infra plan` runs the `update_batch_definitions` function, which always creates a new revision, which gets handed to `terraform plan` and set as the `PFB_AWS_BATCH_ANALYSIS_JOB_DEFINITION_NAME_REVISION` variable value in the task definition (see e.g. https://github.com/azavea/pfb-network-connectivity/blob/develop/deployment/terraform/task-definitions/app.json).
The comment in `infra` says "Only trigger this if the pfb-analysis container has changed", but there's no way to do that besides commenting out the `update_batch_definitions` call in the `plan` section and replacing it with a hard-coded value (for the current task running on staging, that would be `BATCH_ANALYSIS_JOB_NAME_REVISION="staging-pfb-analysis-run-job:761"`).
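Concretely, that workaround in the `plan` section might look something like this (a sketch only; the exact shape of the real infra script may differ):

```shell
# Sketch of the workaround described above, not the exact infra script:
# skip the function that always bumps the revision, and pin the value
# Terraform sees to the revision currently running on staging.
# update_batch_definitions   # disabled: always creates a new revision
BATCH_ANALYSIS_JOB_NAME_REVISION="staging-pfb-analysis-run-job:761"
export BATCH_ANALYSIS_JOB_NAME_REVISION
echo "$BATCH_ANALYSIS_JOB_NAME_REVISION"
```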
But if I do that and run `scripts/infra plan` (from the host, where I'm not getting the crash mentioned above), it says "No changes. Your infrastructure matches the configuration." Whereas if I leave the `update_batch_definitions` step in, it says it needs to replace some of the service task definitions.
I have a vague memory that the way we're creating these new task revisions was influenced by limited Terraform support for Batch when we were first implementing it. But I didn't see any discussion of that in the issues about implementing it, so I could be off base. I also don't have a clear enough picture of exactly how it fits in to judge whether we could bring it into Terraform now, maybe basing it on the git commit ID to avoid unnecessary changes.
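If this were keyed on the git commit ID, one hedged sketch (all names and the comparison logic here are assumptions, not the project's actual infra code) might be:

```shell
# Hypothetical sketch: only register a new Batch job definition revision
# when the commit driving the pfb-analysis container has changed since
# the last registered revision. How the deployed commit is recorded
# (e.g. in the job definition's tags) is an assumption.
needs_new_revision() {
    current_commit="$1"
    deployed_commit="$2"
    [ "$current_commit" != "$deployed_commit" ]
}

if needs_new_revision "abc1234" "abc1234"; then
    echo "would run update_batch_definitions"
else
    echo "reusing existing revision"
fi
```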
Yes, that's it: no changes required now. So for the production upgrade I would need to do the same.
Then we should be at no (or minimal) downtime IIUC.
The last change I pushed to this branch fixes the git permissions/safety issue by having the terraform container run as whoever the local user is.
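A common way to do that (a sketch of the general technique, not necessarily how this branch implements it) is to pass the host user's uid/gid to the container:

```shell
# Run the terraform container as the invoking host user so that files
# it writes under the mounted repo keep the caller's ownership.
# The image tag and mount path here are illustrative.
docker run --rm \
    --user "$(id -u):$(id -g)" \
    -v "$(pwd):/usr/local/src" \
    -w /usr/local/src \
    hashicorp/terraform:1.1.9 plan
```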
It was Echo Locator where I saw the permissions/safe-directory issue before. In that project I worked around it by adding `git config --global --add safe.directory /usr/local/src` to the infra script itself.
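For reference, that workaround amounts to a single line near the top of the infra script (`/usr/local/src` being the in-container checkout path used in that project):

```shell
# Work around git's safe.directory check, which refuses to operate on a
# repository owned by a different user than the one running git (common
# when a container user differs from the host user who cloned the repo).
git config --global --add safe.directory /usr/local/src
```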
Overview
Upgrading Terraform lets us move to GitHub Actions, use modern AWS providers/examples, and generally makes it easier to make changes.
Notes
I was hoping to improve the Terraform setup beyond what's required for the upgrade, but it looks like that would pull in many other dependencies (FARGATE/Docker version).
Note: every upgrade did recreate the container instances, causing a small bit of downtime :( I had a few theories as to why (not enough options specified in the config, or the use of md5 to shorten names), but none of them panned out as a fix.
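For context, the md5-based name shortening mentioned above looks roughly like this (a hypothetical illustration; the project's actual naming scheme and truncation length may differ):

```shell
# Hash a long resource name and keep the first 8 hex characters as a
# short, deterministic stand-in. Because md5 of the same input is
# stable, this alone shouldn't force recreation on its own.
full_name="staging-pfb-analysis-run-job"
short_suffix=$(printf '%s' "$full_name" | md5sum | cut -c1-8)
echo "pfb-${short_suffix}"   # same input always yields the same suffix
```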
Testing Instructions
Checklist
Half of #760