Overview

The functional change here is increasing the memory allocation for the analysis Batch task to use the full RAM available on the i3.2xlarge instance we're using in the Batch compute environment. I had previously concluded that the jobs were only using about half of the 32GB we had allocated, but the biggest jobs have been crashing recently and it's possible memory is to blame. And there's no reason not to use all the memory on the instance.
I'm opening this as a draft PR because I already deployed the job definition update to production. What's left are the changes I made while spinning my dev instance back up and doing a limited deploy (I reused the containers, so only the services with the job definition ID embedded in their parameters needed to change).
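For context, the knob in question lives in the Batch job definition's container properties. The fragment below is purely illustrative (the values are placeholders, not the ones from this repo); note that Batch memory is specified in MiB, and the ECS agent reserves some instance RAM, so "full RAM" in practice means a bit less than the instance total:

```json
{
  "containerProperties": {
    "vcpus": 8,
    "memory": 60000
  }
}
```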
Here's what else is going on here:
There are separate job definition templates for staging and production, and historically staging used smaller instances and lower resource allocations. That probably made sense early in development, but at this point staging should mirror production as closely as we can manage, so I updated all the staging parameters to match the production ones.
I added an 'update-tfvars' subcommand to infra, just for convenience.
The base image for the angularjs and tilegarden containers is old enough that its package directory no longer exists in the main Debian repo. For tilegarden, I was able to update the base image from stretch to buster and it worked fine. The angularjs container uses an older version of Node, and the Grunt build failed with the oldest Node version I could get with buster, so I left it on stretch and updated the apt repo to use http://archive.debian.org/debian.
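As a rough sketch, the apt change for the stretch-based angularjs image looks something like the function below. This is an assumption-laden illustration: the function name is hypothetical, and dropping the stretch-updates suite (which archive.debian.org doesn't carry) is my addition; only the archive URL comes from this PR.

```shell
# Hypothetical helper: rewrite an apt sources file to pull an EOL Debian
# release from the archive, and drop the stretch-updates suite, which the
# archive does not serve.
point_apt_at_archive() {
  local sources="$1"
  sed -i \
    -e 's|http://deb.debian.org/debian|http://archive.debian.org/debian|g' \
    -e '/stretch-updates/d' \
    "$sources"
}

# In the Dockerfile this would be followed by something like:
#   apt-get -o Acquire::Check-Valid-Until=false update
# since the archived Release files are past their Valid-Until dates.
```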
Along the same lines (things being old): the Bower registry is still plugging along, but bower install started throwing "certificate expired" errors. I think the certificate itself is fine; the problem is that it chains to a root certificate that's too new for the container's trust store to include. But I didn't dig into it much, just worked around it as recommended here.
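I haven't re-verified which fix the linked recommendation uses; one common workaround for this class of error (an assumption here, not necessarily the one applied) is to disable strict SSL checking in the project's .bowerrc:

```json
{
  "strict-ssl": false
}
```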
Resolves #940 (hopefully)
Testing Instructions
When the time comes, this should be tested by:
Spinning up a dev instance and making sure the front end, API, and analysis (run by hand from the command logged by the API) work as usual.
Doing a staging deploy to make sure the provisioning parts are working.