I found two errors, both in the garnet CI script `ci.yml`, whose "docker cleanup" step does three things:
a) delete the docker container that was used to do the CI test;
b) look for other old/stale containers and delete them as well, by invoking a script docker-clean.sh;
c) delete all unused docker images older than 24 hours.
The first error was in step (c), which is supposed to delete unused docker images. Previously, it did this with the command `docker image prune --force`. The `--force` arg was supposed to prevent the command from asking "do you really want to do this?" Unfortunately, it had the side effect of deleting all docker images whether or not someone was actively using the image (e.g. for an aha full regression).
Also, the "older than 24 hours" filter is not very useful, since it applies to the time the image was built, not the time it was downloaded. Many of the images that we use are days or even weeks old.
So I removed the `--force` option and instead used the `yes` command to answer the prompt. I also raised the `until` threshold from 24 to 72 hours, even though that does not really help much.
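As a rough sketch of what the revised prune step could look like (not the actual `ci.yml` contents): the helper name `prune_cmd` and the variable `PRUNE_AGE_HOURS` are made up for illustration, and the sketch assumes the age limit is passed through the standard `--filter until=...` option of `docker image prune`.

```shell
#!/bin/bash
# Hypothetical sketch only; the real ci.yml step may be written differently.

PRUNE_AGE_HOURS=72   # assumed variable name; the limit was 24 before this change

# Build the prune command as a string so it can be inspected before running.
# Note there is no --force: the confirmation prompt is left in place.
prune_cmd() {
    echo "docker image prune --filter until=${1}h"
}

# Show the command that would run.
prune_cmd "$PRUNE_AGE_HOURS"

# To actually run it, answer the prompt by piping `yes` into the command:
#   yes | docker image prune --filter "until=${PRUNE_AGE_HOURS}h"
```

Printing the command first is just a convenience for this example; in a CI step the `yes | docker image prune ...` line would be run directly.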
The second error relates to the `docker-clean.sh` script from step (b) above, which was killing all containers more than 4 hours old. Since the containers in question include aha full regressions that take as much as 19 hours to complete, this was obviously a bad idea. So now the `docker-clean.sh` script waits at least 5 days before deciding that a container needs deleting.
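The age check can be illustrated with a small sketch. This is not the code in `docker-clean.sh`; the function `is_stale` and the variables `MAX_AGE_DAYS` and `MAX_AGE_SECONDS` are hypothetical, and the sketch assumes container creation times are available as Unix epoch seconds.

```shell
#!/bin/bash
# Hypothetical sketch of a 5-day staleness check; not the actual docker-clean.sh.

MAX_AGE_DAYS=5
MAX_AGE_SECONDS=$(( MAX_AGE_DAYS * 24 * 60 * 60 ))

# is_stale CREATED_EPOCH NOW_EPOCH
# Succeeds (exit status 0) only if the container is older than the limit.
is_stale() {
    local created=$1 now=$2
    [ $(( now - created )) -gt "$MAX_AGE_SECONDS" ]
}

# Example with a fixed "now" so the result is reproducible.
now=1700000000
nineteen_hours_ago=$(( now - 19 * 3600 ))   # e.g. a long-running full regression
six_days_ago=$(( now - 6 * 86400 ))         # e.g. a forgotten container

is_stale "$nineteen_hours_ago" "$now" && echo "19h container: delete" || echo "19h container: keep"
is_stale "$six_days_ago" "$now" && echo "6d container: delete" || echo "6d container: keep"
```

With the old 4-hour limit, the 19-hour regression container in this example would have been deleted mid-run; with a 5-day limit it is kept.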
These changes are designed to address the aha regression-failure problem described in aha issue 1959 https://github.com/StanfordAHA/aha/issues/1959.