geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

In some cases, the pipeline fails due to container not stopping fast enough #316

Open kltm opened 1 year ago

kltm commented 1 year ago

Recently, we've been running into a lot of failures where, when shutting down a Docker container, the pipeline terminates with an error like:

java.io.IOException: Failed to kill container '34b9828b65b4fafd9e7753e6b361a0b1ceb31bef241ab43bd678c6d5e58e3049'.

I suspect that in high-memory or high-load scenarios, the gap between the SIGTERM and SIGKILL signals is not long enough for the container to exit (https://docs.docker.com/engine/reference/commandline/stop/). The gap seems to be hardwired to one second pretty deep in the plugin: https://github.com/jenkinsci/docker-workflow-plugin/blob/d5d2e5c4007f7ea006152542b2bcbe0f1b2b08aa/src/main/java/org/jenkinsci/plugins/docker/workflow/client/DockerClient.java#L185
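
For illustration only, here is a minimal Python sketch of what a longer grace period looks like when calling `docker stop` directly. It does not change the plugin's behavior. The function names and the 30-second default are assumptions made for this example; the only real interface used is the `-t` option of `docker stop`, which sets how many seconds Docker waits after SIGTERM before sending SIGKILL.

```python
import subprocess


def build_stop_command(container_id: str, grace_seconds: int = 30) -> list[str]:
    """Build a `docker stop` argv with an explicit SIGTERM-to-SIGKILL gap.

    Hypothetical helper: the name and the 30-second default are assumptions.
    """
    if grace_seconds < 0:
        raise ValueError("grace_seconds must be non-negative")
    # `-t` is the number of seconds Docker waits before killing the container.
    return ["docker", "stop", "-t", str(grace_seconds), container_id]


def stop_container(container_id: str, grace_seconds: int = 30) -> bool:
    """Run `docker stop` and report whether the CLI exited with status 0."""
    cmd = build_stop_command(container_id, grace_seconds)
    try:
        # Give the CLI call some slack beyond the grace period itself.
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=grace_seconds + 30
        )
    except (subprocess.TimeoutExpired, OSError):
        # Covers a hung CLI call or a missing `docker` binary.
        return False
    return result.returncode == 0
```

Something like `stop_container("34b9828b65b4...", grace_seconds=60)` would give a busy container up to a minute to exit cleanly, assuming the `docker` CLI is on the PATH of the machine running it.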

kltm commented 1 year ago

I don't believe there is anything we can easily do about this for the moment, except lean into the restarts and try to keep the machines at lower use when we need to get things through.

kltm commented 1 year ago

We've lost two recent snapshot builds to this again.

kltm commented 9 months ago

Feels like it's happening more often these days. I haven't crunched the numbers, but I am still recording occurrences in my notes.