geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Add better sanitation/cleaning post ontology build and push #199

Open kltm opened 4 years ago

kltm commented 4 years ago

Recently, we've had a rash of ontology build failures around errors like:

    00:15:30  ERROR: Failed to clean the workspace
    00:15:30  jenkins.util.io.CompositeIOException: Unable to delete '/var/lib/jenkins/workspace/eontology_pipeline_snapshot-OLCOSBORX7TUJKUSVDZQNZDGXCRFSUQLNESYCP3R63U6FLW5DJ2A@2'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts. (Discarded 244 additional exceptions)
    00:15:30    at jenkins.util.io.PathRemover.forceRemoveDirectoryContents(PathRemover.java:90)
    [...]
    00:15:30  ERROR: Error cloning remote repo 'origin'
    00:15:30  hudson.plugins.git.GitException: Failed to delete workspace
    00:15:30    at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$2.execute(CliGitAPIImpl.java:745)
    00:15:30    at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1132)
    00:15:30    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1177)
    00:15:30    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:125)
    00:15:30    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:93)
    00:15:30    at org.jenkinsci.plugins.workflow.steps.scm.SCMStep$StepExecutionImpl.run(SCMStep.java:80)
    00:15:30    at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
    [...]
    00:15:30    Suppressed: java.nio.file.FileSystemException: /var/lib/jenkins/workspace/eontology_pipeline_snapshot-OLCOSBORX7TUJKUSVDZQNZDGXCRFSUQLNESYCP3R63U6FLW5DJ2A@2/go-ontology/target/go.owl: Operation not permitted
    [...]
    00:15:30    Suppressed: java.nio.file.FileSystemException: /var/lib/jenkins/workspace/eontology_pipeline_snapshot-OLCOSBORX7TUJKUSVDZQNZDGXCRFSUQLNESYCP3R63U6FLW5DJ2A@2/go-ontology/target/external2go/wikipedia2go: Operation not permitted
    [...]

This is caused by the usual Jenkins/Docker permissions misalignment. Possible solutions are:

This kind of thing may indicate that a general in-Jenkins solution for #139 is not possible.

Also noting that it mostly seems to be having issues with the /target directory.
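To check which paths would block a workspace clean, one option is to list everything in the workspace that the current user does not own. This is only a minimal sketch, assuming the undeletable files are owned by a different user (for example root, written from inside the container); the function name `find_foreign_files` is hypothetical and not part of the pipeline.

```python
import os


def find_foreign_files(workspace, uid=None):
    """Return sorted paths under `workspace` whose owner differs from `uid`.

    `uid` defaults to the user running this script (assumed here to be the
    Jenkins agent user).
    """
    uid = os.getuid() if uid is None else uid
    foreign = []
    for root, dirs, files in os.walk(workspace):
        for name in dirs + files:
            path = os.path.join(root, name)
            # lstat inspects a symlink itself rather than its target.
            if os.lstat(path).st_uid != uid:
                foreign.append(path)
    return sorted(foreign)
```

Run as the Jenkins user against a failed workspace, an empty result would suggest that ownership is not the cause; a list concentrated under `go-ontology/target` would match the log above.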

Tagging @balhoff

balhoff commented 4 years ago

@kltm at some point I think I will need a lot more information to contribute here. :-) Right now I don't understand what this has to do with the ontology makefile. Does this result from the ontology build running as root? https://github.com/geneontology/pipeline/blob/dac947b73cfc9cfd39c832ac0699def334522bde/Jenkinsfile#L382

In my Phenoscape build I took that out and it works much better for me, having all the files created and owned by the Jenkins agent.

kltm commented 4 years ago

@balhoff No worries--I just wanted to loop you in as we had talked about it before. I think that option "3" there is the likely way forward and there would be nothing for you to do for that.

Good point about the permissions there. IIRC, the reason I have that is that, without the root mapping there, there is a different set of permission strangenesses that come from an unknown user in the image creating files on the external filesystem. It might be worth it to explore that again though before setting off on the more complicated solutions. Added to the list.
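For reference, the non-root variant being discussed amounts to running the container with the caller's uid and gid, so that files written to the mounted workspace are owned by the Jenkins agent. The following is a rough sketch only: the helper name, the `/work` mount point, and the image and command in the test are all hypothetical, and it builds the command line without running it. The `--rm`, `--user`, `-v`, and `-w` flags are standard `docker run` options.

```python
import os


def docker_run_as_caller(image, workspace, command, uid=None, gid=None):
    """Build a `docker run` argument list that runs as the calling user.

    `uid`/`gid` default to the current process (assumed to be the Jenkins
    agent), so files created in the mounted workspace stay deletable by it.
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    return [
        "docker", "run", "--rm",
        "--user", f"{uid}:{gid}",    # run as the caller instead of root
        "-v", f"{workspace}:/work",  # mount the workspace (hypothetical mount point)
        "-w", "/work",
        image,
        *command,
    ]
```

The uid passed this way may have no account inside the image, which is presumably where the "unknown user" strangeness comes from, so this would need testing against the actual build image.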

matentzn commented 4 years ago

This problem annoys me so much. I struggle a lot with this in Monarch as well. I will keep watching here to see what solutions are found :)

kltm commented 4 years ago

@matentzn There are a few things I want to try, guaranteed to work, but they all operate at different levels of irritating complexity, coordination, or permission. The easiest/simplest things I've been looking at recently have been to try and ensure that there is no information leakage (like the solr build images), but in this case it's a little harder to control. My hope is to find a command that tells Jenkins not to mount the default workspace for the agent when running--that would solve pretty much everything. Still looking...

kltm commented 4 years ago

Noting a recent rash of these for us.

kltm commented 4 years ago

Instance of this on the 2020-10-12 snapshot run (from the inability to fully clean on the 2020-10-11 run).

kltm commented 4 years ago

Instance of this on the 2020-10-24 snapshot run (likely from a 24hr+ runtime due to resource contention with other testing).

kltm commented 4 years ago

Instance of this on the 2020-10-30 snapshot run (unknown reason).

kltm commented 4 years ago

Instance of this on the 2020-11-17 snapshot run (likely from a series of unfortunate events, including an upstream EBI FTP failure).

kltm commented 3 years ago

Instance of this on the 2020-12-17 snapshot run. Noting that this is exactly a month from the last time. Also, a fix went in a few days ago (although there were two good builds in between).

kltm commented 3 years ago

Instance of this on the 2020-12-19 snapshot run. It then "healed" itself, which I think might be a first...

kltm commented 3 years ago

Instance of this on the 2021-01-09 snapshot run. Cleaning.

kltm commented 3 years ago

Instance of this on the 2021-01-13 snapshot run. Cleaning.

kltm commented 3 years ago

Instance of this on the 2021-01-15 snapshot run. Cleaning.

kltm commented 3 years ago

Occurrence on issue-noctua-models-170-zfin-import-test (tagging @sierra-moxon).

kltm commented 3 years ago

I suspect that there are no further interesting patterns here, so I'm going to stop listing failures.

I'm thinking that it may be better to treat this as a sub-issue of #139 if we can get a non-root docker setup to correctly handle permissions.