geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Automated (?) process periodically wipes skyhook home directory, requiring rebuild (was "Key / ssh error in pipeline...") #350

Closed: kltm closed this issue 6 months ago

kltm commented 6 months ago

As of Jan 1st, summary emails are no longer being sent; the job now fails with an error like:

18:29:52  + sshfs -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o IdentityFile=**** -o idmap=user skyhook@skyhook.berkeleybop.org:/home/skyhook /var/lib/jenkins/workspace/ssue-go-site-1530-summary-emails/mnt/
18:29:52  read: Connection reset by peer

Given the timing, my gut guess is that the key "expired" or something, as this has run like clockwork until now. That said, before digging in, I don't think we need to mount, right? What is that section?
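
(A quick way to separate "key expired" from a server-side problem is to try the key by hand from the Jenkins host; a minimal sketch, with the key path made up since the log masks it:

# Try the same key outside the pipeline; ssh exits 255 on auth/connection failure.
ssh -o StrictHostKeyChecking=no -o IdentitiesOnly=true \
    -i /var/lib/jenkins/.ssh/skyhook-key \
    skyhook@skyhook.berkeleybop.org true
echo "exit: $?"

A 255 with "Permission denied" points at the key or authorized_keys; a clean exit 0 points back at the sshfs mount step itself.)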

kltm commented 6 months ago

Some kind of "key problem"; now failing with:

21:25:19  + scp -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o IdentityFile=**** ont-title.txt skyhook@skyhook.berkeleybop.org:/home/skyhook/issue-go-site-1530-summary-emails/reports/
21:25:19  /var/lib/jenkins/.ssh/config line 3: Unsupported option "rsaauthentication"
21:25:19  Permission denied, please try again.
21:25:19  Permission denied, please try again.
21:25:19  skyhook@skyhook.berkeleybop.org: Permission denied (publickey,password).
21:25:19  lost connection

...or maybe there was a quiet ssh update?
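
(The "Unsupported option" warning is at least consistent with a quiet OpenSSH upgrade: RSAAuthentication was a protocol-1 option dropped along with SSH1 support in OpenSSH 7.6, so newer clients flag it in the config. On its own it's just a warning; cleaning it up is one line. A sketch, assuming GNU sed on the Jenkins host:

# Drop the obsolete option flagged at line 3 of the config in the log above.
sed -i '/RSAAuthentication/Id' /var/lib/jenkins/.ssh/config

The "Permission denied (publickey,password)" lines are a separate, server-side rejection and would survive this cleanup.)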

kltm commented 6 months ago

Technically, emails can be sent again (by removing anything that was having trouble); that said, I'm keeping this open until I can track down what changed and restore the report-saving steps I removed.

kltm commented 6 months ago

Okay, this is affecting all pipelines.

kltm commented 6 months ago

Okay, I've tracked down the issue and it is not what I was expecting. Basically, some process has /wiped/ skyhook's home directory. This is either a manual error or one of the pipelines is set up incorrectly and is taking a swing at everything.

I think we reported this somewhere before, but I can't find the ticket. At the time I assumed a "manual" error; this time, given the timing, I'm fairly sure it's an issue in a Jenkinsfile.

Okay, my notes have the previous wipe on June 1st. That is sus: both wipes hit on the first of a month. I'm going to rebuild skyhook and then start tracking files against the crontabs.

Rebuilding skyhook.
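
(Since both wipes land right at the start of a month, a cheap tripwire while tracking this down would be a cron job on the skyhook host that snapshots the directory listing somewhere outside /home/skyhook; a sketch, with made-up paths:

# m h dom mon dow  command -- snapshot the home dir every 10 minutes
*/10 * * * * ls -la /home/skyhook > /var/tmp/skyhook-manifest-$(date +\%Y\%m\%d-\%H\%M).log 2>&1

The timestamp of the last non-empty manifest brackets the wipe, which can then be correlated against Jenkins build logs.)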

kltm commented 6 months ago

I now have SOP notes for recovering the skyhook user/directory. For various TMI reasons, I'm going to keep those private for the moment. The machine has all recovery mechanisms chugging along again; hopefully no more manual steps will be needed while things reset. Next: find the cause.

kltm commented 6 months ago

Nothing found in the crontabs. Pipelines that have run or tried to run recently:

- go-ontology-dev
- issue-35-neo-test
- full-issue-325-gopreprocess
- goa-copy-to-mirror
- snapshot
- issue-go-site-1530-summary-emails
- release

That's irritating, as these run regularly with no issue.

kltm commented 6 months ago

Timing-wise, that leaves some questions. Looking at go-ontology-dev: it succeeded at Dec 31, 2023, 4:00 PM and failed with the "wiped" errors at Jan 1, 2024, 12:00 AM (technically, 00:01:06). Just before that, we have an insta-fail on release with:

ERROR: Failed to clean the workspace
jenkins.util.io.CompositeIOException: Unable to delete '/var/lib/jenkins/workspace/neontology_pipeline_release-L3OLSRDNGI3ZIUODKFYUI4AO45X5C6RUGMOQAC5WV2Q6ZQOIFHMA'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.

Note that this is before any stage runs; it's failing on the checkout attempt. Hm.
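
(One guess worth checking when workspace cleanup fails like this: a stale sshfs/FUSE mount left behind by the earlier "Connection reset by peer" would make the directory undeletable from Jenkins' point of view. A sketch for the Jenkins host; the workspace path is illustrative:

# Look for dead FUSE mounts under the workspace, then detach them.
mount | grep fuse.sshfs
fusermount -u /var/lib/jenkins/workspace/<job>/mnt

If a dead mount shows up there, unmounting it should let the cleanup proceed.)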

kltm commented 6 months ago

Okay, I have a theory.

Looking at the function

// Reset and initialize skyhook base.
void initialize() {
    // Get a mount point ready
[..]
    sh 'rm -r -f $WORKSPACE/mnt/$BRANCH_NAME || true'
[...]

What would happen if, somehow, $BRANCH_NAME was not defined? Since skyhook is sshfs-mounted at $WORKSPACE/mnt, the rm would then target the mount point itself and scour skyhook's home directory. That should not be possible... but this is the only place where an "unprotected" delete like that occurs.

My theory: the pipeline managed to "run" just enough to fail (mechanism unknown); enough code was in place for an alternate thread (magic) to reach initialize(), but not enough for $BRANCH_NAME to be defined (let's posit that magic too). If that happened, skyhook would get toasted.
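
(If that theory holds, a belt-and-braces guard on the delete would prevent a recurrence regardless of mechanism; a sketch against the snippet above, not the actual fix that landed:

// Refuse the delete when BRANCH_NAME is empty, so a half-initialized run
// can never rm the mount point itself (and, through sshfs, skyhook's home).
sh '''
if [ -z "$BRANCH_NAME" ]; then
    echo "BRANCH_NAME is unset; refusing to delete mount directory" >&2
    exit 1
fi
rm -r -f "$WORKSPACE/mnt/$BRANCH_NAME" || true
'''

With an empty $BRANCH_NAME, the original line expands to rm -r -f $WORKSPACE/mnt/, i.e. everything under the sshfs mount.)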

kltm commented 6 months ago

Testing on master now.

kltm commented 6 months ago

Passed. Now propagating.