Closed kltm closed 6 months ago
Some kind of "key problem"; now failing with:
21:25:19 + scp -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o IdentityFile=**** ont-title.txt skyhook@skyhook.berkeleybop.org:/home/skyhook/issue-go-site-1530-summary-emails/reports/
21:25:19 /var/lib/jenkins/.ssh/config line 3: Unsupported option "rsaauthentication"
21:25:19 Permission denied, please try again.
21:25:19 Permission denied, please try again.
21:25:19 skyhook@skyhook.berkeleybop.org: Permission denied (publickey,password).
21:25:19 lost connection
...or maybe there was a quiet ssh update?
Technically, emails can be sent again (by removing anything that was having trouble); that said, I'm keeping this open until I can track down what changed and revert the reporting saves.
Okay, affecting all pipelines.
Okay, I've tracked the issue and it is not what I was expecting. Basically, some process has /wiped/ skyhook's home directory. This is either a manual error or one of the pipelines is setup incorrectly and is taking a swing at everything.
I think we reported this somewhere before, but I can't find the ticket. I think at the time I assumed a "manual" error; this time, given the timing, I'm fairly sure it's an issue in a Jenkinsfile.
Okay, my notes have it at 6 months ago on June 1st. That is sus. I'm going to rebuild skyhook and then start tracking files by their crontab.
Rebuilding skyhook.
I now have SOP notes for recovering the skyhook user/directory. For various TMI reasons, I'm going to keep those private for the moment. The machine has all recovery mechanisms chugging along; hopefully no more manual steps needed while resetting. Next: find the cause.
Nothing found in crontabs.
Pipelines that have run or tried to run recently:
go-ontology-dev
issue-35-neo-test
full-issue-325-gopreprocess
goa-copy-to-mirror
snapshot
issue-go-site-1530-summary-emails
release
...that's irritating as these run regularly with no issue.
Timing-wise, that leaves some questions.
Looking at go-ontology-dev
, it was successful with (Dec 31, 2023, 4:00 PM and failed with the "wiped" errors at (Jan 1, 2024, 12:00 AM). Technically speaking, 00:01:06 AM.
Just before that, we have an insta-fail on release
with
ERROR: Failed to clean the workspace
jenkins.util.io.CompositeIOException: Unable to delete '/var/lib/jenkins/workspace/neontology_pipeline_release-L3OLSRDNGI3ZIUODKFYUI4AO45X5C6RUGMOQAC5WV2Q6ZQOIFHMA'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
Note that this is before any stage. It's failing on the checkout attempt. Hm.
Okay, I have a theory.
Looking at the function
// Reset and initialize skyhook base.
void initialize() {
// Get a mount point ready
[..]
sh 'rm -r -f $WORKSPACE/mnt/$BRANCH_NAME || true'
[...]
What would happen if, somehow, $BRANCH_NAME was not defined. Somehow. This would have the effect of scouring skyhook. That should not be possible...but it is the only place where an "unprotected" delete occurs like that.
My theory is that the pipeline still managed to "run" enough to fail (unknown mechanism) but, since the pipeline had not run enough to define $BRANCH_NAME (let's posit that magic), but just enough to have code in place that an alternate thread (magic) managed to get to initialize()
; if that happened, skyhook would get toasted.
Testing on master
now.
Passed. Now propagating.
As of Jan 1st, summary emails are no longer sent on an error like:
Given the timing, my gut guess is that the key "expired" or something, as this has run like clockwork until now. That said, before digging in, I don't think we need to mount, right? What is that section?