geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Consider (and attempt) blessing snapshot runs to "release" status #352

Open kltm opened 5 months ago

kltm commented 5 months ago

Look at blessing snapshots to release, to:

No new libraries or technologies. The only "interesting" additions would likely be:

kltm commented 4 months ago

Noting that we already have a week-long holding pen for snapshots built in for debugging, during the "Publish" step. If I switch these over to auto-cleaning by bucket policy, they would give us a clean jump-off point for the manual publication we're already doing because of the Zenodo instability. This holding pen could be arbitrarily extended from a week to however long we want.

While this very much falls short of a full after-the-fact "blessing" system, it is very much in line with current practices, and I believe that by changing a few lines of the current manual release SOP, we could bring up a successful snapshot.

@pgaudet What are the minimum indicators you need to know whether a snapshot is worthwhile? Would you be able to look at the stats and, if they look okay, let me know so I could put it out on the experimental AmiGO for you to take a closer look? How would letting you know work? Could I just sign you up for all successful snapshot run emails, and you get back to me when the timing feels right? If this kind of thing might work for you, I think I have a fairly quick way forward:

kltm commented 4 months ago

7-day existence rule added; we should see results very soon.

kltm commented 4 months ago

The dailies now auto-clean. Moving forward, we can use these as a clean base, within a week, to create a release.
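For illustration, the auto-clean could be implemented as an S3 lifecycle rule applied once from a pipeline step. This is only a sketch: the bucket name, prefix, and the decision to manage the rule from Jenkins (rather than directly in AWS) are all assumptions, not the pipeline's actual configuration.

```groovy
// Hypothetical sketch: apply a 7-day expiry rule to the daily
// holding-pen prefix. Bucket and prefix names are invented.
stage('Apply holding-pen lifecycle') {
    steps {
        sh '''
            aws s3api put-bucket-lifecycle-configuration \
              --bucket go-pipeline-holding-pen \
              --lifecycle-configuration '{
                "Rules": [{
                  "ID": "expire-dailies-after-7-days",
                  "Filter": {"Prefix": "daily/"},
                  "Status": "Enabled",
                  "Expiration": {"Days": 7}
                }]
              }'
        '''
    }
}
```

With a rule like this in place, anything under the prefix is deleted by S3 itself seven days after creation, so no cleanup step needs to run in the pipeline.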

pgaudet commented 3 months ago

@kltm

What are the minimum indicators you need to know whether a snapshot is worthwhile? Would you be able to look at the stats and, if they look okay, let me know so I could put it out on the experimental AmiGO for you to take a closer look? How would letting you know work? Could I just sign you up for all successful snapshot run emails, and you get back to me when the timing feels right?

The same procedure as we have now for the release seems appropriate:

  1. I get a notification that a release/snapshot is ready to be checked. Note that having the data on some experimental AmiGO is required for the checks to be carried out.
  2. I look at the stats, and if all is OK, I notify you. Right now this communication is by email; we can change that if needed.

Does that answer all the questions?

Thanks, Pascale

kltm commented 3 months ago

Talking to @pgaudet this morning: until we've run through this a couple of times to work out the kinks (or have machinery that gets us back to where we were), we'll:

kltm commented 3 months ago

Okay, after a little consideration, I think I have some "easy" ways forward, although any one might take a day or so to put together. Essentially, the issue is a bad Docker/Jenkins interaction. I can now see a few ways to bypass it:

  1. Break the pipeline into two pieces, pre-index and post-index, and do the middle part (essentially) manually. While labor-intensive, this is nearly guaranteed to be tractable.
  2. Set a pipeline (snapshot) to use a single standing Docker instance to build the index. The possible issue here is that "remote controlling" Docker may be a big PITA, but we'd bypass the interaction bug and keep full automation.
  3. Break the Solr load into smaller pieces that individually shouldn't have the footprint to stop things. I think this would likely work, but it would be slow to test.

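As a rough sketch, option 3 could turn the single large load into a loop of smaller ones. The batch names and the loader invocation below are assumptions for illustration, not the real pipeline's commands.

```groovy
// Hypothetical sketch of option 3: load Solr in smaller batches so no
// single step has the footprint that triggers the Docker/Jenkins
// interaction bug. Batch names and the loader script are invented.
def batches = ['ontology', 'annotations-part-1', 'annotations-part-2', 'models']

stage('Load Solr in pieces') {
    steps {
        script {
            for (b in batches) {
                // Each piece is its own sh step, so a failure is
                // isolated to one batch rather than the whole load.
                sh "./load-solr.sh --batch ${b}"
            }
        }
    }
}
```
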
kltm commented 3 months ago

Actually, poking around in this, I think I'm going to try something else first:

  1. "Catch" the error, wait, and then continue. I'm going to take a look at the Jenkins docs, but IIRC this is supported.

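Jenkins does support this pattern inside a `script` block with `try`/`catch` (and the built-in `sleep` step). A minimal sketch, where the failing step is a stand-in shell command rather than the pipeline's real load:

```groovy
// Hypothetical sketch: catch the failure, wait, then carry on.
stage('Load Solr') {
    steps {
        script {
            try {
                sh './load-solr.sh'
            } catch (err) {
                echo "Load failed once (${err}); waiting before retrying"
                sleep time: 5, unit: 'MINUTES'
                // Pick up again rather than failing the whole run.
                sh './load-solr.sh'
            }
        }
    }
}
```

A `retry(n) { ... }` wrapper is the other obvious shape for this, if a fixed number of re-attempts is acceptable.
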
kltm commented 3 months ago

Also, clarifying "3": to make this work, the whole image would have to be dropped and stood back up. If we go that way, there will be some temporary repetition, and we may have to introduce a template function to bypass the string limit we would almost immediately smack into.
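If the drop-and-restand has to be repeated per piece, the repetition could be factored into a Groovy function, which is the usual workaround for Jenkins' "method code too large" limit on generated pipeline code. All names below are hypothetical:

```groovy
// Hypothetical helper: drop and re-stand the Solr image, then load one
// piece. Factoring this out avoids repeating the block per piece and
// keeps the pipeline script under Jenkins' method size limit.
def reloadAndLoad(String piece) {
    sh 'docker rm -f golr || true'
    sh 'docker run -d --name golr our-golr-image'
    sh "./load-solr.sh --batch ${piece}"
}
```

Each piece in the pipeline then becomes a one-line call, e.g. `reloadAndLoad('annotations-part-1')`.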

kltm commented 3 months ago

Looking at the failure messages, and understanding that this is happening at a stage level (not a step level), I think I can change tack a little. I've created a new pipeline, snapshot-post-fail; it has the following properties:

I believe this should allow me to "hijack" the snapshot run with the new pipeline, picking up where the failed (but data-wise sound) run terminated.
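Abstractly, the hijack could look like the following. This is an illustrative shape only; the stage names, bucket paths, and scripts are invented, not taken from the actual snapshot-post-fail Jenkinsfile.

```groovy
// Hypothetical sketch of the "hijack": adopt the artifacts the failed
// (but data-wise sound) run already produced, then run only the
// post-failure stages.
pipeline {
    agent any
    stages {
        stage('Adopt snapshot data') {
            steps {
                // Pull down what the failed run already built.
                sh 'aws s3 sync s3://go-data/snapshot-in-progress/ ./work/'
            }
        }
        stage('Resume post-index steps') {
            steps {
                sh './publish.sh ./work/'
            }
        }
    }
}
```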

kltm commented 3 months ago

https://github.com/geneontology/pipeline/blob/snapshot-post-fail/Jenkinsfile

kltm commented 3 months ago

Cheers to @dustine32 for helping me out with a code review. Issues that I'll fix before proceeding:

kltm commented 3 months ago

@pgaudet I believe a snapshot has now gone through, using the modified pipeline. Would you be able to briefly review it? If it seems solid, we can either 1) attempt the new "promotion" procedure, where we try to take a snapshot and make it a release, or 2) do the same thing we did here for release, giving us a very high probability of success.

kltm commented 3 months ago

Noting that I'm now working toward something between the two options above. Essentially, I will take the release pipeline, remove the first part of it, and replace it with a "copy from snapshot" step. We can refine this model and the timing, but it's a huge improvement over what we have now (nothing). (@dustine32 I'll be hunting after you in the next day or so for a review of that change and a sanity check.)
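The "copy from snapshot" replacement could be as small as a single stage. A minimal sketch, assuming S3-to-S3 sync and invented bucket paths; the real source, destination, and any metadata rewriting would differ:

```groovy
// Hypothetical sketch: the release pipeline's build stages replaced by
// a straight copy from an already-vetted snapshot. Paths are invented.
stage('Copy from snapshot') {
    steps {
        sh 'aws s3 sync s3://go-data/snapshot/ s3://go-data/release/ --delete'
    }
}
```

The rest of the release pipeline (publication, notification, etc.) would then run unchanged against the copied data.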