Upload MGI upstream / "silver" to mirror.geneontology.io, with new filename, and point metadata to it

kltm commented 3 months ago

Currently, the build process depends on quirks of skyhook. To make this generally usable, we want to upload the MGI upstream file we produce to a stable location (mirror.geneontology.io), with the new filename, and point metadata to it.

From original https://github.com/geneontology/gopreprocess/issues/65

[x] change the name to mgi-p2go-homology.gaf
[x] push to the bucket; we only have to worry about updates, no deletes are necessary. The file is not versioned.
[x] rename silver-issue-325-gopreprocess pipeline to something more intuitive
[x] automate the upload to the S3 bucket

See the go-copy-to-mirror pipeline branch to find the S3 bucket.

kltm commented 2 months ago

From @sierra-moxon

this is the current "upstream" for MGI: http://skyhook.berkeleybop.org/silver-issue-325-gopreprocess/products/upstream_and_raw_data/preprocess_raw_files/mgi-merged.gaf

kltm commented 2 months ago

Now available at:
https://mirror.geneontology.io/mgi-p2go-homology.gaf
https://mirror.geneontology.io/mgi-p2go-homology.gaf.gz

kltm commented 2 months ago

go-site metadata updated in mgi.yaml.

sierra-moxon commented 2 months ago

I made a new branch off of the silver-issue-325-gopreprocess pipeline branch called: p2go-homology-upstream-file-generator. This new branch adds a step to include two new subdirectories and a copy of the final GAF file from the upstreams code base to s3://go-mirror/:

p2go-homology-upstream-file-generator/preprocess_raw_files/
p2go-homology-upstream-file-generator/preprocessed_GAF_output/
At the root level, s3://go-mirror/mgi-p2go-homology.gaf.gz is added or overwritten on every successful run of this pipeline branch. This is now the MGI upstream. Seth already changed the go-site metadata to reflect the new name and path.

These capture the incremental output of the upstreams code as well as the final GAF file. Each command in the new pipeline branch overwrites the last run's files in the paths above. I looked a tiny bit into versioning; @kltm - do we need to keep versions of this file or the pipeline outputs?
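For reference, the layout described above can be sketched as a small upload plan. This is only an illustration under assumptions, not the actual pipeline code: the function names (`plan_uploads`, `as_cli_commands`) and the local paths are hypothetical, while the bucket, prefix, subdirectories, and final filename are the ones named in this comment. It only builds `aws s3 cp` command lines and does not run them.

```python
from pathlib import PurePosixPath

# Names taken from this thread.
BUCKET = "go-mirror"
PREFIX = "p2go-homology-upstream-file-generator"
FINAL_NAME = "mgi-p2go-homology.gaf.gz"


def plan_uploads(raw_files, gaf_outputs, final_gaf):
    """Return (local_path, s3_uri) pairs for one pipeline run.

    Hypothetical helper: keys are fixed per file name, so each run
    overwrites the previous run's objects (no versioning, no deletes).
    """
    plan = []
    for local in raw_files:
        name = PurePosixPath(local).name
        plan.append((local, f"s3://{BUCKET}/{PREFIX}/preprocess_raw_files/{name}"))
    for local in gaf_outputs:
        name = PurePosixPath(local).name
        plan.append((local, f"s3://{BUCKET}/{PREFIX}/preprocessed_GAF_output/{name}"))
    # The final GAF goes to the bucket root under the stable name.
    plan.append((final_gaf, f"s3://{BUCKET}/{FINAL_NAME}"))
    return plan


def as_cli_commands(plan):
    """Turn the plan into `aws s3 cp` argument lists (built, not executed)."""
    return [["aws", "s3", "cp", local, uri] for local, uri in plan]


if __name__ == "__main__":
    # Local paths here are made up for the example.
    demo = plan_uploads(
        ["work/mgi-merged.gaf"],
        ["out/mgi-p2go-homology.gaf"],
        "out/mgi-p2go-homology.gaf.gz",
    )
    for cmd in as_cli_commands(demo):
        print(" ".join(cmd))
```

Because every key is derived from a fixed prefix and the file name, rerunning the plan replaces the same objects, which matches the "overwrite, no versions" behavior described here.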

I pushed this branch, and it will try to run on the next repository scan.

kltm commented 2 months ago

@sierra-moxon A quick note that we need the compressed version of the file.

sierra-moxon commented 2 months ago

fixed to use .gz version of the file.
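A minimal sketch of producing the `.gz` file, in case it helps anyone reproducing this locally. It assumes the uncompressed GAF already exists on disk; the helper name `gzip_file` is hypothetical and the real pipeline step may compress the file differently.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src):
    """Write a gzip copy of `src` next to it (name + ".gz") and return its path."""
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    # Stream the copy so a large GAF file is never held fully in memory.
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `gzip_file("mgi-p2go-homology.gaf")` would write `mgi-p2go-homology.gaf.gz` alongside the original and leave the uncompressed file in place.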

kltm commented 2 months ago

@sierra-moxon Sorry to ask, but I don't think the current production metadata points to this yet? Perhaps we should add an item to the top, just so this can be tracked?

kltm commented 2 months ago

Or maybe that's https://github.com/geneontology/go-site/issues/2285 ...in which case I'll put things back the way you had them :)

sierra-moxon commented 2 months ago

Yes, https://github.com/geneontology/go-site/issues/2285 is the one we should use to merge metadata changes in. I have the MGI metadata changes in this branch (where we point to the mirror version of the gopreprocess MGI GAF file, etc.). This branch also has a lot of hacking in it to make my pipeline go fast, so I will cherry-pick the changes into a new branch for merging into master/main.

geneontology/pipeline issue #369