kltm opened this issue 1 year ago (status: Open)
@kltm - there are two ways to go here:
1) gopreprocess just runs on its own (via docker with ansible/terraform as per usual, etc) in an EC2 machine somewhere, making its GAF files available in an S3 bucket (skyhook?). Then we adjust the pipeline/resource metadata to pick up these newly generated MGI automated annotation files from that new S3 bucket.
or
2) we try to incorporate the generation of these files into the pipeline (import the gopreprocess package into the pipeline runner and call the methods to produce GAF files via a bash script, pushing GAF files to skyhook intermediary location). We adjust the pipeline/resource metadata to pick up these newly generated MGI automated annotation files from skyhook.
Do you have a preference? (or a different idea?)
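A rough sketch of what (2) could look like as a single pipeline step — note the `gopreprocess` subcommand, the `aws s3 cp` destination, and the skyhook bucket name below are all placeholders for illustration, not the actual CLI:

```python
import shlex

def build_step_commands(release, skyhook_base="s3://go-build-skyhook"):
    """Assemble the (hypothetical) shell commands an option-(2) pipeline
    step would run: generate the MGI GAFs with gopreprocess, then push
    them to the skyhook intermediary location. Command and bucket names
    here are placeholders, not gopreprocess's real interface."""
    workdir = f"/tmp/gopreprocess/{release}"
    return [
        # step 1: produce the converted MGI annotation files locally
        f"gopreprocess convert-annotations --species MGI --output {shlex.quote(workdir)}",
        # step 2: push the products to skyhook for downstream pipeline stages
        f"aws s3 cp {shlex.quote(workdir)} {skyhook_base}/{release}/annotations/ --recursive",
    ]
```

Either way, the pipeline/resource metadata change is the same: point the MGI automated-annotation resource at the new location.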
@sierra-moxon I would like to incorporate this end-to-end in the current pipeline "framework" if possible, so would like to explore "2" first. That said, the pipeline is happy to work with docker, virtualenv, or anything else we want to use, so patching it in shouldn't be bad. If you're ready, we could talk about this today (or tomorrow in the go software slot) and work out a plan.
Tagging @dustine32 for the above convo too.
On today's project update meeting we decided that once the human and rat ISO annotations are in good shape, this ticket will be the next in the priority list. We will provide test files in GPAD2.0 for @leemdi to load into MGI. Then we will begin working on ticket #329.
some TODOs from experimenting today with download step:
this is getting much closer, the test pipeline is:
I've noticed that a simple concatenation of the GOA Protein2GO files with the converted files produces a GAF of over 700,000 lines. The current "non-noctua" file from MGI is closer to 200,000 lines.
The human orthology conversion in the new code is about 104,000 lines (very close to the human line count in the MGI file), and the rat orthology conversion is about 34,000 lines (very close to the rat line count in the MGI file). I imagine there is another set of requirements for weeding out duplicates from the GOA Protein2GO (mouse and mouse_isoform) files that we are not yet applying. Need to touch base with @ukemi to confirm.
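One quick way to see where the extra lines come from is to tally annotations by the Assigned_By column (column 15 in GAF 2.x) — a minimal sketch, assuming a plain tab-separated GAF with `!`-prefixed header lines:

```python
from collections import Counter

def count_by_assigned_by(gaf_lines):
    """Tally annotation lines per Assigned_By (GAF column 15, index 14),
    skipping '!' header/comment lines and blank lines."""
    counts = Counter()
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        counts[cols[14]] += 1
    return counts

# Example with two fabricated 15-column lines:
sample = [
    "!gaf-version: 2.2",
    "\t".join(["MGI", "MGI:97490", "Pax6"] + [""] * 11 + ["MGI"]),
    "\t".join(["UniProtKB", "P63328", "Ppp3ca"] + [""] * 11 + ["GO_Central"]),
]
print(count_by_assigned_by(sample))  # Counter({'MGI': 1, 'GO_Central': 1})
```

Running this over both the concatenated file and MGI's non-noctua file should show which sources account for the ~500,000-line difference.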
The next step is to figure out how to get a GPAD with all these mouse annotations out using existing software in the pipeline (in the correct/latest format). And then finally, to figure out how to modify the MGI go-site metadata to remove the upstream GAF from MGI and replace it with the GAF we generate here.
@sierra-moxon This is great. It seems like a lot of duplicates. I would be surprised if there were a half million of them. I will try to take a look at this during the shoulder times of the meeting and we can look together tomorrow. The non-noctua file also contains the mouse annotations made directly from UniProt as well as our IEA annotations. So I'm surprised that your file is so much bigger. Maybe I am missing something.
Hi @sierra-moxon, I just read your message. I think we will need to process the GOA Protein2GO files. I'm pretty sure that they contain all of the annotations from MGI, so as you said, those would be duplicates. When you process that file, I assume you are converting the UniProt mouse identifiers to MGI gene identifiers?
Nope! Currently I am not doing anything to the files. This is exactly what I needed, thank you David! So two requirements: 1) remove the annotations where provided_by = MGI? (lots are provided_by GO_Central — keep those?) 2) convert UniProt IDs to MGI IDs?
anything else?
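A sketch of those two requirements as a single pass over the Protein2GO GAF lines — assuming the `uniprot_to_mgi` lookup is built elsewhere (e.g. from an MGI marker association report; that source is an assumption here), and keeping GO_Central-assigned lines pending David's answer:

```python
def dedupe_and_remap(gaf_lines, uniprot_to_mgi):
    """Requirement 1: drop lines assigned by MGI (duplicates of MGI's own
    annotations). Requirement 2: rewrite UniProtKB subject IDs (GAF
    columns 1-2) to MGI gene IDs via the supplied lookup dict."""
    out = []
    for line in gaf_lines:
        if line.startswith("!"):
            out.append(line)              # keep header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if cols[14] == "MGI":             # requirement 1
            continue
        if cols[0] == "UniProtKB":        # requirement 2
            mgi_id = uniprot_to_mgi.get(cols[1])
            if mgi_id is None:
                continue                  # unmapped IDs: drop (or log for review)
            cols[0], cols[1] = "MGI", mgi_id
        out.append("\t".join(cols))
    return out

# Fabricated example: one MGI-assigned line (dropped), one UniProt line (remapped).
sample = [
    "!gaf-version: 2.2",
    "\t".join(["MGI", "MGI:97490", "Pax6"] + [""] * 11 + ["MGI"]),
    "\t".join(["UniProtKB", "P63328", "Ppp3ca"] + [""] * 11 + ["GO_Central"]),
]
remapped = dedupe_and_remap(sample, {"P63328": "MGI:107164"})
```

Whether GO_Central lines (or unmapped UniProt IDs) should also be filtered is exactly the open question above.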
Took me a bit, but I found this. We can take a look at it with @leemdi tomorrow. https://docs.google.com/document/d/1lJp_PAQ4517ADcJnU96YRLtUCRSKz7MXT2anov2Udvc/edit
The last talk at the meeting just reminded me that we probably won't want to copy the above strategy exactly since we will want the IEA annotations to be part of the load. I believe that those are included in the GOA Protein2GO file, but am noting here that we need to remember to take a look tomorrow.
Current work headed at https://github.com/biolink/ontobio/pull/663
I think this is done.
This is open so that I can keep my noctua confinement files on skyhook for my issue-325 pipeline runs. purely technical book-keeping.
I think this can close? @sierra-moxon I thought you had closed/merged the issue-325 pipeline?
I think Seth would like to keep it open until the MGI files produced in the new pipeline make their way through snapshot and get signed off on.
Sounds good, thanks for the quick reply.
This ticket is for the pipeline changes needed so that the "black box" step from "pre-MGI upstreams" to "post-MGI products" can be produced locally.
Tagging @sierra-moxon @ukemi