kltm opened this issue 1 year ago (status: Open)
@kltm - there are two ways to go here:
1) gopreprocess just runs on its own (via docker with ansible/terraform as per usual, etc) in an EC2 machine somewhere, making its GAF files available in an S3 bucket (skyhook?). Then we adjust the pipeline/resource metadata to pick up these newly generated MGI automated annotation files from that new S3 bucket.
or
2) we try to incorporate the generation of these files into the pipeline (import the gopreprocess package into the pipeline runner and call the methods to produce GAF files via a bash script, pushing GAF files to skyhook intermediary location). We adjust the pipeline/resource metadata to pick up these newly generated MGI automated annotation files from skyhook.
Do you have a preference? (or a different idea?)
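A rough sketch of what (2) could look like as a single pipeline step — note the `gopreprocess` subcommand, the `aws s3 cp` destination, and the skyhook bucket name below are all placeholders for illustration, not the actual CLI:

```python
import shlex

def build_step_commands(release, skyhook_base="s3://go-build-skyhook"):
    """Assemble the (hypothetical) shell commands an option-(2) pipeline
    step would run: generate the MGI GAFs with gopreprocess, then push
    them to the skyhook intermediary location. Command and bucket names
    here are placeholders, not gopreprocess's real interface."""
    workdir = f"/tmp/gopreprocess/{release}"
    return [
        # step 1: produce the converted MGI annotation files locally
        f"gopreprocess convert-annotations --species MGI --output {shlex.quote(workdir)}",
        # step 2: push the products to skyhook for downstream pipeline stages
        f"aws s3 cp {shlex.quote(workdir)} {skyhook_base}/{release}/annotations/ --recursive",
    ]
```

Either way, the pipeline/resource metadata change is the same: point the MGI automated-annotation resource at the new location.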
@sierra-moxon I would like to incorporate this end-to-end in the current pipeline "framework" if possible, so would like to explore "2" first. That said, the pipeline is happy to work with docker, virtualenv, or anything else we want to use, so patching it in shouldn't be bad. If you're ready, we could talk about this today (or tomorrow in the go software slot) and work out a plan.
Tagging @dustine32 for the above convo too.
On today's project update meeting we decided that once the human and rat ISO annotations are in good shape, this ticket will be the next in the priority list. We will provide test files in GPAD2.0 for @leemdi to load into MGI. Then we will begin working on ticket #329.
some TODOs from experimenting today with download step:
this is getting much closer, the test pipeline is:
I've noticed that a simple concatenation of the GOA Protein2GO files with the converted files produces a GAF of over 700,000 lines. The current "non-noctua" file from MGI is closer to 200,000 lines.
The human orthology conversion in the new code is about 104,000 lines (very close to the human line count in the MGI file), and the rat orthology conversion is about 34,000 lines (very close to the rat line count in the MGI file). I imagine there is another set of requirements for weeding out duplicates from the GOA Protein2GO (mouse and mouse_isoform) files that we are not yet applying. Need to touch base with @ukemi to confirm.
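One quick way to see where the extra lines come from is to tally annotations by the Assigned_By column (column 15 in GAF 2.x) — a minimal sketch, assuming a plain tab-separated GAF with `!`-prefixed header lines:

```python
from collections import Counter

def count_by_assigned_by(gaf_lines):
    """Tally annotation lines per Assigned_By (GAF column 15, index 14),
    skipping '!' header/comment lines and blank lines."""
    counts = Counter()
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        counts[cols[14]] += 1
    return counts

# Example with two fabricated 15-column lines:
sample = [
    "!gaf-version: 2.2",
    "\t".join(["MGI", "MGI:97490", "Pax6"] + [""] * 11 + ["MGI"]),
    "\t".join(["UniProtKB", "P63328", "Ppp3ca"] + [""] * 11 + ["GO_Central"]),
]
print(count_by_assigned_by(sample))  # Counter({'MGI': 1, 'GO_Central': 1})
```

Running this over both the concatenated file and MGI's non-noctua file should show which sources account for the ~500,000-line difference.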
The next step is to figure out how to get a GPAD with all these mouse annotations out using existing software in the pipeline (in the correct/latest format). And then finally, to figure out how to modify the MGI go-site metadata to remove the upstream GAF from MGI and replace it with the GAF we generate here.
@sierra-moxon This is great. It seems like a lot of duplicates. I would be surprised if there were a half million of them. I will try to take a look at this during the shoulder times of the meeting and we can look together tomorrow. The non-noctua file also contains the mouse annotations made directly from UniProt as well as our IEA annotations. So I'm surprised that your file is so much bigger. Maybe I am missing something.
Hi @sierra-moxon, I just read your message. I think we will need to process the GOA Protein2GO files. I'm pretty sure that they contain all of the annotations from MGI, so as you said, those would be duplicates. When you process that file, I assume you are converting the UniProt mouse identifiers to MGI gene identifiers?
Nope! Currently I am not doing anything to the files. This is exactly what I needed, thank you David! So two requirements: 1) remove the annotations where provided_by = MGI? (lots are provided_by GO_Central — keep those?) 2) convert UniProt IDs to MGI IDs?
anything else?
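A sketch of those two requirements as a single pass over the Protein2GO GAF lines — assuming the `uniprot_to_mgi` lookup is built elsewhere (e.g. from an MGI marker association report; that source is an assumption here), and keeping GO_Central-assigned lines pending David's answer:

```python
def dedupe_and_remap(gaf_lines, uniprot_to_mgi):
    """Requirement 1: drop lines assigned by MGI (duplicates of MGI's own
    annotations). Requirement 2: rewrite UniProtKB subject IDs (GAF
    columns 1-2) to MGI gene IDs via the supplied lookup dict."""
    out = []
    for line in gaf_lines:
        if line.startswith("!"):
            out.append(line)              # keep header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if cols[14] == "MGI":             # requirement 1
            continue
        if cols[0] == "UniProtKB":        # requirement 2
            mgi_id = uniprot_to_mgi.get(cols[1])
            if mgi_id is None:
                continue                  # unmapped IDs: drop (or log for review)
            cols[0], cols[1] = "MGI", mgi_id
        out.append("\t".join(cols))
    return out

# Fabricated example: one MGI-assigned line (dropped), one UniProt line (remapped).
sample = [
    "!gaf-version: 2.2",
    "\t".join(["MGI", "MGI:97490", "Pax6"] + [""] * 11 + ["MGI"]),
    "\t".join(["UniProtKB", "P63328", "Ppp3ca"] + [""] * 11 + ["GO_Central"]),
]
remapped = dedupe_and_remap(sample, {"P63328": "MGI:107164"})
```

Whether GO_Central lines (or unmapped UniProt IDs) should also be filtered is exactly the open question above.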
Took me a bit, but I found this. We can take a look at it with @leemdi tomorrow. https://docs.google.com/document/d/1lJp_PAQ4517ADcJnU96YRLtUCRSKz7MXT2anov2Udvc/edit
The last talk at the meeting just reminded me that we probably won't want to copy the above strategy exactly since we will want the IEA annotations to be part of the load. I believe that those are included in the GOA Protein2GO file, but am noting here that we need to remember to take a look tomorrow.
Current work headed at https://github.com/biolink/ontobio/pull/663
I think this is done.
This is open so that I can keep my noctua confinement files on skyhook for my issue-325 pipeline runs. purely technical book-keeping.
I think this can close? @sierra-moxon I thought you had closed/merged the issue-325 pipeline?
I think Seth would like to keep it open until the MGI files produced in the new pipeline make their way through snapshot and get signed off on.
Sounds good, thanks for the quick reply.
This ticket is for the pipeline changes needed so that the "black box" step from "pre-MGI upstreams" to "post-MGI products" can be produced locally.
Tagging @sierra-moxon @ukemi