Integrate remainder of MGI pipeline into the GO pipeline

kltm commented 1 year ago

Project link

https://github.com/orgs/geneontology/projects/136

Project description

Currently, the GOC picks up MGI ortholog and upstream annotation data from MGI. This project will be complete when the GOC directly pulls in this data, processes it, and adds it to the current data flow. This would remove MGI from the loop of directly processing MGI/mouse function data.

PI

Chris

Product owner (PO)

Li/Pascale

Technical lead (TL)

Sierra

Other personnel (OP)

Seth, Dustin, Anushya

Technical specs

While there is new software being written for this project, it is either 1) within the bounds of current technologies and practices or is 2) custom and one-off, not to be reused elsewhere. The needs of the project are described in great detail in the folders listed below; minimally meeting these requirements and rendering them into a pipeline is the entire scope of the project.

Other comments

This is a continuation of:

https://github.com/geneontology/project-management/issues/42
https://github.com/orgs/geneontology/projects/109

kltm commented 1 year ago

Letting @pgaudet and @ukemi know that this is seeded with likely personnel.

kltm commented 1 year ago

Possible order of operations

[ ] add mock pipeline and infrastructure
[ ] + rat orthology
https://docs.google.com/document/d/123o6GJ0lBwE7xUPM_LJXDJ-DoZeCN7Zh
[ ] + human orthology
[ ] + mouse annotation (uniprot upstreams)

kltm commented 1 year ago

TODO: add clarification for orthology source and how to process down to positive/negative list

kltm commented 1 year ago

Process documentation folder: https://drive.google.com/drive/folders/17O5e3gj_fkbSv2vscEYNIzpCNLIq3fG2

kltm commented 11 months ago

QC rounds folder with @ukemi and @sierra-moxon https://drive.google.com/drive/folders/1q_KNRV9iwCndS_tWlvYyx_hddUFfqiDJ

ukemi commented 11 months ago

QC rounds show that the Rat ISO load is done. @sierra-moxon will begin working on the human ISO annotations and @ukemi will begin QC on those. Once GPAD specs have been finalized, the GOC will begin providing test files for Lori to load into MGI.

kltm commented 11 months ago

New repository for this project at https://github.com/geneontology/gopreprocess

kltm commented 11 months ago

Noting for @pgaudet that we have hit a couple of slowdown points WRT needing to update some core software to support recent tooling (basically, we need to start updating from some very old Python versions). This will likely result in a small overhead increase for the project and draw in myself and @dustine32 for some tasks.

ukemi commented 8 months ago

Note that the rat and human ISO parts of the pipeline are close to completion and we have begun working on mouse annotations from Protein2GO. Completion of this project is rate-limited by the GOC-wide conversion to the GPAD2.0 format and the generation of the GPAD2.0 files. There are also some issues to be discussed at the GOC level:

Currently the filtering of 'duplicates' is not taking place at the GOC end. We need to put this into place not only for this project, but globally for the entire GOC.
Do we want annotations that do not map to mouse genes in MGI in the corpus of annotations resident at the GOC?
Do we want annotations from all of the IEA pipelines that are emitted from UniProt?
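
For discussion of the first point, here is a minimal sketch of what order-preserving duplicate filtering could look like. The thread does not define what counts as a 'duplicate', so the key fields below (`subject`, `relation`, `term`, `reference`, `evidence`) and the dict-per-annotation shape are assumptions for illustration only, not the GOC's or MGI's actual rule.

```python
from typing import Iterable

# Hypothetical duplicate key: the thread does not say which fields
# must match for two annotations to count as duplicates.
DUPLICATE_KEY = ("subject", "relation", "term", "reference", "evidence")


def drop_duplicates(annotations: Iterable[dict]) -> list[dict]:
    """Keep the first annotation seen for each key, preserving input order."""
    seen = set()
    kept = []
    for ann in annotations:
        key = tuple(ann[field] for field in DUPLICATE_KEY)
        if key not in seen:
            seen.add(key)
            kept.append(ann)
    return kept
```

Keeping the first occurrence is itself a design choice; if a preferred source should win when two annotations collide, the input would need to be sorted by that preference first.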

kltm commented 5 months ago

@pgaudet I had a long conversation with @sierra-moxon and have a feel for the position of the work. Basically, in a perfect world, it may be that all direct (i.e. ontobio) software work is done and all that's left is checking, making a GPAD/GPI 2.0 announcement, and running it through the main pipeline. That said, this needs to be confirmed, and running this through a pipeline that is a decent simulation of the final work is hitting most of the same problems we hit when trying to do release pipeline stuff. To get past this, I'll be prioritizing pushing this through, by whatever methods I can, to land it on a "close enough" version of the final product so that we can do any final debugging and confirm the output with MGI. Once MGI has given that confirmation, it will be on us to make the final timeline and do the technical stuff. I've assigned myself https://github.com/geneontology/pipeline/issues/325.

pgaudet commented 5 months ago

It may be that the GPAD/GPI production would be better off as a separate project.

kltm commented 4 months ago

Talking to @pgaudet and @suzialeksander , next concrete steps are

[ ] produce and confirm output for current "quick" test pipeline runs
[ ] make sure that MGI gets access to these files (@sierra-moxon )
[ ] using these files, confirm that GPAD/GPI 2.0 look "good" for MGI (@sierra-moxon )
[ ] using these files, confirm that GPAD/GPI 2.0 look "good" for consortium (@pgaudet )
[ ] @pgaudet and @suzialeksander to send announcement that the format will be our primary output after date X/Y/Z
[ ] on date X/Y/Z, GPAD/GPI 2.0 code moved into the main pipeline branches (i.e. snapshot and release)
[ ] proceed.

(Note, if more work is needed on the MGI/QC side, we are likely to proceed with adding the code sooner anyways.)

kltm commented 4 months ago

@pgaudet Some of the development team took a look at the output from the test pipeline and there are some issues with the data that we want to pin down before passing the results on to MGI--mainly an increase in annotations in one file that we're having a little trouble tracing. This will mean 1) re-running some of the data (about half a day lag, assuming the pipelines are cooperative) and 2) tweaking/checking a GPAD reprocessing step. We will be meeting again mid-week to see where we're at.
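
One way to narrow down where an increase like this comes from is to compare annotation counts, grouped by some field, between an earlier run and the new one. The sketch below is illustrative only: the row-as-dict shape and the grouping field name are assumptions, and it is not the check the team actually ran.

```python
from collections import Counter


def count_by(rows, field):
    """Count rows per distinct value of `field`."""
    return Counter(row[field] for row in rows)


def count_changes(before, after, field):
    """Return {value: after_count - before_count} for values whose count changed."""
    old = count_by(before, field)
    new = count_by(after, field)
    # Counter returns 0 for missing keys, so values that appear in
    # only one of the two runs are handled without special cases.
    return {
        value: new[value] - old[value]
        for value in sorted(set(old) | set(new))
        if new[value] != old[value]
    }
```

Grouping by different fields in turn (for example, the providing group or the evidence type, if those are present in the rows) can show whether the extra annotations are concentrated in one slice of the file.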

kltm commented 4 months ago

Also tagging @sierra-moxon and @dustine32

kltm commented 4 months ago

@pgaudet Changing PO to Li/Pascale

kltm commented 3 months ago

@pgaudet Talking to @sierra-moxon, the remaining items in https://github.com/orgs/geneontology/projects/136 are MGI bookkeeping items, with all GO-driven items now moved or being re-created for https://github.com/orgs/geneontology/projects/155.

But these are still open items in our tracker. I think one way forward, to prevent confusion, would be to rename the project and project metadata to make clear that this is now an "MGI sub-project" and move it into the external collab category (i.e. no more "GO" resources, beyond communication, unless something bad happens).

pgaudet commented 2 months ago

@kltm Is https://github.com/geneontology/go-site/issues/2043 an MGI task?

kltm commented 2 months ago

@pgaudet Assuming no answer is needed, as it is now closed.
