geneontology / project-management

Tracking project metadata in the GO as issues.

SGD - Noctua migration #34

Closed pgaudet closed 6 months ago

pgaudet commented 1 year ago
Project link

https://github.com/orgs/geneontology/projects/61

Project description

Tasks needed to complete the migration of manual SGD annotations from Protein2GO to Noctua for full adoption of Noctua as SGD GO curation tool.

PI

Mike

Project owner (PO)

@suzialeksander

Technical lead (TL)

@dustine32

Other personnel (OP)

TBD

Technical specs

Using current system

Other comments

Code changes will likely be in https://github.com/biolink/ontobio; tickets about bugs/code will be here.

SGD models will be isolated in https://github.com/geneontology/sgd-go-cams; tickets about project progress will be here.

pgaudet commented 1 year ago

Update on managers call:

pgaudet commented 1 year ago

Was delayed because of GPI issue (now fixed)

pgaudet commented 1 year ago

Managers call: @suzialeksander says data not yet loaded

pgaudet commented 1 year ago

@suzialeksander Can you add the current state of the project here please, i.e. the remaining work and the planned delivery date?

Thanks, Pascale

suzialeksander commented 1 year ago

These models are in Noctua Dev, and SGD is testing. Hopefully, if no major issues, we can ask to have these loaded to prod quite soon (couple weeks?). Other remaining work includes switching pipelines at SGD/GO to not scoop QuickGO anymore and make sure we aren't eating tails anywhere.

pgaudet commented 1 year ago
srengel commented 9 months ago

SGD is ready for this project to move forward.

steps to get into prod:

  1. fresh dump from Alex of SGD P2GO content, then delete the annotations at P2GO or "freeze" by marking as archived (so that they are not used by anyone since once loaded into Noctua, the P2GO annotations with source=SGD will now be duplicates of what is in Noctua)
  2. Dustin (?) load the fresh set of annotations into Noctua prod
  3. at SGD, we will continue our pipeline 'as is' to gather annotations and generate necessary files. (only difference is that the P2GO load into SGD pipeline will no longer have any source=SGD annotations.)
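
The effect of step 3 on the P2GO load can be sketched roughly as below. This is only an illustration, not SGD's actual pipeline: it assumes GPAD 1.x tab-separated rows with Assigned_by in the tenth column, and `split_by_source`, `sample_row` and the sample values are hypothetical.

```python
ASSIGNED_BY_COL = 9  # assumption: GPAD 1.x layout, Assigned_by is the 10th column (0-based index 9)


def split_by_source(gpad_text, source="SGD"):
    """Split GPAD rows into (rows assigned by `source`, all other rows).

    Header/comment lines starting with '!' and blank lines are skipped.
    """
    from_source, others = [], []
    for line in gpad_text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.split("\t")
        if cols[ASSIGNED_BY_COL] == source:
            from_source.append(cols)
        else:
            others.append(cols)
    return from_source, others


def sample_row(obj, term, ref, assigned_by):
    """Build a made-up 12-column GPAD 1.x row for the example."""
    return "\t".join(["SGD", obj, "involved_in", term, ref, "ECO:0000315",
                      "", "", "20240101", assigned_by, "", ""])


sample = "\n".join([
    "!gpa-version: 1.1",
    sample_row("S000000001", "GO:0000001", "PMID:1", "SGD"),
    sample_row("S000000002", "GO:0000002", "PMID:2", "ComplexPortal"),
    sample_row("S000000003", "GO:0000003", "PMID:3", "UniProt"),
])

sgd_rows, remainder_rows = split_by_source(sample)
print(len(sgd_rows), "rows with source=SGD;", len(remainder_rows), "rows left in the P2GO load")
```

After the migration, the `sgd_rows` part would be expected to come from Noctua instead, so the P2GO load should contain only the remainder.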
pgaudet commented 9 months ago

Thanks @srengel !

Just to check:

srengel commented 9 months ago

Hi @pgaudet

FYI, we currently have GO annotations in SGD from these sources: GO_Central, GOC, InterPro, RNAcentral, SGD, UniProt, ComplexPortal, RHEA

suzialeksander commented 8 months ago

Update, SGD is ready for a new file. Pending production of the new file by @alexsign, we should be able to load a new file next week with help from @dustine32. There are a few pipeline tweaks and data checks for both P2GO and SGD after that.

suzialeksander commented 8 months ago

There is some documentation at https://docs.google.com/document/d/1PZH2SiyF9FJhvW_M_cr3GlReSZfvbj96AkFHs9DD6Qc. To expand the process:

  1. Alex generated a GPAD of all annotations with source=SGD.
  1. Load this file into Noctua
  2. SGD curators compare the Noctua load (in Dev or Prod) to the annotations in P2GO. Filter in Noctua by Title: SGD:*
  3. P2GO checks that no annotations were lost, and the few they needed to retain (not to Sc) are still in P2GO.
  4. P2GO marks all Sc annotations assigned_by:SGD as status=archived
  5. SGD moves the Noctua GPAD as the priority load at SGD, P2GO continues to produce the "remainders" file of IEAs, non-SGD sources (ComplexPortal, etc.)
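
The "no annotations were lost" check in the steps above could be approximated with a set difference between the two exports. This is a hedged sketch, not the check P2GO actually ran: it assumes both files use the GPAD 1.x column order, it compares only object ID, GO ID, reference and evidence code, and `parse_gpad`, `annotation_keys`, `missing_after_load` and the sample rows are hypothetical.

```python
def parse_gpad(text):
    """Parse GPAD text into lists of columns, skipping '!' header lines and blanks."""
    return [line.split("\t") for line in text.splitlines()
            if line.strip() and not line.startswith("!")]


def annotation_keys(rows):
    """Reduce rows to comparable keys.

    Assumed GPAD 1.x positions: 1 = DB_Object_ID, 3 = GO ID,
    4 = DB:Reference, 5 = evidence code.
    """
    return {(r[1], r[3], r[4], r[5]) for r in rows}


def missing_after_load(source_rows, loaded_rows):
    """Annotations present in the source export but absent from the loaded set."""
    return sorted(annotation_keys(source_rows) - annotation_keys(loaded_rows))


def make_row(obj, term, ref):
    """Build a made-up 12-column row for the example."""
    return "\t".join(["SGD", obj, "involved_in", term, ref, "ECO:0000315",
                      "", "", "20240101", "SGD", "", ""])


p2go_export = "\n".join([make_row("S000000001", "GO:0000001", "PMID:1"),
                         make_row("S000000002", "GO:0000002", "PMID:2")])
noctua_export = make_row("S000000001", "GO:0000001", "PMID:1")

lost = missing_after_load(parse_gpad(p2go_export), parse_gpad(noctua_export))
for key in lost:
    print("not found in Noctua export:", key)
```

A real comparison would likely need a wider key (qualifier, with/from, extensions), since two annotations can share these four fields and still differ.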

in progress

kltm commented 8 months ago

@suzialeksander As we get closer to the end of this, it would be good to work out a final timeline for these steps to make sure that we're not causing any double-ups or gaps anywhere.

@vanaukenk It might also be interesting to see what the profile of this import set is when viewed through the lens of @balhoff 's recent tooling for https://github.com/geneontology/go-shapes/issues/306 . It might help contextualize some choices before final commitments are made.

suzialeksander commented 8 months ago

Final (?) planning call today: agenda in Shared Drive

Slides with current yeast dataflow, and nearly identical flow post-project

Next steps:

Feb 1:

Upon successful snapshot containing above models:

suzialeksander commented 8 months ago

Also, SGD is waiting for the remainders file that @alexsign is working on.

suzialeksander commented 8 months ago

After the outage, the models seem to have landed as intended. Success!!

However, a tiny issue emerged during spot checking: two curators were assigned the same ORCID when the files were converted for loading, which affects ~647 of the 7075 models loaded.
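
A collision like this could be caught before loading with a small check on the curator-to-ORCID mapping. The sketch below is illustrative only: the thread does not describe how the conversion mapping is stored, so the dictionary shape, the `shared_orcids` helper, the curator names and the placeholder ORCIDs are all assumptions.

```python
from collections import defaultdict


def shared_orcids(curator_to_orcid):
    """Return {orcid: [curators]} for every ORCID mapped to more than one curator."""
    by_orcid = defaultdict(list)
    for curator, orcid in curator_to_orcid.items():
        by_orcid[orcid].append(curator)
    return {orcid: sorted(names) for orcid, names in by_orcid.items() if len(names) > 1}


# Hypothetical mapping used during file conversion; the ORCIDs are placeholders.
mapping = {
    "curator_a": "0000-0000-0000-0001",
    "curator_b": "0000-0000-0000-0001",  # same ORCID as curator_a: should be flagged
    "curator_c": "0000-0000-0000-0002",
}

collisions = shared_orcids(mapping)
for orcid, names in collisions.items():
    print(orcid, "is assigned to:", ", ".join(names))
```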

Next steps:

alexsign commented 8 months ago

@suzialeksander remainders file available now. same name, same place. please take a look and let me know if it's good.

kltm commented 8 months ago

Noting that https://github.com/geneontology/project-management/issues/34#issuecomment-1922756579 has changed given recent discussions: we will be doing a full clobber, with the expectation that this is essentially a re-run of yesterday (as SGD is still in their curation freeze).

suzialeksander commented 7 months ago

Update from the 8 Feb Noctua outage/load:

Spot checking has revealed some extraneous inferred annotations, specifically "reproductive process" from

cellular response to pheromone PMID:12446563 IMP 20231004 SGD part_of(conjugation with cellular fusion)

The immediate actions are:

vanaukenk commented 7 months ago

Just to clarify in advance of today's call - I can't speak directly for them, but I doubt that MGI would also want redundant ancestor/child annotations with the same evidence code from the same paper.

There may, however, be other inferred annotations that they do want.

One other option we've considered for the GPAD output is to give inferred annotations their own evidence code so they could more readily be filtered if groups do not want them.

That said, it would still be nice to create useful inferences wherever possible.
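
To make the "redundant ancestor/child" case concrete, here is a minimal sketch of such a filter. It is not how the GPAD export works: the toy hierarchy, the `EX:` term IDs, and the `ancestors` and `drop_redundant` helpers are all made up for illustration, and "redundant" is assumed to mean "same gene, same reference, same evidence code, and the term is an ancestor of another annotated term".

```python
# Toy child -> parents map; these are NOT real GO terms or relations.
PARENTS = {
    "EX:child": {"EX:mid"},
    "EX:mid": {"EX:root"},
    "EX:root": set(),
}


def ancestors(term, parents):
    """Return all ancestors of `term` (excluding itself) by walking the parent map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def drop_redundant(annotations, parents):
    """Drop annotations whose term is an ancestor of another annotated term
    for the same gene, reference and evidence code.

    `annotations` is a list of (gene, term, reference, evidence) tuples.
    """
    kept = []
    for gene, term, ref, ev in annotations:
        redundant = any(
            g == gene and r == ref and e == ev and term in ancestors(t, parents)
            for g, t, r, e in annotations
        )
        if not redundant:
            kept.append((gene, term, ref, ev))
    return kept


annotations = [
    ("geneA", "EX:child", "PMID:1", "IMP"),
    ("geneA", "EX:root", "PMID:1", "IMP"),  # ancestor of EX:child, same paper and evidence
    ("geneA", "EX:root", "PMID:2", "IMP"),  # different paper, so it is kept
]

filtered = drop_redundant(annotations, PARENTS)
for annotation in filtered:
    print(annotation)
```

A separate evidence code for inferred annotations, as suggested above, would make this kind of post hoc filtering unnecessary for groups that simply want to exclude them.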

suzialeksander commented 7 months ago

After today's call, @kltm and @cmungall will look into diff'ing the terms and seeing if adding do_not_annotate or similar tags on terms will help, as it's likely most of the inferred annotations are to a handful of terms with a long tail.

Managers agreed that dealing with inferred annotations is really a separate project from the import. Further work in this new project would include giving these inferred annotations a more accurate EC, rather than implying that the curator made them directly. The inferred annotation situation is analogous to when GO inferred BP-MF annotations, then backtracked.

kltm commented 7 months ago

Noting that the diff/exploration has a ticket here: https://github.com/geneontology/noctua-models/issues/271

suzialeksander commented 7 months ago

After spot checking some models, there are ShEx violations in several models: incorrect relations for the terms, etc. Waiting on the violation report to see a full list, but the ones that have come up are individually fixable so far.

As for the inferences, it seems these might be fixable through ontology improvements.

Still waiting on a release to make sure the entire cycle SGD-GO works, but starting to test the snapshot that just came out.

suzialeksander commented 6 months ago

After discussing with @pgaudet: the GO part of this work is now finished. The remaining tickets are work for the SGD curators to fix annotations. We will close this project and open a new one specifically for SGD tasks and cleanup of annotations that are in Noctua.

srengel commented 6 months ago

@alexsign Please start Noctua import. After it’s done, please cross check annotations and delete old ones. Then please make NoctuaSGD public. (this is our understanding of remaining steps for this project. please correct us if this is wrong.)

for reference, this is Suzi's email from last Thursday, Mar 28:

Hi Alex,

Thanks for these files. We've looked at them, specifically the P2GO_not_in_Noctua, and it looks like these are left out mostly due to being not yeast, or not in the protein-centric world (this is expected, lots of RNAs and such). The latest snapshot is 2024-03-21, and Pascale and I cleared it for SGD annotations this week, although it doesn't have a lot of our latest edits to save annotations that failed the import. I think everything looks good for you to proceed with Step 4, deleting SGD data from P2GO and making NoctuaSGD public.

Thanks Suzi