geneontology / syngo-go-cams

A set of GO-CAMs built from SynGO data

Determine testing plan for each SynGO load #3

Open dustine32 opened 2 years ago

dustine32 commented 2 years ago

This ticket is for defining a testing/release plan for any new SynGO JSON-to-TTL conversion load. What stats should we track? Who should test? What specifically should they test (e.g. check conversion stats, "look" at GO-CAMs in Noctua, check GPAD output)?

Input: JSON file from https://github.com/geneontology/syngo2lego_data_conversion

Output: TTL files (e.g. SYNGO_1234.ttl)

Some stats:

  1. JSON object count (input)
  2. Model count (input)
  3. Distinct ECO codes count (input)
  4. ECO code instance count (input)
  5. TTL file count (output)
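
The five counts listed here could be gathered with a short script along these lines. This is only a sketch: the JSON layout (a top-level list of objects, each with a `model_id` string and an `eco_codes` list) is an assumption made for illustration and is not taken from the SynGO export. The key names and both function names are hypothetical and would need adjusting to the real file.

```python
from collections import Counter
from pathlib import Path


def input_stats(records, model_key="model_id", eco_key="eco_codes"):
    """Count objects, models, and ECO usage in the parsed input JSON.

    `records` is assumed to be a list of dicts. `model_key` and `eco_key`
    are hypothetical field names, not the real SynGO schema.
    """
    eco_counts = Counter(
        eco for rec in records for eco in rec.get(eco_key, [])
    )
    return {
        "json_object_count": len(records),
        "model_count": len({rec[model_key] for rec in records}),
        "distinct_eco_count": len(eco_counts),
        "eco_instance_count": sum(eco_counts.values()),
    }


def ttl_file_count(ttl_dir):
    """Count the *.ttl files written to the output directory."""
    return sum(1 for _ in Path(ttl_dir).glob("*.ttl"))


# Toy records for illustration only (not real SynGO data).
records = [
    {"model_id": "SYNGO:1", "eco_codes": ["ECO:0000314", "ECO:0000315"]},
    {"model_id": "SYNGO:2", "eco_codes": ["ECO:0000314"]},
]
stats = input_stats(records)
```

Comparing `stats["model_count"]` with `ttl_file_count(...)` for the output directory gives the basic input-versus-output check.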

For the conversion load in #2, here are the stats:

JSON object count: 3242
Model count: 3242
Distinct ECO codes count: 35
ECO code instance count: 10795
TTL file count: 3241

(Looks like one model didn't get converted to TTL: SYNGO:1805 - RGD:621347 GO:0099090)
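
A mismatch like this one can be pinned down by comparing the input model IDs against the TTL file names. The sketch below assumes that a model ID such as `SYNGO:1805` is written out as `SYNGO_1805.ttl`, which is inferred from the `SYNGO_1234.ttl` example earlier in this ticket. `missing_ttl_models` is a hypothetical helper, not part of the conversion code.

```python
from pathlib import PurePath


def missing_ttl_models(model_ids, ttl_filenames):
    """Return the input model IDs that have no matching TTL file, sorted.

    Assumption: "SYNGO:1805" is written out as "SYNGO_1805.ttl".
    """
    written = {PurePath(name).stem for name in ttl_filenames}
    return sorted(
        model_id
        for model_id in set(model_ids)
        if model_id.replace(":", "_") not in written
    )


# Toy example: two input models, only one TTL file produced.
missing = missing_ttl_models(
    ["SYNGO:1804", "SYNGO:1805"],
    ["SYNGO_1804.ttl"],
)
```

In practice the file names could come from something like `[p.name for p in Path(output_dir).glob("*.ttl")]`, where `output_dir` is wherever the conversion writes its TTL files.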

Downstream effects should also be considered, though that can probably be handled/tracked in a different repo. For instance, the "ECO code instance count" might closely correlate with the total number of contributor=SynGO annotations produced by the GO pipeline.

Tagging @thomaspd @vanaukenk @pgaudet @kltm

Feel free to move this ticket to a different repo if that makes sense!

kltm commented 2 years ago

Cheers! Interesting that one didn't make it--I didn't expect that.

We should probably sort this into the right project as owner and specs are assembled.

thomaspd commented 2 years ago

Thanks Dustin. Based on the discussion at the managers meeting and my discussion with Dustin today, I'll be the project owner, and Dustin the tech lead. We discussed a process for SynGO loads with the following steps (Dustin, please correct whatever I got wrong or missed):

  1. Make sure the number of models matches the input SynGO JSON (the step Dustin is working on).
  2. Load the TTL models into the model repo and make available on dev
  3. Dustin will discuss with @kltm about the possibility of running the same "pipeline test" code (not sure if that's the right name for it-- Dustin please correct me) that he used for the MGI Noctua migration testing. We can check the number of standard GO annotations (GPAD) created from this step and fix internal issues if any, and/or contact SynGO if necessary.
  4. When step 3 looks OK (maybe based on a review by Pascale?) we will ask the SynGO team (currently Frank, I think) to review the models on dev, if he would like. We will suggest a two week review window.
  5. When SynGO is OK with the models, we will move them to prod and incorporate them in the next GO release (maybe Pascale will check that the numbers match?).

dustine32 commented 2 years ago

@thomaspd Test files available now here: http://skyhook.berkeleybop.org/issue-238-wormbase-test-pipeline/annotations/

dustine32 commented 2 years ago

I'll plan on fixing SYNGO:1805 and updating the PR.

dustine32 commented 2 years ago

@thomaspd Noting that the one missing model SYNGO:1805 has been fixed and added to this PR: https://github.com/geneontology/noctua-models/pull/235#issuecomment-1159328620

kltm commented 2 years ago

@kltm will make a noctua_*.gpad.gz file for issue-237-mgi-test-pipeline and for the last release, so that @pgaudet can compare the differences.

dustine32 commented 2 years ago

@kltm Sorry, confusingly the test pipeline with the latest SynGO data is issue-238-wormbase-test-pipeline (not the mgi one).

kltm commented 2 years ago

@pgaudet I've emailed you a pair of files as discussed. As a note to @dustine32 , they were generated with:

zgrep -i [[:space:]]SynGO[[:space:]] noctua_*.gpad.gz
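
For anyone who would rather do the same filtering in a script, here is a rough Python equivalent of that command. It is a sketch, not the code actually used: it keeps every line in which `SynGO` (case-insensitive) has a whitespace character on both sides, which is approximately what the `zgrep` pattern matches. `syngo_lines` is a hypothetical name.

```python
import gzip
import re

# Whitespace, "SynGO" in any letter case, whitespace.
SYNGO_FIELD = re.compile(r"\sSynGO\s", re.IGNORECASE)


def syngo_lines(gpad_gz_path):
    """Yield lines of a gzipped GPAD file that mention SynGO as a whole field."""
    with gzip.open(gpad_gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Strip the newline first so it cannot count as the trailing
            # whitespace; grep also matches against the line without it.
            line = line.rstrip("\n")
            if SYNGO_FIELD.search(line):
                yield line
```

Because the pattern requires whitespace on both sides, a value that merely starts with `SynGO` followed by other characters is not picked up.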

pgaudet commented 2 years ago

Hi @kltm @dustine32

I checked both the old and the new SynGO data sets; everything looks OK to me too.

Summary:

|   | Old | New |
| -- | -- | -- |
| Number of annotations | 26371 | 43722 |
| Distinct ECO | 31 | 36 |
| Distinct GO | 235 | 256 |
| Distinct PMID | 1110 | 1467 |
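
The four numbers in this summary could be recomputed from GPAD lines with something like the following. This is a sketch, not the method actually used for the comparison: the column positions (GO ID in column 4, reference in column 5, ECO code in column 6) are an assumption based on the GPAD 1.x layout and should be checked against the files being compared. `gpad_summary` is a hypothetical name.

```python
def gpad_summary(lines, go_col=3, ref_col=4, eco_col=5):
    """Summarise tab-separated GPAD lines (column indexes are 0-based).

    Blank lines and header lines starting with "!" are skipped. A reference
    field may hold several "|"-separated IDs; only "PMID:" ones are counted.
    """
    annotation_count = 0
    go_ids, eco_ids, pmids = set(), set(), set()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        annotation_count += 1
        go_ids.add(cols[go_col])
        eco_ids.add(cols[eco_col])
        pmids.update(
            ref for ref in cols[ref_col].split("|") if ref.startswith("PMID:")
        )
    return {
        "annotations": annotation_count,
        "distinct_eco": len(eco_ids),
        "distinct_go": len(go_ids),
        "distinct_pmid": len(pmids),
    }


# Toy lines for illustration only (not real annotations).
toy_lines = [
    "!gpa-version: 1.1",
    "DB\tobj1\tpart_of\tGO:0000001\tPMID:1\tECO:0000314",
    "DB\tobj2\tpart_of\tGO:0000001\tPMID:1|PMID:2\tECO:0000315",
    "",
]
summary = gpad_summary(toy_lines)
```

Running this once on the old file and once on the new one gives the two columns of the table.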

We can go ahead and release the data.

Thanks, Pascale

pgaudet commented 2 years ago

Ready to close

kltm commented 2 years ago

While we have tested this specific load, what should the ordered SOP be moving forward, and do we want to tie it to a manual process or to a pipeline?

kltm commented 2 years ago

Putting this on hold until we can create a mini-project for this.