geneontology / noctua-models

This is the data repository for the models created and edited with the Noctua tool stack for GO.
http://noctua.geneontology.org/
Creative Commons Attribution 4.0 International

Explore SGD's pre-import and post-import GPADs to understand the origin of the 10% annotation increase #271

Closed kltm closed 6 months ago

kltm commented 7 months ago

~Explore SGD's pre-import and post-import GPADs to understand the origin of the 10% annotation increase. The theory here is that it is likely that a small number of ontology issues could explain the difference.~

~The best files for comparison at this moment (while the pipeline creates a more recent batch that should only differ in an ORCID fix) are at http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/ . Namely:~

~http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/pre_import_sgd.gpad (GPAD 2.0, before minerva ingest)~ ~http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/post_import_sgd.gpad (GPAD ~1.1, export from minerva)~

See lower down at: https://github.com/geneontology/noctua-models/issues/271#issuecomment-1945342934 (Also see https://github.com/geneontology/minerva/issues/539)

kltm commented 7 months ago

I would encourage people to start with the files above, as I think I made a mistake. Just personally poking around, looking through the col4s, sorting, uniqing, and diffing them:

cat pre_import_sgd.gpad | grep -v '^!' | cut -f 4 | sort | uniq > pre_terms.txt
cat post_import_sgd.gpad | grep -v '^!' | cut -f 4 | sort | uniq > post_terms.txt
diff pre_terms.txt post_terms.txt | grep '>' | cut -f 2 -d ' ' > exclusive_post_terms.txt 
diff pre_terms.txt post_terms.txt | grep '<' | cut -f 2 -d ' ' > exclusive_pre_terms.txt 

Results: http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/exclusive_pre_terms.txt (terms that exclusively appear before the import) http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/exclusive_post_terms.txt (terms that exclusively appear after the import)
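The same term-set comparison can also be written with `comm`, which avoids parsing `diff` output. This is only a sketch of that alternative: the helper names `gpad_terms` and `exclusive_terms` are made up for illustration, and it assumes (as the commands above do) that column 4 holds the GO term in both files.

```shell
# Sketch only: gpad_terms and exclusive_terms are hypothetical helper names.

# Print the sorted, unique values of column 4 (the GO term) of a GPAD file,
# skipping "!" header lines.
gpad_terms() {
    grep -v '^!' "$1" | cut -f 4 | sort -u
}

# Print the terms that occur in the first GPAD file but not in the second.
exclusive_terms() {
    a=$(mktemp) && b=$(mktemp) || return 1
    gpad_terms "$1" > "$a"
    gpad_terms "$2" > "$b"
    comm -23 "$a" "$b"    # -23 keeps only lines unique to the first input
    rm -f "$a" "$b"
}
```

For example, `exclusive_terms post_import_sgd.gpad pre_import_sgd.gpad` should list the terms that only appear after the import, and swapping the arguments lists the terms that only appear before it.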

kltm commented 7 months ago

Tagging @cmungall @pgaudet @suzialeksander @dustine32 @sierra-moxon

cmungall commented 7 months ago

It's not surprising that the two files are different, because the post file includes models that were created de novo in Noctua and were never imported into Protein2GO.

For example, this one: http://noctua.geneontology.org/editor/graph/gomodel:600ced8500001127?

Was manually created by Stacia in Noctua in 2021.

So this is naturally in the complete set of SGD annotations that get exported from Noctua. However, if the goal is to compare pre and post, then the exact same annotations need to be compared, and models that were manually curated in Noctua need to be subtracted out for comparison purposes.

kltm commented 7 months ago

A filter can be added then by subtracting the noctua_sgd.gpad set from the current release (http://current.geneontology.org/products/upstream_and_raw_data/noctua_sgd-src.gpad.gz), which would give the "addition set".
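One possible reading of that filter, as a rough sketch: collect the model IDs that appear in the Noctua-native file and drop every export line that carries one of them. The helper name `drop_noctua_native` is hypothetical, and the sketch assumes both files record the model in a `noctua-model-id=gomodel:...` annotation property, as the export lines quoted in this thread do. The downloaded `.gz` file would need to be uncompressed first.

```shell
# Sketch only: drop_noctua_native is a hypothetical helper name.
# $1: GPAD export to filter
# $2: GPAD of Noctua-native annotations (already uncompressed)
drop_noctua_native() {
    ids=$(mktemp) || return 1
    # Collect the distinct model IDs found in the Noctua-native file.
    grep -o 'noctua-model-id=gomodel:[^|[:space:]]*' "$2" | sort -u > "$ids"
    # Keep only export lines that mention none of those model IDs.
    # -F: fixed strings; -w: do not match an ID that is a prefix of a longer one.
    grep -v -w -F -f "$ids" "$1"
    rm -f "$ids"
}
```

Usage would look like `drop_noctua_native post_import_sgd.gpad noctua_sgd-src.gpad > post_import_only.gpad`, where the output file name is just an example.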

cmungall commented 7 months ago

There are also some other odd things going on with timing.

@kltm can you provide more of a timeline for these files?

Here is GO:0043044 in the post file:

grep GO:0043044 post_import_sgd.gpad
SGD S000000966  enables GO:0140658  PMID:33174727   ECO:0000314         20210706    SGD part_of(GO:0043044) contributor=https://orcid.org/0000-0001-5472-917X|noctua-model-id=gomodel:600ced8500001127|model-state=production
SGD S000000966  involved_in GO:0043044  PMID:33174727   ECO:0000314         20210706    SGD     contributor=https://orcid.org/0000-0001-5472-917X|noctua-model-id=gomodel:600ced8500001127|model-state=production

This comes from the Noctua-native model I mentioned above. Note the annotation is to GO:0043044, which was merged into GO:0006338 in 2021.

The current model has been successfully migrated:

http://noctua.geneontology.org/download/gomodel:600ced8500001127/gpad

I believe the model file was migrated to the new term some time ago (on a flight w very slow wifi so can't check), but the history should be here: https://github.com/geneontology/noctua-models/blob/master/models/600ced8500001127.ttl

so how is it possible an annotation to GO:0043044 made its way into the post file?

cmungall commented 7 months ago

Here is the commit, which was in April ~2024~ 2022 https://github.com/geneontology/noctua-models/commit/7725dbe051c9688f94e0f2527559cbe1ee355a14

This replaced GO:0043044 with GO:0006338.

kltm commented 7 months ago

A lot of scrambling around in the last week, so let's soft reset this all:

(we took some shortcuts as we didn't have some proper test runs, so wrong files may have been grabbed; Suzi's initial finds, however, were from the pseudo-GPAD produced by the GPAD output workbench on the newly live models, and our initial focus was at that end)

dustine32 commented 6 months ago

the gpad that was used to produce the imported TTLs

@kltm Here is the permalink to the file in GH: https://raw.githubusercontent.com/geneontology/sgd-go-cams/bda1b9d21b830f91882081b29a8f5f0b07fbc631/products/go_cam_sgd_valid.gpad

kltm commented 6 months ago

Okay, resetting this all, I've created a new export file with:

sh ./local/src/git/minerva/minerva-cli/bin/minerva-cli.sh --import-owl-models -f ~/local/src/git/sgd-go-cams/models -j /tmp/blazegraph.jnl
mkdir -p /tmp/legacy/gpad && MINERVA_CLI_MEMORY=8G ./local/src/git/minerva/minerva-cli/bin/minerva-cli.sh --lego-to-gpad-sparql --ontology https://current.geneontology.org/ontology/extensions/go-lego.owl --ontojournal ontojournal.jnl -i /tmp/blazegraph.jnl --gpad-output /tmp/legacy/gpad
cat /tmp/legacy/gpad/*.gpad | grep -v '^!' > /tmp/sgd_export.gpad

The pre and post import files are now at: http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/sgd_import.gpad http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/sgd_export.gpad

kltm commented 6 months ago

cat sgd_import.gpad | grep -v '^!' | cut -f 4 | sort | uniq > pre_terms.txt
cat sgd_export.gpad | grep -v '^!' | cut -f 4 | sort | uniq > post_terms.txt
diff pre_terms.txt post_terms.txt | grep '>' | cut -f 2 -d ' ' > exclusive_post_terms.txt
diff pre_terms.txt post_terms.txt | grep '<' | cut -f 2 -d ' ' > exclusive_pre_terms.txt

kltm commented 6 months ago

wc -l *.gpad
   55197 sgd_export.gpad
   50616 sgd_import.gpad

pgaudet commented 6 months ago

Hi @kltm Thanks for sharing these.

The EXPORT file contains 4,583 more lines than the IMPORT file. There are 2,055 GO terms in extensions in the import file, and the unfolding step of the GO pipeline instantiates these as annotations, so as far as I can tell the 'real' diff is 2,528. Inferences account for only 29 additional annotations (see below).
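A count of GO terms in extensions could be reproduced along these lines. This is a sketch under assumptions: `count_extension_terms` is a made-up helper name, and it assumes the annotation extensions sit in column 11 of the GPAD file and that every GO ID there has the usual `GO:` plus seven digits form.

```shell
# Sketch only: count_extension_terms is a hypothetical helper name.
# Counts GO IDs found in column 11, assumed here to be the annotation
# extension column; "!" header lines are skipped.
count_extension_terms() {
    grep -v '^!' "$1" | cut -f 11 | grep -o 'GO:[0-9]\{7\}' | wc -l | tr -d ' '
}
```

Running `count_extension_terms sgd_import.gpad` gives the number to subtract from the raw line difference, which is the 4,583 - 2,055 = 2,528 step described above.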

Calculation of inferences: there are only 10 IDs in the EXPORT file that do not appear anywhere in the IMPORT file. Here are the GO terms, with the number of occurrences in the EXPORT file:

| GO ID | Label | Count in EXPORT file |
|---|---|---|
| GO:0022414 | reproductive process | 14 |
| GO:0042918 | alkanesulfonate transport | 1 |
| GO:0042959 | alkanesulfonate transmembrane transporter activity | 1 |
| GO:0061425 | positive regulation of ethanol catabolic process by positive regulation of transcription from RNA polymerase II promoter | 2 |
| GO:0071705 | nitrogen compound transport | 1 |
| GO:0072337 | modified amino acid transport | 3 |
| GO:1900068 | negative regulation of cellular response to alkaline pH | 3 |
| GO:1900070 | negative regulation of cellular hyperosmotic salinity response | 2 |
| GO:1900072 | positive regulation of sulfite transport | 1 |
| GO:1903047 | mitotic cell cycle process | 1 |

The 'worst' one is reproductive process, but I have removed the logical definition, so this should go away.
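A tally like the one above can be approximated from the command line. The following is a sketch, not the procedure actually used here: `new_term_counts` is a hypothetical helper name, and it assumes column 4 holds the annotated GO term and that "anywhere in the IMPORT file" means any column of any non-header line.

```shell
# Sketch only: new_term_counts is a hypothetical helper name.
# $1: import GPAD; $2: export GPAD.
new_term_counts() {
    seen=$(mktemp) || return 1
    # Every GO ID mentioned anywhere in the import file, in any column.
    grep -v '^!' "$1" | grep -o 'GO:[0-9]\{7\}' | sort -u > "$seen"
    # Export annotations (column 4) whose term is absent from that set,
    # tallied per term.
    grep -v '^!' "$2" | cut -f 4 | grep -v -x -F -f "$seen" | sort | uniq -c
    rm -f "$seen"
}
```

`new_term_counts sgd_import.gpad sgd_export.gpad` prints one line per term, with the occurrence count first.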

cmungall commented 6 months ago

These reflect broader problems with GO rather than anything to do with the process per se.

Let's take nitrogen compound transport. What an awful, useless term. If not obsoleted, it should at least have a do-not-annotate (which should block propagation). There are 78894 chemical entities, big and small, classified under CHEBI:51143. I'm surprised we have direct annotations to it, and even more surprised to see IBAs.

If we look at the source publication for many of these (https://amigo.geneontology.org/amigo/reference/PMID:24842606), we see that the curator clearly wanted the more useful protoporphyrin transport, but we didn't have this, so they chose the most specific term available and then added an extension. Then, when it gets propagated by IBA, only the useless nitrogen compound transport is propagated.

I know this seems a bit off topic, but this is common in GO: we attack the symptoms rather than the causes, which is much more expensive.

Others are coming from F->P. E.g.

id: GO:0042959
name: alkanesulfonate transmembrane transporter activity
...
relationship: part_of GO:0042918 {http://purl.org/dc/terms/source="GO_REF:0000090"} ! alkanesulfonate transport

This will be addressed by the current refactoring.

So overall my opinion is that these are minor additions that are valid yet trivial and will eventually disappear as the ontology improves. Of course, we must still decouple inference from conversion.

suzialeksander commented 6 months ago

@suzialeksander will make ontology tickets for the remaining terms to be tagged do_not_annotate

suzialeksander commented 6 months ago

https://github.com/geneontology/go-ontology/issues/27254

suzialeksander commented 6 months ago

Origin of the increase seems to be identified. SGD is manually fixing some annotations that didn't make the move, but after the above ticket there don't seem to be any other additional annotations, just "unfolded" extensions.