Closed: kltm closed this issue 8 months ago
I would encourage people to start with the files above, as I think I made a mistake. Just personally poking around: pulling column 4 (the GO term) out of each file, then sorting, uniquing, and diffing:
cat pre_import_sgd.gpad | grep -v '^!' | cut -f 4 | sort | uniq > pre_terms.txt
cat post_import_sgd.gpad | grep -v '^!' | cut -f 4 | sort | uniq > post_terms.txt
diff pre_terms.txt post_terms.txt | grep '>' | cut -f 2 -d ' ' > exclusive_post_terms.txt
diff pre_terms.txt post_terms.txt | grep '<' | cut -f 2 -d ' ' > exclusive_pre_terms.txt
Results:
http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/exclusive_pre_terms.txt (terms that appear only before the import)
http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/exclusive_post_terms.txt (terms that appear only after the import)
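For anyone repeating this, a possible alternative to the diff/grep pipeline above is `comm`, which compares two sorted lists directly and avoids parsing diff output. This is only a sketch: the helper name `exclusive_terms` is made up for illustration, and it assumes column 4 holds the GO term in both files, as in the commands above.

```shell
# exclusive_terms A.gpad B.gpad
# Hypothetical helper: prints the column-4 terms found in B.gpad but not in A.gpad.
exclusive_terms() {
  a_terms=$(mktemp)
  b_terms=$(mktemp)
  # Drop '!' header lines, keep column 4, sort uniquely in the C locale
  # (comm expects both inputs sorted with the same collation).
  grep -v '^!' "$1" | cut -f 4 | LC_ALL=C sort -u > "$a_terms"
  grep -v '^!' "$2" | cut -f 4 | LC_ALL=C sort -u > "$b_terms"
  # -13 suppresses "only in A" and "in both", leaving "only in B".
  LC_ALL=C comm -13 "$a_terms" "$b_terms"
  rm -f "$a_terms" "$b_terms"
}

# Usage with the files named above:
#   exclusive_terms pre_import_sgd.gpad post_import_sgd.gpad > exclusive_post_terms.txt
#   exclusive_terms post_import_sgd.gpad pre_import_sgd.gpad > exclusive_pre_terms.txt
```

Swapping the two arguments gives the other direction, so one helper covers both output files.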
Tagging @cmungall @pgaudet @suzialeksander @dustine32 @sierra-moxon
It's not surprising that the two files are different, because the post file includes models that were created de novo in Noctua and were never imported into Protein2GO.
For example, this one was manually created by Stacia in Noctua in 2021: http://noctua.geneontology.org/editor/graph/gomodel:600ced8500001127?
So it is naturally in the complete set of SGD annotations that get exported from Noctua. However, if the goal is to compare pre and post, then exactly the same annotations need to be compared, and models that were manually curated in Noctua need to be subtracted out for comparison purposes.
A filter can be added then by subtracting the noctua_sgd.gpad set from the current release (http://current.geneontology.org/products/upstream_and_raw_data/noctua_sgd-src.gpad.gz), which would give the "addition set".
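One way such a subtraction could be sketched, under some loud assumptions: both files are already decompressed, both use the same GPAD version, and the first six columns are enough to identify an annotation (so differences in date or properties are ignored). The function name `subtract_annotations` is invented for this example and is not part of any GO tooling.

```shell
# subtract_annotations FULL.gpad NATIVE.gpad
# Hypothetical helper: prints the lines of FULL.gpad whose first six columns
# do not match any line in NATIVE.gpad.
subtract_annotations() {
  awk -F '\t' '
    /^!/ { next }                                  # skip header lines in both files
    { key = $1 FS $2 FS $3 FS $4 FS $5 FS $6 }     # ASSUMPTION: columns 1-6 identify an annotation
    FILENAME == ARGV[1] { native[key] = 1; next }  # first file read: remember Noctua-native keys
    !(key in native)                               # second file read: print lines not seen above
  ' "$2" "$1"
}

# Usage (file names are placeholders):
#   subtract_annotations full_export.gpad noctua_native.gpad > comparable_set.gpad
```

If the two files are in different GPAD versions, the key columns would need to be normalized first, which this sketch does not attempt.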
There are also some other odd things going on with timing.
@kltm can you provide more of a timeline for these files?
Here is GO:0043044 in the post file:
grep GO:0043044 post_import_sgd.gpad
SGD S000000966 enables GO:0140658 PMID:33174727 ECO:0000314 20210706 SGD part_of(GO:0043044) contributor=https://orcid.org/0000-0001-5472-917X|noctua-model-id=gomodel:600ced8500001127|model-state=production
SGD S000000966 involved_in GO:0043044 PMID:33174727 ECO:0000314 20210706 SGD contributor=https://orcid.org/0000-0001-5472-917X|noctua-model-id=gomodel:600ced8500001127|model-state=production
This comes from the Noctua-native model I mentioned above. Note the annotation is to GO:0043044, which was merged into GO:0006338 in 2021:
The current model has been successfully migrated:
http://noctua.geneontology.org/download/gomodel:600ced8500001127/gpad
I believe the model file was migrated to the new term some time ago (on a flight with very slow wifi, so I can't check), but the history should be here: https://github.com/geneontology/noctua-models/blob/master/models/600ced8500001127.ttl
So how is it possible that an annotation to GO:0043044 made its way into the post file?
Here is the commit, which was in April ~2024~ 2022 https://github.com/geneontology/noctua-models/commit/7725dbe051c9688f94e0f2527559cbe1ee355a14
This replaced GO:0043044 with GO:0006338.
A lot of scrambling around in the last week, so let's soft reset this all:
(we took some shortcuts as we didn't have some proper test runs, so wrong files may have been grabbed; Suzi's initial finds, however, were from the pseudo-GPAD produced by the GPAD output workbench on the newly live models, and our initial focus was at that end)
the gpad that was used to produce the imported TTLs
@kltm Here is the permalink to the file in GH: https://raw.githubusercontent.com/geneontology/sgd-go-cams/bda1b9d21b830f91882081b29a8f5f0b07fbc631/products/go_cam_sgd_valid.gpad
Okay, resetting this all, I've created a new export file with:
sh ./local/src/git/minerva/minerva-cli/bin/minerva-cli.sh --import-owl-models -f ~/local/src/git/sgd-go-cams/models -j /tmp/blazegraph.jnl
mkdir -p /tmp/legacy/gpad && MINERVA_CLI_MEMORY=8G ./local/src/git/minerva/minerva-cli/bin/minerva-cli.sh --lego-to-gpad-sparql --ontology https://current.geneontology.org/ontology/extensions/go-lego.owl --ontojournal ontojournal.jnl -i /tmp/blazegraph.jnl --gpad-output /tmp/legacy/gpad
cat /tmp/legacy/gpad/*.gpad | grep -v '^!' > /tmp/sgd_export.gpad
The pre and post import files are now at:
http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/sgd_import.gpad
http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/sgd_export.gpad
cat sgd_import.gpad | grep -v '^!' | cut -f 4 | sort | uniq > pre_terms.txt
cat sgd_export.gpad | grep -v '^!' | cut -f 4 | sort | uniq > post_terms.txt
diff pre_terms.txt post_terms.txt | grep '>' | cut -f 2 -d ' ' > exclusive_post_terms.txt
diff pre_terms.txt post_terms.txt | grep '<' | cut -f 2 -d ' ' > exclusive_pre_terms.txt
wc -l *.gpad
55197 sgd_export.gpad
50616 sgd_import.gpad
Hi @kltm Thanks for sharing these.
The EXPORT file contains 4,583 more lines than the IMPORT file. There are 2,055 GO terms in extensions in the import file, and the unfolding step of the GO pipeline instantiates these as annotations, so as far as I can tell the 'real' diff is 2,528. Inferences only account for 29 additional annotations (see below).
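A rough way to reproduce the extension count, in case it is useful. This is a sketch only: it assumes annotation extensions live in column 11 and that every GO ID there is unfolded into one annotation, neither of which is confirmed here. The helper name `count_extension_terms` is made up.

```shell
# count_extension_terms FILE.gpad
# Hypothetical helper: counts GO IDs appearing in column 11 (ASSUMED to be the
# annotation-extension column), ignoring '!' header lines.
count_extension_terms() {
  grep -v '^!' "$1" | cut -f 11 | grep -o 'GO:[0-9]\{7\}' | awk 'END { print NR }'
}

# Usage:
#   count_extension_terms sgd_import.gpad

# The subtraction quoted above: extra lines minus extension terms.
echo $(( 4583 - 2055 ))   # prints 2528
```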
Calculation of inferences
There are only 10 IDs that I find in the EXPORT file that are not anywhere in the IMPORT file; here are the GO terms, with the number of occurrences in the EXPORT file:
GOID | LABEL | COUNT in EXPORT FILE |
---|---|---|
GO:0022414 | reproductive process | 14 |
GO:0042918 | alkanesulfonate transport | 1 |
GO:0042959 | alkanesulfonate transmembrane transporter activity | 1 |
GO:0061425 | positive regulation of ethanol catabolic process by positive regulation of transcription from RNA polymerase II promoter | 2 |
GO:0071705 | nitrogen compound transport | 1 |
GO:0072337 | modified amino acid transport | 3 |
GO:1900068 | negative regulation of cellular response to alkaline pH | 3 |
GO:1900070 | negative regulation of cellular hyperosmotic salinity response | 2 |
GO:1900072 | positive regulation of sulfite transport | 1 |
GO:1903047 | mitotic cell cycle process | 1 |
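A table like this could be regenerated along the following lines. This is a guess at a workable method, not necessarily how the numbers above were produced: it treats any `GO:` ID anywhere on a non-header line as an occurrence, and the helper name `new_ids_with_counts` is invented for the example.

```shell
# new_ids_with_counts IMPORT.gpad EXPORT.gpad
# Hypothetical helper: prints "GO_ID<TAB>count" for every GO ID that occurs
# somewhere in EXPORT.gpad but nowhere in IMPORT.gpad.
new_ids_with_counts() {
  import_ids=$(mktemp)
  tab=$(printf '\t')
  # All distinct GO IDs mentioned anywhere in the import file.
  grep -v '^!' "$1" | grep -o 'GO:[0-9]\{7\}' | LC_ALL=C sort -u > "$import_ids"
  # Count every GO ID occurrence in the export file, then keep only the IDs
  # that have no match in the import list (join -v 1 = unpaired lines of input 1).
  grep -v '^!' "$2" | grep -o 'GO:[0-9]\{7\}' | LC_ALL=C sort | uniq -c |
    awk '{ print $2 "\t" $1 }' |
    LC_ALL=C join -t "$tab" -v 1 - "$import_ids"
  rm -f "$import_ids"
}

# Usage:
#   new_ids_with_counts sgd_import.gpad sgd_export.gpad
```

Labels are not included in the output; they would have to be looked up in the ontology separately.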
The 'worst' one is reproductive process, but I removed the logical definition, so this should go away.
These reflect broader problems with GO rather than anything to do with the process per se.
Let's take nitrogen compound transport. What an awful, useless term. If not obsoleted, it should at least have a do-not-annotate (which should block propagation). There are 78894 chemical entities, big and small, classified under CHEBI:51143. I'm surprised we have direct annotations to it, and even more surprised to see IBAs.
If we look at the source publication for many of these (https://amigo.geneontology.org/amigo/reference/PMID:24842606), we see that the curator clearly wanted the more useful protoporphyrin transport, but we didn't have this, so they chose the most specific term available and then added an extension. Then, when it gets propagated by IBA, only the useless nitrogen compound transport is propagated.
I know this seems a bit off topic, but this is common in GO: we attack the symptoms rather than the causes, which is much more expensive.
Others are coming from F->P. E.g.
id: GO:0042959
name: alkanesulfonate transmembrane transporter activity
...
relationship: part_of GO:0042918 {http://purl.org/dc/terms/source="GO_REF:0000090"} ! alkanesulfonate transport
This will be addressed by the current refactoring.
So overall my opinion is that these are minor additions that are valid yet trivial and will eventually disappear as the ontology improves. Of course, we must still decouple inference from conversion.
@suzialeksander will make ontology tickets for the remaining terms to be tagged do_not_annotate.
Origin of increase seems to be identified. SGD is manually fixing some annotations that didn't make the move, but after the above ticket there don't seem to be other additional annotations, just "unfolded" extensions.
~Explore SGD's pre-import and post-import GPADs to understand the origin of the 10% annotation increase. The theory here is that it is likely that a small number of ontology issues could explain the difference.~
~The best files for comparison at this moment (while the pipeline creates a more recent batch that should only differ in an ORCID fix) are at http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/ . Namely:~
~http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/pre_import_sgd.gpad (GPAD 2.0, before minerva ingest)~ ~http://skyhook.berkeleybop.org/confinement-for-noctua-models-271/post_import_sgd.gpad (GPAD ~1.1, export from minerva)~
See lower down at: https://github.com/geneontology/noctua-models/issues/271#issuecomment-1945342934 (Also see https://github.com/geneontology/minerva/issues/539)