geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

gorule-0000027 misses some invalid ID in the with/field #2063

Closed pgaudet closed 3 months ago

pgaudet commented 1 year ago

Hello,

@alexsign reported that some 'with' data in the exported Noctua GPADs contain "MGI" rather than "MGI:MGI". The rule at https://github.com/geneontology/go-site/blob/master/metadata/rules/gorule-0000027.md mentions that all db prefixes should be found in the dbxref file.

Note that the rule states:

In all cases, the prefix MUST be in db-xrefs.yaml. The prefix SHOULD be identical (case-sensitive match) to the database field. If it does not match then it MUST be identical (case-sensitive) to one of the synonyms.

However, for MGI the database field is MGI, not MGI:MGI.

@kltm do we need to change the dbxref to align with this?
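For illustration, the prefix requirement quoted above could be sketched roughly as follows. This is a minimal sketch, not the ontobio implementation: `DBXREFS` is a hand-written stand-in for db-xrefs.yaml (the synonym shown is made up for the example), and `check_prefix` is a hypothetical helper name.

```python
# Hypothetical stand-in for a few db-xrefs.yaml entries (not the real file contents).
DBXREFS = [
    {"database": "MGI", "synonyms": []},
    {"database": "UniProtKB", "synonyms": ["UniProt"]},  # synonym is illustrative only
    {"database": "PMID", "synonyms": []},
]


def check_prefix(prefix, dbxrefs=DBXREFS):
    """Classify a prefix using case-sensitive comparisons only.

    Returns "ok" if it equals a database field, "synonym" if it equals
    one of the synonyms, and "unknown" otherwise.
    """
    for entry in dbxrefs:
        if prefix == entry["database"]:
            return "ok"
    for entry in dbxrefs:
        if prefix in entry.get("synonyms", []):
            return "synonym"
    return "unknown"
```

Under these assumptions, `check_prefix("MGI")` is accepted outright, while a misspelled or differently cased prefix such as `mgi` is rejected.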

pgaudet commented 1 year ago

Assigning @kltm because we need your input to proceed with this.

balhoff commented 1 year ago

Isn't the prefix MGI? And the local ID values themselves also contain MGI:. Like this: (MGI:)(MGI:96182). So if the second MGI is missing, it's not a dbxrefs file problem, but a problem with the software or data? Sorry if I'm jumping into something without context!
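To make the (MGI:)(MGI:96182) reading concrete, here is a small sketch that splits an identifier on its first colon only. The helper names `split_curie` and `mgi_local_id_ok` are hypothetical, and the check encodes the convention described in this comment rather than anything taken from the pipeline code.

```python
def split_curie(identifier):
    """Split on the first colon: the part before it is the prefix,
    everything after it (including further colons) is the local ID."""
    prefix, sep, local_id = identifier.partition(":")
    if not sep:
        raise ValueError(f"no prefix separator in {identifier!r}")
    return prefix, local_id


def mgi_local_id_ok(identifier):
    """Assumed convention: an MGI local ID itself starts with 'MGI:'."""
    prefix, local_id = split_curie(identifier)
    return prefix == "MGI" and local_id.startswith("MGI:")
```

With this split, `MGI:MGI:96182` yields the prefix `MGI` and the local ID `MGI:96182`, whereas `MGI:96182` yields the local ID `96182` and fails the check even though its prefix is valid.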

kltm commented 1 year ago

Re: "software or data?" The answer is both: the data is wrong according to our standards and we are not fixing it. In a perfect world, the first is not true and the failure of the second is not necessary. But, alas... IIRC, there is an issue about examining the IDs in the "with" column in the GORULES tracker somewhere. There should be some basic checking there, although we have never used the regexps that were added after our pipeline was established (IIRC, added by Tony later on to align metadata a little). MGI has always been a special case and, until we purge that historical choice from the data stream, it's something that we just have to deal with.

We'd have to look at the flow, but I believe all files (sans uniprot) pass through ontobio at some point and are parsed, so that would probably be the most expeditious place to catch things: python parse. Ideally, our internally produced files are not making the mistake when emitting data (i.e. minerva and PANTHER/PAINT), but as long as it doesn't make it out to end users, it doesn't matter too much. Unfortunately, that means that GO-CAM files /do/ get out as there is no QC occurring there--a running frustration.

I think that the best thing to do for the moment would be to:

Again, any TTL/GO-CAM issues are "invisible" to us for the time being, so it's better to err on the side of caution.

kltm commented 1 year ago

Noting too that the GPAD currently emitted by minerva is a bit between specs, IIRC. That makes it a little harder to define what should happen, but that's fine for the moment as long as it is internally consistent.

pgaudet commented 1 year ago

Noting that GOA filters out this data (i.e. 'with' values that have a single "MGI:" as the prefix).

pgaudet commented 1 year ago

Related or same as https://github.com/geneontology/go-site/issues/1218

pgaudet commented 1 year ago

From the test GAF, tests #4-9 are not failing.

pgaudet commented 11 months ago

@mugitty It looks like at least the namespace of the 'with' (GAF column 8) is checked in gorule-0000001 (GORULE_TEST:0000001-19)

pgaudet commented 11 months ago

So we should define exactly what is checked in gorule-0000001 and narrow the scope of gorule-0000027

GORULE_TEST:0000027-1 GORULE_TEST:0000027-2 GORULE_TEST:0000027-3 GORULE_TEST:0000027-8 are failing gorule-0000001

pgaudet commented 3 months ago

Now gorule-0000027 picks up tests 1, 3, and 4:

! FAILS GORULE:0000027 - TEST 1 - Prefix not in /db-xrefs.yaml
UniPotKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:23072806 IDA P GORULE_TEST:0000027-1 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

! FAILS GORULE:0000027 - TEST 3 - Bad reference syntax
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:PMID:14561399 IDA P GORULE_TEST:0000027-3 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

! FAILS GORULE:0000027 - TEST 4 - Bad reference syntax
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:unpublished IDA P GORULE_TEST:0000027-4 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

but not 2, 5, and 6:

! FAILS GORULE:0000027 - TEST 2 - Assigned_by not in /groups.yaml
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:23072806 IDA P GORULE_TEST:0000027-2 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 SGDDB

! FAILS GORULE:0000027 - TEST 5 - Bad reference syntax
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID: IDA P GORULE_TEST:0000027-5 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

OK, this is in the scope of GORULE-0000001 since there is no value at all after the namespace.

! FAILS GORULE:0000027 - TEST 6 - Bad reference syntax
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:0. IDA P GORULE_TEST:0000027-6 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central

Should this have been picked up? The ID syntax is:

database: PMID
id_syntax: '[0-9]+'
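One possible explanation, which the thread does not confirm, is that the pattern is applied as a prefix match rather than against the whole local ID. The sketch below contrasts the two behaviours using the `id_syntax` quoted above; the helper names are hypothetical, and nothing here is taken from the actual pipeline code.

```python
import re

ID_SYNTAX = "[0-9]+"  # the PMID id_syntax quoted above


def prefix_match(local_id):
    # re.match anchors only at the start, so trailing characters are ignored.
    return re.match(ID_SYNTAX, local_id) is not None


def full_match(local_id):
    # re.fullmatch requires the entire local ID to satisfy the pattern.
    return re.fullmatch(ID_SYNTAX, local_id) is not None
```

With these helpers, `unpublished` and `PMID:14561399` fail both checks, while `0.` fails only the full match. That pattern would be consistent with tests 3 and 4 being caught and test 6 slipping through, but whether the pipeline matches this way would need to be verified.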

pgaudet commented 3 months ago

GORULE-0000027 is also picking up tests that I was not expecting

GORULE_TEST:0000001-6
WARNING - Invalid identifier:GORULE:0000027: X not found in list of database names in dbxrefs--
PomBase SPAC25B8.17 ypf1 is_active_in GO:0005634 GO_REF:0000024 ISO SGD:S000001583 C GORULE_TEST:0000001-6 intramembrane aspartyl protease of the perinuclear ER membrane Ypf1 (predicted) ppp81 protein taxon:4896 3/5/15 PomBase part_of(X:1)

pgaudet commented 3 months ago

Moved the test for gorule-0000027-5 to the test for gorule-0000001-29 (and renamed gorule-0000027-6 to gorule-0000027-5 to avoid gaps).

pgaudet commented 3 months ago

All tests are failing as expected.