Closed: pgaudet closed this 3 months ago
Assigning @kltm because we need your input to proceed with this.
Isn't the prefix MGI? And the local ID values themselves also contain MGI:. Like this: (MGI:)(MGI:96182). So if the second MGI is missing, it's not a dbxrefs file problem, but a problem with the software or data? Sorry if I'm jumping into something without context!
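To make the two-part structure concrete, here is a minimal Python sketch. It is not ontobio code, and `split_curie` is a hypothetical helper name. It splits an identifier on the first colon only, so a well-formed MGI identifier keeps the MGI: that belongs to its local ID.

```python
def split_curie(curie):
    """Split an identifier into (prefix, local_id) on the FIRST colon only."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        raise ValueError(f"no prefix separator in {curie!r}")
    return prefix, local_id


# Doubled form: the local ID itself starts with 'MGI:'.
print(split_curie("MGI:MGI:96182"))  # ('MGI', 'MGI:96182')
# Single form: the local ID has lost its 'MGI:' part.
print(split_curie("MGI:96182"))      # ('MGI', '96182')
```

In both cases the prefix is MGI, which is why a prefix-only lookup against the dbxrefs file cannot tell the two forms apart.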
Re: "software or data?" The answer is both: the data is wrong according to our standards and we are not fixing it. In a perfect world, the first is not true and the failure of the second is not necessary. But, alas... IIRC, there is an issue about examining the IDs in the "with" column in the GORULES tracker somewhere. There should be some basic checking there, although we have never used the regexps that were added after our pipeline was established (IIRC, added by Tony later on to align metadata a little). MGI has always been a special case and, until we purge that historical choice from the data stream, it's something that we just have to deal with.
We'd have to look at the flow, but I believe all files (sans uniprot) pass through ontobio at some point and are parsed, so that would probably be the most expeditious place to catch things: python parse. Ideally, our internally produced files are not making the mistake when emitting data (i.e. minerva and PANTHER/PAINT), but as long as it doesn't make it out to end users, it doesn't matter too much. Unfortunately, that means that GO-CAM files /do/ get out as there is no QC occurring there--a running frustration.
I think that the best thing to do for the moment would be to:
Again, any TTL/GO-CAM issues are "invisible" to us for the time being, so it's better to err on the side of caution.
Noting too that the GPAD currently emitted by minerva is a bit between specs, IIRC. That makes it a little harder to define what should happen, but that's fine for the moment as long as it is internally consistent.
Noting that GOA filters out this data (i.e. 'with' values that have a single "MGI:" as the prefix).
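For illustration only, a check of this kind could look like the sketch below. This is not GOA's or ontobio's implementation. The function names are hypothetical, and the assumption that an MGI local ID is "MGI:" followed by digits is mine, not taken from the dbxrefs file.

```python
import re

# Assumption: a well-formed MGI local ID is 'MGI:' followed by digits.
MGI_LOCAL_ID = re.compile(r"MGI:[0-9]+")


def with_values(with_field):
    """Split a GAF 'with' field on '|' (OR) and ',' (AND) into identifiers."""
    return [value for value in re.split(r"[|,]", with_field) if value]


def single_prefix_mgi(with_field):
    """Return 'with' identifiers that use the MGI prefix without the doubled form."""
    bad = []
    for value in with_values(with_field):
        prefix, _, local_id = value.partition(":")
        if prefix == "MGI" and not MGI_LOCAL_ID.fullmatch(local_id):
            bad.append(value)
    return bad


print(single_prefix_mgi("MGI:96182|UniProtKB:P34149"))  # ['MGI:96182']
print(single_prefix_mgi("MGI:MGI:96182"))               # []
```

Whether such values should be dropped, repaired, or only reported is a separate decision. The sketch only identifies them.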
Related or same as https://github.com/geneontology/go-site/issues/1218
From the test GAF, tests #4-9 are not failing.
@mugitty It looks like at least the namespace of the 'with' (GAF column 8) is checked in gorule-0000001 (GORULE_TEST:0000001-19)
So we should define exactly what is checked in gorule-0000001 and narrow the scope of gorule-0000027
GORULE_TEST:0000027-1 GORULE_TEST:0000027-2 GORULE_TEST:0000027-3 GORULE_TEST:0000027-8 are failing gorule-0000001
Now - gorule-0000027 picks up tests 1, 3 and 4
! FAILS GORULE:0000027 - TEST 1 - Prefix not in /db-xrefs.yaml
UniPotKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:23072806 IDA P GORULE_TEST:0000027-1 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central
! FAILS GORULE:0000027 - TEST 3 - Bad reference syntax
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:PMID:14561399 IDA P GORULE_TEST:0000027-3 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central
! FAILS GORULE:0000027 - TEST 4 - Bad reference syntax
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:unpublished IDA P GORULE_TEST:0000027-4 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central
but not 2, 5, and 6
! FAILS GORULE:0000027 - TEST 2 - Assigned_by not in /groups.yaml
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:23072806 IDA P GORULE_TEST:0000027-2 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 SGDDB
! FAILS GORULE:0000027 - TEST 5 - Bad referencesyntax
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID: IDA P GORULE_TEST:0000027-5 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central
OK, this is in the scope of GORULE-0000001, since there is no value at all after the namespace.
! FAILS GORULE:0000027 - TEST 6 - Bad reference syntax
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID:0. IDA P GORULE_TEST:0000027-6 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central
Should this have been picked up? The ID syntax is
database: PMID id_syntax: '[0-9]+'
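One possible explanation, which is a guess and has not been checked against the ontobio code, is that the pattern is applied as an unanchored match. In Python, `re.match` only requires the pattern to match at the start of the string, so "0." would pass, while `re.fullmatch` requires the whole local ID to match. The helper name `check` below is hypothetical.

```python
import re

ID_SYNTAX = "[0-9]+"  # the PMID id_syntax quoted above


def check(local_id):
    """Return (unanchored_ok, anchored_ok) for a local ID against ID_SYNTAX."""
    unanchored_ok = re.match(ID_SYNTAX, local_id) is not None
    anchored_ok = re.fullmatch(ID_SYNTAX, local_id) is not None
    return unanchored_ok, anchored_ok


# Local IDs taken from the test lines in this thread.
for local_id in ["23072806", "0.", "unpublished", ""]:
    print(repr(local_id), check(local_id))
# '23072806' (True, True)
# '0.' (True, False)
# 'unpublished' (False, False)
# '' (False, False)
```

If the rule does use an unanchored match, that would be consistent with test 6 (PMID:0.) slipping through while test 4 (PMID:unpublished) is caught.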
GORULE-0000027 is also picking up tests that I was not expecting
GORULE_TEST:0000001-6 WARNING - Invalid identifier:GORULE:0000027: X not found in list of database names in dbxrefs--
PomBase SPAC25B8.17 ypf1 is_active_in GO:0005634 GO_REF:0000024 ISO SGD:S000001583 C GORULE_TEST:0000001-6 intramembrane aspartyl protease of the perinuclear ER membrane Ypf1 (predicted) ppp81 protein taxon:4896 3/5/15 PomBase part_of(X:1)
UniProtKB O76187 darA enables GO:0005515 PMID:9802899 IPI UniProtKB:P34149 F GORULE_TEST:0000051-PASS1 Darlin darA protein taxon:44689 20100205 GO_Central has_input(GO:0003674)|occurs_in(CL:123456)
FB FBgn0011273 Acam part_of GO:0008023 FB:FBrf0193169|PMID:16790438 IDA C GORULE_TEST:0000061-1 Androcam ACaM|And|CG17769|CalB|Calmodulin-related 97A|Camr97A|androcalmodulin|androcam protein taxon:7227 20180501 GO_Central
MGI:1100518 Smad7 bla involved_in GO:0017015 MGI:MGI:3836072|PMID:18952608 IC GO:0060389 P GORULE_TEST:0000020-3 SMAD protein_coding_gene taxon:10090 20090211 GO_Central
UniProtKB P77335 hlyE located_in GO:0020002 GO_REF:0000044 IEA UniProtKB-SubCell:SL-0375 C GORULE_TEST:0000029-1 protein taxon:83333 20220807 GO_Central
UniProtKB P77335 hlyE located_in GO:0020002 GO_REF:0000044 IEA UniProtKB-SubCell:SL-0375 C GORULE_TEST:0000029-2 protein taxon:83333 20200507 GO_Central
UniProtKB P77335 hlyE located_in GO:0020002 GO_REF:0000045 IEA UniProtKB-SubCell:SL-0375 C GORULE_TEST:0000030-1 protein taxon:83333 20230607 GO_Central
! FAILS GORULE:0000027 - TEST 5 - Bad referencesyntax
UniProtKB Q9HC96 CAPN10 involved_in GO:0006921 PMID: IDA P GORULE_TEST:0000027-5 Calpain-10 CAPN10,KIAA1845 protein taxon:9606 20140213 GO_Central
This test was moved from gorule-0000027-5 to gorule-0000001-29, and gorule-0000027-6 was renamed to gorule-0000027-5 to avoid gaps.
All tests are failing as expected.
Hello,
@alexsign reported that some 'with' data in the exported Noctua GPADs contain "MGI" rather than "MGI:MGI". https://github.com/geneontology/go-site/blob/master/metadata/rules/gorule-0000027.md mentions that all db prefixes should be found in the dbxref file.
Note that the rule states
However, for MGI the database field is MGI, not MGI:MGI.
@kltm do we need to change the dbxref to align with this?