geneontology / go-releases

Tasks and notes for monthly GO releases

MGI reference issues #28

Closed pgaudet closed 1 year ago

pgaudet commented 1 year ago

The mgi-gaf file produced by GO Central contains 181,055 annotations with PMID but no MGI reference (see attached file).

I cannot find a pattern in the PMID-only annotations: dates range from 1999 to 2023, and most, but not all, are sourced from MGI:

Source  Number of annotations with no MGI reference
MGI 170356
SynGO 10327
UniProt 367
WB 5
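A tally like the one above could be reproduced with a short script. This is only a sketch: it assumes the standard tab-separated GAF layout (column 6 is the pipe-separated reference field, column 15 is Assigned_By) and that "Source" in the table corresponds to Assigned_By. The function name `count_pmid_only_by_source` is hypothetical.

```python
from collections import Counter


def count_pmid_only_by_source(gaf_lines):
    """Count annotations that cite a PMID but no MGI reference, per source.

    Assumes GAF columns are tab-separated, with the reference field in
    column 6 and Assigned_By in column 15 (1-based).
    """
    counts = Counter()
    for line in gaf_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # malformed row; ignored in this sketch
        refs = cols[5].split("|")
        has_pmid = any(r.startswith("PMID:") for r in refs)
        has_mgi = any(r.startswith("MGI:") for r in refs)
        if has_pmid and not has_mgi:
            counts[cols[14]] += 1
    return counts
```

To run it over a file, pass an open file handle, for example `count_pmid_only_by_source(open(path))`, where `path` points at a local copy of the GAF.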

I also find papers where some annotations have the MGI internal reference ID but others do not (for example MGI:MGI:3695190|PMID:16845371).
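One way to list such papers is to group reference fields by PMID and flag those that appear both with and without an MGI ID. This is a rough sketch; the function name `pmids_with_mixed_references` is hypothetical, and the input is assumed to be the reference column values already extracted from the file.

```python
from collections import defaultdict


def pmids_with_mixed_references(reference_fields):
    """Return PMIDs seen both with and without an accompanying MGI reference."""
    # Maps each PMID to the set of observations {True, False}:
    # True if an MGI reference accompanied it, False if it did not.
    seen = defaultdict(set)
    for field in reference_fields:
        refs = [r for r in field.split("|") if r]
        has_mgi = any(r.startswith("MGI:") for r in refs)
        for r in refs:
            if r.startswith("PMID:"):
                seen[r].add(has_mgi)
    # A PMID is "mixed" when both True and False were observed.
    return sorted(pmid for pmid, flags in seen.items() if len(flags) == 2)
```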

@LiNiMGI @kltm what is the process for injecting these MGI internal IDs?

Thanks, Pascale

pgaudet commented 1 year ago

For example http://noctua.geneontology.org/download/gomodel:5825564b00000099/gpad

Annotations to PMID:19968565 only have the PMID, while annotations to PMID:11606467 also export MGI:MGI:2153596.

These are from the same date, so presumably the import process was the same?

ukemi commented 1 year ago

The import process was completely standardized. This discrepancy was noted and discussed by the GO group when the imports were done, but it has since fallen off the radar. We opted not to import the MGI reference identifiers into Noctua. Note that in the model above, no annotations contain the MGI identifiers in the reference field.

However, external annotations made by groups like Ruth's get cycled through the MGI import/export process. When MGI creates annotations, we include both our internal reference and the PMID/external_reference as a pipe-separated value:

MGI MGI:96785 Lhx2 acts_upstream_of_or_within GO:0007420 MGI:MGI:3050930|PMID:15295034 IMP MGI:MGI:1890208 P LIM homeobox protein 2 ap|apterous|Lh-2|LH2A protein_coding_gene taxon:10090 20060526 MGI

What you are seeing is the result of collating two separate annotation sources: ones from GO that don't have the MGI reference information and ones from MGI that do. At some point, we need to revisit reference objects in GO and how we handle/represent them.
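For illustration, the pipe-separated reference value in the MGI-created annotation above can be split into its internal and external parts roughly as follows. This is a sketch, not the MGI or GO Central pipeline; `split_reference_field` is a hypothetical helper, and it treats anything prefixed `MGI:` as the internal reference.

```python
def split_reference_field(field):
    """Split a pipe-separated reference field into (MGI refs, other refs)."""
    mgi_refs, other_refs = [], []
    for ref in field.split("|"):
        ref = ref.strip()
        if not ref:
            continue  # tolerate empty segments such as a trailing pipe
        if ref.startswith("MGI:"):
            mgi_refs.append(ref)
        else:
            other_refs.append(ref)
    return mgi_refs, other_refs


# Reference value taken from the Lhx2 annotation quoted in this thread.
mgi_refs, other_refs = split_reference_field("MGI:MGI:3050930|PMID:15295034")
print(mgi_refs, other_refs)
```

An annotation loaded from GO without the MGI reference would yield an empty first list, which is the case counted in the table at the top of this issue.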

pgaudet commented 1 year ago

Won't fix: once GO Central loads the data, we will mostly have only PMIDs.