geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Clarify distinction between AGI_LocusCode and TAIR #1972

Open cmungall opened 1 year ago

cmungall commented 1 year ago

Continued from:

We have both prefixes registered (note that no other registry acknowledges AGI_LocusCode)

Currently the GAF/GPI file from TAIR has duplicative entries:

AGI_LocusCode   AT2G47400   CP12-1  enables GO:0005507  TAIR:Communication:501789215    ISS UniProtKB:A6Q0K5    F   AT2G47400   AT2G47400|CP12-1|CP12|CP12 domain-containing protein 1|T8I13.24 protein taxon:3702  20210401    UniProt     TAIR:locus:2065220
TAIR    locus:2065220   CP12-1  involved_in GO:0080153  PMID:21873635   IBA PANTHER:PTN002142411|TAIR:locus:2065220|TAIR:locus:2011676|TAIR:locus:2096009   P   Calvin cycle protein CP12-1, chloroplastic  UniProtKB:O22914|PTN002142446   protein taxon:3702  20180622    GO_Central

These are the same gene:

https://www.arabidopsis.org/servlets/TairObject?accession=Locus:2065220

The policy for GO is to have a single representative entry for each gene, and the GAF should always refer to that entry. Optionally, a specific isoform can be indicated in column 17 (and this will be the primary ID in the GPAD).

This is the current distribution:

2722 UniProtKB (all IBA, indicating we were not able to map to a TAIR gene/locus)
39138 TAIR
225444 AGI_LocusCode
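One way to find genes that break the single-representative policy is to look for loci annotated under more than one column 1 prefix. The following is a minimal sketch, not the GO pipeline's own check. It assumes that AGI_LocusCode rows carry the TAIR locus ID in column 17, as in the example row above; that has not been confirmed for every row. The function name `loci_with_mixed_prefixes` is hypothetical.

```python
from collections import defaultdict


def loci_with_mixed_prefixes(lines):
    """Return {TAIR locus ID: set of column 1 prefixes} for loci seen under more than one prefix.

    Assumption: rows whose column 1 is not TAIR carry a 'TAIR:locus:...' ID in
    column 17, as in the example row in this issue. Rows without one are skipped.
    """
    prefixes_by_locus = defaultdict(set)
    for line in lines:
        if line.startswith("!"):
            continue  # GAF header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue  # blank or malformed line
        db, object_id = cols[0], cols[1]
        if db == "TAIR":
            locus = f"TAIR:{object_id}"
        elif len(cols) >= 17 and cols[16].startswith("TAIR:locus:"):
            locus = cols[16]
        else:
            continue  # no locus ID available for this row
        prefixes_by_locus[locus].add(db)
    return {
        locus: prefixes
        for locus, prefixes in prefixes_by_locus.items()
        if len(prefixes) > 1
    }
```

Applied to the two example rows, this would report TAIR:locus:2065220 as appearing under both AGI_LocusCode and TAIR.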

pgaudet commented 1 year ago

Is this a request for TAIR to change their GAF?

kltm commented 1 year ago

@pgaudet @tberardini What would be the best forum to talk about this? How to proceed here is an open question.

tberardini commented 1 year ago

Is there a big problem in leaving things as they are?

2722 UniProtKB - constant work in progress to synchronize UniProt and TAIR mappings; coordination has been time-consuming
39138 TAIR - these are likely genetic loci, uncloned, so they cannot be assigned an AGI_LocusCode
225444 AGI_LocusCode - everything else

kltm commented 1 year ago

bbop@wok:/home/skyhook/release/annotations$ zcat tair.gaf.gz | grep -v '^!' | cut -f 1,7 | sort | uniq -c
  19388 AGI_LocusCode   HDA
    262 AGI_LocusCode   HEP
    185 AGI_LocusCode   IC
  21868 AGI_LocusCode   IDA
  61052 AGI_LocusCode   IEA
   4622 AGI_LocusCode   IEP
   4046 AGI_LocusCode   IGI
  17108 AGI_LocusCode   IMP
  24483 AGI_LocusCode   IPI
  37753 AGI_LocusCode   ISM
   8174 AGI_LocusCode   ISS
    633 AGI_LocusCode   NAS
  18351 AGI_LocusCode   ND
    866 AGI_LocusCode   RCA
   6653 AGI_LocusCode   TAS
  32884 TAIR    IBA
     24 TAIR    IDA
     10 TAIR    IEP
     61 TAIR    IGI
    526 TAIR    IMP
      3 TAIR    IPI
     11 TAIR    ISS
     78 TAIR    NAS
   5499 TAIR    ND
     42 TAIR    TAS
   2722 UniProtKB   IBA

pgaudet commented 1 year ago

Most TAIR-prefixed entries come from IBA annotations, which certainly do not correspond to unmapped loci.

@dustine32 will look into the mappings from UniProt back to TAIR IDs in the PAINT pipeline.

kltm commented 1 year ago

bbop@wok:/home/skyhook/release/annotations$ zcat tair.gaf.gz | grep -v '^!' | cut -f 1 | sort | uniq -c
 225444 AGI_LocusCode
  39138 TAIR
   2722 UniProtKB
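
For anyone who wants to reproduce the tallies above without the shell pipeline, here is a rough Python equivalent of `zcat tair.gaf.gz | grep -v '^!' | cut -f 1,7 | sort | uniq -c`. It is a sketch only. The function names are hypothetical, and it also skips lines with fewer than 7 columns, which the shell version does not.

```python
import gzip
from collections import Counter


def tally_prefix_evidence(lines):
    """Count GAF rows by (column 1 prefix, column 7 evidence code)."""
    counts = Counter()
    for line in lines:
        if line.startswith("!"):
            continue  # header/comment line, same as grep -v '^!'
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # blank or truncated row (extra guard, not in the shell version)
        counts[(cols[0], cols[6])] += 1
    return counts


def tally_gaf_file(path):
    """Open a gzipped GAF file and tally its rows."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return tally_prefix_evidence(handle)
```

Summing the counts over evidence codes for each prefix should give the per-prefix totals from the second command.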