geneontology / amigo

AmiGO is the public interface for the Gene Ontology.
http://amigo.geneontology.org
BSD 3-Clause "New" or "Revised" License

Some genes are missing PTHR family info #546

Open pgaudet opened 5 years ago

pgaudet commented 5 years ago

Hello,

Looks like this one is not yet resolved but I cannot find any open ticket. See for example: http://amigo.geneontology.org/amigo/gene_product/UniProtKB:T1ESS4

This gene has IBA annotations, so technically it should belong to a family.

@kltm Can you please have a look?

Thanks, Pascale

cmungall commented 5 years ago

This was the previous ticket: https://github.com/geneontology/amigo/issues/532#issuecomment-430349876

kltm commented 5 years ago

Note that this seems to exist in the arbre set in the release.

sjcarbon@moiraine:/tmp/foo$:( grep -i -r "T1ESS4" *
PTHR12623/PTHR12623.arbre:AN94:HELRO|EnsemblGenome=HelroG162517|UniProtKB=T1ESS4;

kltm commented 5 years ago

Everything in family pthr12623 http://amigo.geneontology.org/amigo/search/annotation?q=*:*&fq=panther_family_label:%22ngfi-a%20binding%20protein%20pthr12623%22&sfq=document_category:%22annotation%22

kltm commented 5 years ago

The arbre line structure appears okay, and I found no collision issues in the log. As above, most of the rest of the family seems to load fine. However, there seem to be a couple of others missing in this family.

AN84:CAEEL|WormBase=WBGene00003107|UniProtKB=Q22002;
AN90:DROME|FlyBase=FBgn0259986|UniProtKB=Q59E55;
AN94:HELRO|EnsemblGenome=HelroG162517|UniProtKB=T1ESS4;

Now, the first two appear to be missing from the AmiGO load, possibly because they are not in the upstream file, so naturally it's not an issue that they do not appear. The last one, which is what started this ticket, appears with partial information.

kltm commented 5 years ago

bbop@wok:/home/skyhook/release/annotations$ zgrep -i "[[:space:]]T1ESS4[[:space:]]" *.gaf.gz
goa_uniprot_all.gaf.gz:UniProtKB    T1ESS4  20199624        GO:0005634  GO_REF:0000002  IEA InterPro:IPR006988|InterPro:IPR006989|InterPro:IPR039040    C   Uncharacterized protein 20199624|HELRODRAFT_162517  protein taxon:6412  20181103    InterPro        
goa_uniprot_all.gaf.gz:UniProtKB    T1ESS4  20199624        GO:0006355  GO_REF:0000002  IEA InterPro:IPR006988|InterPro:IPR039040   P   Uncharacterized protein 20199624|HELRODRAFT_162517  protein taxon:6412  20181103    InterPro        
goa_uniprot_all.gaf.gz:UniProtKB    T1ESS4  20199624        GO:0045892  GO_REF:0000002  IEA InterPro:IPR006989  P   Uncharacterized protein 20199624|HELRODRAFT_162517  protein taxon:6412  20181103    InterPro        

Okay, those are all our inputs. Note that none of these are what appears in AmiGO, which is disturbing.
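For anyone repeating this check without zgrep, here is a minimal Python sketch of the same lookup. It assumes a standard tab-separated GAF where column 2 is the DB Object ID; `find_gaf_lines` is a hypothetical helper name, and unlike `zgrep -i` it matches the ID exactly and only in that column.

```python
import gzip


def find_gaf_lines(path, object_id):
    """Return GAF rows (as column lists) whose column 2 equals object_id."""
    hits = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("!"):  # skip GAF header/comment lines
                continue
            columns = line.rstrip("\n").split("\t")
            # Column 2 (index 1) holds the DB Object ID, e.g. T1ESS4.
            if len(columns) > 1 and columns[1] == object_id:
                hits.append(columns)
    return hits
```

Matching on the column rather than on surrounding whitespace avoids hits where the ID only appears inside another field, such as a synonym list.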

kltm commented 5 years ago

bbop@wok:/home/skyhook/release/products/annotations$ zgrep -i "[[:space:]]T1ESS4[[:space:]]" paint_*.gaf.gz
paint_other.gaf.gz:UniProtKB    T1ESS4  HELRODRAFT_162517       GO:0006355  PMID:21873635   IBA MGI:MGI:107563|MGI:MGI:107564|PANTHER:PTN000290064  P   Uncharacterized protein UniProtKB:T1ESS4|PTN002651296   protein taxon:6412  20170228    GO_Central      
paint_other.gaf.gz:UniProtKB    T1ESS4  HELRODRAFT_162517       GO:0005634  PMID:21873635   IBA MGI:MGI:107563|PANTHER:PTN000290064 C   Uncharacterized protein UniProtKB:T1ESS4|PTN002651296   protein taxon:6412  20170228    GO_Central      
paint_other_noiea.gaf.gz:UniProtKB  T1ESS4  HELRODRAFT_162517       GO:0006355  PMID:21873635   IBA PANTHER:PTN000290064|MGI:MGI:107564|MGI:MGI:107563  P   Uncharacterized protein UniProtKB:T1ESS4|PTN002651296   protein taxon:6412  20170228    GO_Central      
paint_other_noiea.gaf.gz:UniProtKB  T1ESS4  HELRODRAFT_162517       GO:0005634  PMID:21873635   IBA PANTHER:PTN000290064|MGI:MGI:107563 C   Uncharacterized protein UniProtKB:T1ESS4|PTN002651296   protein taxon:6412  20170228    GO_Central      
paint_other-src.gaf.gz:UniProtKB    T1ESS4  HELRODRAFT_162517       GO:0006355  PMID:21873635   IBA PANTHER:PTN000290064|MGI:MGI:107564|MGI:MGI:107563  P   Uncharacterized protein UniProtKB:T1ESS4|PTN002651296   protein taxon:6412  20170228    GO_Central      
paint_other-src.gaf.gz:UniProtKB    T1ESS4  HELRODRAFT_162517       GO:0005634  PMID:21873635   IBA PANTHER:PTN000290064|MGI:MGI:107563 C   Uncharacterized protein UniProtKB:T1ESS4|PTN002651296   protein taxon:6412  20170228    GO_Central      

Note that the ones we want are the first two. They are not in the main products because they come in from paint_other.

kltm commented 5 years ago

I think I have it--traced through with a modified owltools. The error seems to be an off-by-one in PANTHERTree.java in readyAnnotationDataCache.

kltm commented 5 years ago

The fix is now picked up on snapshot: http://amigo-exp.geneontology.io/amigo/gene_product/UniProtKB:T1ESS4 It should clear on the next release.

pgaudet commented 5 years ago

This looks done?

pgaudet commented 5 years ago

I just found another gene with the PTHR family missing, Drosophila CG4678: [image]

Can we have a query that checks that all annotations with IBA evidence also have a PTHR family?

Thanks, Pascale

kltm commented 5 years ago

@pgaudet We currently have no query framework for items in the Solr load. That would be a separate feature request.
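As a rough illustration of what such a check could look like outside AmiGO, the sketch below filters annotation documents that have already been fetched as Python dicts. The `bioentity` and `panther_family_label` field names are taken from the search URLs in this thread; `evidence_type` is an assumed field name, and `iba_without_family` is a hypothetical helper, not an existing AmiGO or Solr feature.

```python
def iba_without_family(docs):
    """Return sorted bioentity IDs that have an IBA annotation but no PANTHER family."""
    missing = set()
    for doc in docs:
        is_iba = doc.get("evidence_type") == "IBA"  # assumed field name
        has_family = bool(doc.get("panther_family_label"))
        if is_iba and not has_family:
            missing.add(doc.get("bioentity", "<unknown>"))
    return sorted(missing)


# Usage with hand-made documents (not real load output):
docs = [
    {"bioentity": "UniProtKB:T1ESS4", "evidence_type": "IBA",
     "panther_family_label": "ngfi-a binding protein pthr12623"},
    {"bioentity": "FB:FBgn0030778", "evidence_type": "IBA"},
    {"bioentity": "FB:FBgn0030778", "evidence_type": "IEA"},
]
print(iba_without_family(docs))
```

Collecting IDs in a set keeps each gene product listed once even when several of its IBA annotations lack the family.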

kltm commented 5 years ago

FB:FBgn0030778 http://amigo.geneontology.org/amigo/search/annotation?q=*:*&fq=bioentity:%22FB:FBgn0030778%22&sfq=document_category:%22annotation%22

kltm commented 5 years ago

From the current release:

bbop@wok:/tmp/foo$ grep -r "FBgn0030778" *
PTHR11532/PTHR11532.arbre:AN457:DROME|FlyBase=FBgn0030778|UniProtKB=B7Z0Z5;

kltm commented 5 years ago

@dustine32 This is interesting. In the tree files, the gene identifiers appear to be out of sync with what the GO generally uses: the label for the contributor (e.g. "FlyBase") appears to be conflated with the CURIE namespace (e.g. "FB"). For the algorithm to work as it currently does, the tree-file prefixes would need to be replaced with the matching CURIE namespaces. The two listings below show the mismatch.

The currently used CURIE namespaces for GO are:

bbop@wok:/home/skyhook/release/annotations$ zgrep --no-filename -v "^!" *.gaf.gz | cut -f 1 | sort | uniq -c | sort -n
     15 PAMGO_VMD
    110 NCBI_GP
    183 NCBI_NP
    304 ASAP
   1349 SGN
   1498 ComplexPortal
   3639 PseudoCAP
   5119 EcoGene
  26627 NCBI
  31587 GeneDB
  47260 GR_protein
  52074 JCVI_CMR
  54523 PomBase
  74669 dictyBase
  92619 TIGR_CMR
 112832 FB
 118016 WB
 127308 SGD
 217989 ZFIN
 230582 TAIR
 292089 CGD
 405751 MGI
 436843 RGD
 634197 AspGD
13534661 RNAcentral
586123510 UniProtKB

The non-CURIE gene identifiers, as they currently appear in the tree files, are:

bbop@wok:/tmp/foo$ grep "^AN" /tmp/an.txt | tr '|' '\n' | grep -v "^AN" | tr -d ';' | cut -d '=' -f 1 | sort | uniq -c | sort -n
      4 Araport
     45 Gene_Name
    245 GeneID
    592 CGD
   3237 EcoGene
   3820 Gene_OrderedLocusName
   4314 PomBase
   4581 SGD
   8227 dictyBase
  10324 FlyBase
  10952 Gene_ORFName
  12213 Xenbase
  13803 WormBase
  18816 RGD
  19358 HGNC
  20127 TAIR
  20365 ZFIN
  21039 MGI
 175481 Gene
 278029 Ensembl
 472806 EnsemblGenome
1098378 UniProtKB

These should be aligned, with preference given to the proper CURIE namespace.
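A minimal sketch of that alignment for a single tree-file node line, assuming a small prefix-to-namespace table. Only FlyBase to FB is stated in this thread; WormBase to WB is inferred from the two listings above, and any other prefix is passed through unchanged. `NAMESPACE_MAP` and `arbre_node_to_curies` are hypothetical names, not part of owltools or createGAF.pl.

```python
# Hypothetical, deliberately incomplete mapping from tree-file prefixes
# to the CURIE namespaces used in the GAFs.
NAMESPACE_MAP = {
    "FlyBase": "FB",
    "WormBase": "WB",
}


def arbre_node_to_curies(line):
    """Turn 'AN457:DROME|FlyBase=FBgn0030778|UniProtKB=B7Z0Z5;' into CURIEs."""
    fields = line.strip().rstrip(";").split("|")
    curies = []
    for field in fields[1:]:  # fields[0] is the node/species part, e.g. AN457:DROME
        prefix, _, local_id = field.partition("=")
        namespace = NAMESPACE_MAP.get(prefix, prefix)  # unknown prefixes pass through
        curies.append(f"{namespace}:{local_id}")
    return curies


print(arbre_node_to_curies("AN457:DROME|FlyBase=FBgn0030778|UniProtKB=B7Z0Z5;"))
# ['FB:FBgn0030778', 'UniProtKB:B7Z0Z5']
```

With the mapping applied, the FlyBase entry resolves to the same `FB:FBgn0030778` identifier used in the annotation search above.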

dustine32 commented 5 years ago

This was probably mentioned above, but the AmiGO pipeline is pulling these mismatched identifiers from here: http://data.pantherdb.org/current/globals/tree_files.tar.gz

This is a resource I added specifically for the AmiGO load, so I'm thinking we can simply transform the identifier namespaces in these files to the GO-friendly CURIEs. This would mirror the namespace transformation logic performed by the IBA GAF generation script createGAF.pl.

@kltm Is it possible to try a test run once I generate the new tarball? Let's say I'll plan on putting it here: http://data.pantherdb.org/current_test/globals/tree_files.tar.gz

kltm commented 5 years ago

@dustine32 If you can wait until next week, after our release, you could also put them in the regular location and we can see what happens in the dailies.

dustine32 commented 5 years ago

@kltm Sure thing, thanks!