geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Paint issues in paint_other: missing date and possibly more #681

Closed dougli1sqrd closed 6 years ago

dougli1sqrd commented 6 years ago

The metadata/paint.yaml indicates ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_other.gaf.gz is the place to get paint_other.

Running curl -L ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_other.gaf.gz | gzip -dcf | head -n 10 returns a bunch of PAINT annotations that appear to have no date. There also seem to be only 13 columns?

UniProtKGO_Central:571  DAPL1       GO:0070513  PMID:21873635   IBA PANTHER:PTN000323543|UniProtKB:P51397   F   Uncharacterized protein UniProtKB:F6V4U1|PTN001408148   protein taxon:9796
UniProtKGO_Central:571  menB        GO:0009234  PMID:21873635   IBA PANTHER:PTN000235258|UniProtKB:P9WNP5|EcoGene:EG11368   P   1,4-dihydroxy-2-naphthoyl-CoA synthase  UniProtKB:A9WBE1|PTN000235260   protein taxon:324602
UniProtKB   W5MTR8          GO:0005634  PMID:21873635   IBA PANTHER:PTN000129454|TAIR:locus:2024407|dictyBase:DDB_G0268410|UniProtKB:Q9H9J2 C   Uncharacterized protein UniProtKB:W5MTR8|PTN002598434   protein taxon:7918    GO_Central:57
UniProtKGO_Central20S0  PMID:21873635   IBATRIBUPANTHER:PTN000242188|UniProtKB:O75251|EcoGene:EG12083   F   F420H2 dehydrogenase subunit    UniProtKB:B1L7S0|PTN000945616   protein taxon:374847
UniProtKB   F7A024  RNF7        GO:0005634  PMID:21873635   IBA PANTHER:PTN000129805|PomBase:SPAC23H4.18c|SGD:S000005493|MGI:MGI:1337096|FB:FBgn0025638 C   Uncharacterized protein UniProtKB:F7A024|PTN000129834   protein GO_Central:57
UniProtKB   I1IEV8  LOC100835021        GO:0009535  PMID:21873635   IBA PANTHER:PTN000493738|TAIR:locus:2079117|TAIR:locus:2825741  C   Chlorophyll a-b binding protein, chloroplastic  UniProtKB:I1IEV8|PTN001855517protein    GO_Central:57
UniProtKGO_Central:579  CHLREDRAFT_174644       GO:0042795  PMID:21873635   IBA PANTHER:PTN000388289|FB:FBgn0038371 P   Predicted protein   UniProtKB:A8J1X9|PTN001031725   protein taxon:3055
UniProtKGO_Central:579  PMID:21873635422IBA PANTHER:PTN001501163|MGI:MGI:1277959    C   Putative uncharacterized protein    UniProtKB:E9GXQ9|PTN002350674   protein taxon:6669
UniProtKB   B8C506  THAPSDRAFT_5909     GO:0003700  PMID:21873635   IBA PANTHER:PTN000001154|SGD:S000003041|FB:FBgn0001222|TAIR:locus:2149050|SGD:S000001249    F   Uncharacterized protein UniProtKB:B8C506|PTN000797467protein    GO_Central8
UniProtKGO_Central:573  lta     GO:0008732  PMID:21873635   IBA PANTHER:PTN000033149|EcoGene:EG13690|SGD:S000000772 F   L-allo-threonine aldolase   UniProtKB:Q9HMP3|PTN000815888   protein taxon:64091

This broke ontobio parsing. I can make ontobio robust enough not to crash on input like this, but the PAINT file itself also looks like it needs fixing.
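The two symptoms described above (too few columns and no date) can be checked per line with something like the sketch below. It is not ontobio's parser. It assumes the GAF 2.x layout: tab-separated fields, at least 15 of them present, and the date in column 14. It accepts both YYYYMMDD and the hyphenated form seen later in this thread. The function name and the sample lines are made up for illustration.

```python
import re

# Assumptions about the GAF 2.x layout, not taken from this thread:
MIN_COLUMNS = 15   # columns 1-15 (through Assigned_By) must be present
DATE_COLUMN = 14   # 1-based position of the date column
# Accept YYYYMMDD as well as the hyphenated YYYY-MM-DD form.
DATE_PATTERN = re.compile(r"\d{8}|\d{4}-\d{2}-\d{2}")


def check_gaf_line(line):
    """Return a list of problems found in one GAF line (empty if none)."""
    if line.startswith("!"):
        return []  # header/comment lines are not annotations
    fields = line.rstrip("\r\n").split("\t")
    problems = []
    if len(fields) < MIN_COLUMNS:
        problems.append(
            f"expected at least {MIN_COLUMNS} columns, found {len(fields)}"
        )
    date = fields[DATE_COLUMN - 1] if len(fields) >= DATE_COLUMN else ""
    if not DATE_PATTERN.fullmatch(date):
        problems.append(f"missing or malformed date: {date!r}")
    return problems


# Synthetic 17-column line; every value here is a placeholder, not real data.
complete = "\t".join([
    "DB", "ID0001", "SYMBOL", "", "GO:0000000", "PMID:0", "IBA",
    "PANTHER:PTN000000000", "F", "Example protein", "DB:ID0001",
    "protein", "taxon:0", "20180307", "GO_Central", "", "",
])
# The same line cut down to 13 columns, mimicking the reported symptom.
truncated = "\t".join(complete.split("\t")[:13])

print(check_gaf_line(complete))
print(check_gaf_line(truncated))
```

Running a check like this over the first few hundred lines of a download would surface the problem before the file reaches a parser.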

dougli1sqrd commented 6 years ago

paint_other.report.log

This report is from maybe a dozen lines or so of paint_other.

dougli1sqrd commented 6 years ago

https://github.com/biolink/ontobio/pull/186

dustine32 commented 6 years ago

Pretty sure this is just description values containing unescaped semicolons (e.g. transcriptional activator activity; RNA polymerase II proximal promoter sequence-specific DNA binding) in one of the intermediate input files ("paint_annotation"). The GAF generation script on the PAINT side is then splitting the one description field into multiple fields.

Working to clean these out right now.
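A small illustration of the failure mode described in this comment. It assumes the intermediate file uses a semicolon as its field separator, which the comment implies but does not state. The record layout, the function names, and the cleanup (replacing the semicolon with a comma) are hypothetical and are not the actual PAINT script.

```python
SEPARATOR = ";"  # assumed field separator of the intermediate file


def join_record(fields):
    """Write fields out with no escaping, as the problematic input did."""
    return SEPARATOR.join(fields)


def split_record(record):
    """Split on every separator, unaware of separators inside values."""
    return record.split(SEPARATOR)


def clean_field(value):
    """One possible cleanup: replace the separator inside a value."""
    return value.replace(SEPARATOR, ",")


# The description text is the example quoted in the comment; the other two
# fields are placeholders.
description = (
    "transcriptional activator activity; RNA polymerase II proximal "
    "promoter sequence-specific DNA binding"
)
fields = ["GO:0000000", description, "F"]

raw = split_record(join_record(fields))
cleaned = split_record(join_record([clean_field(f) for f in fields]))

# Three fields go in; the uncleaned record comes back with an extra one,
# which shifts every later column in the generated output.
print(len(fields), len(raw), len(cleaned))
```

Escaping the separator instead of replacing it would preserve the original text, but that only works if the reader of the file also understands the escape.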

dougli1sqrd commented 6 years ago

UniProtKB   I1KXF7  LOC100807794        GO:0004675  PMID:21873635   IBA PANTHER:PTN002405275|TAIR:locus:2013021|TAIR:locus:2013825  F   Uncharacterized protein UniProtKB:I1KXF7|PTN001987762   protein taxon:3847  2017-02-28  GO_Central
UniProtKB   E1BLE5  IL37        GO:0005125  PMID:21873635   IBA PANTHER:PTN000008572|UniProtKB:P01584|MGI:MGI:1859324|UniProtKB:A0A1D5P4U4|MGI:MGI:96543|RGD:2891|MGI:MGI:1916927|MGI:MGI:2449929   F   Uncharacterized protein UniProtKB:E1BLE5|PTN000008606   protein taxon:9913  2018-01-18  GO_Central
UniProtKB   B9GJL0  POPTR_0001s45960g       GO:0010252  PMID:21873635   IBA PANTHER:PTN001610873|TAIR:locus:2035859 P   Adventitious rooting related oxygenase family protein   UniProtKB:B9GJL0|PTN001610903   protein taxon:3694  2018-03-07  GO_Central

Looks better! And the pipeline products are passing and validated, so I will close this.