Closed dougli1sqrd closed 6 years ago
This report is from maybe a dozen lines or so of paint_other.
Pretty sure this is just description values containing unescaped semicolons (e.g. "transcriptional activator activity; RNA polymerase II proximal promoter sequence-specific DNA binding") in one of the intermediate input files ("paint_annotation"). The GAF generation script on the PAINT side is then splitting the one description field into multiple.
Working to clean these out right now.
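For anyone following along, here is a rough sketch of the failure mode described above. It assumes a semicolon-delimited intermediate record with a fixed number of fields. The record layout, the field count, and the IDs are made up for illustration and are not the real "paint_annotation" format. Only the description text comes from this thread.

```python
# Hypothetical layout: id;go_term;description (NOT the real paint_annotation format).
EXPECTED_FIELDS = 3


def split_naive(record, sep=";"):
    """Split a record on every separator, as a naive script would."""
    return record.split(sep)


def find_overlong_records(records, expected=EXPECTED_FIELDS):
    """Return (index, field_count) for records that split into too many fields."""
    bad = []
    for i, record in enumerate(records):
        count = len(split_naive(record))
        if count > expected:
            bad.append((i, count))
    return bad


# Made-up example records; the second description contains an unescaped semicolon.
records = [
    "ID1;GO:0000000;some activity",
    "ID2;GO:0000000;transcriptional activator activity; RNA polymerase II "
    "proximal promoter sequence-specific DNA binding",
]

overlong = find_overlong_records(records)
```

A field count above the expected number is a cheap signal that a description needs escaping or quoting before the GAF is generated.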
UniProtKB I1KXF7 LOC100807794 GO:0004675 PMID:21873635 IBA PANTHER:PTN002405275|TAIR:locus:2013021|TAIR:locus:2013825 F Uncharacterized protein UniProtKB:I1KXF7|PTN001987762 protein taxon:3847 2017-02-28 GO_Central
UniProtKB E1BLE5 IL37 GO:0005125 PMID:21873635 IBA PANTHER:PTN000008572|UniProtKB:P01584|MGI:MGI:1859324|UniProtKB:A0A1D5P4U4|MGI:MGI:96543|RGD:2891|MGI:MGI:1916927|MGI:MGI:2449929 F Uncharacterized protein UniProtKB:E1BLE5|PTN000008606 protein taxon:9913 2018-01-18 GO_Central
UniProtKB B9GJL0 POPTR_0001s45960g GO:0010252 PMID:21873635 IBA PANTHER:PTN001610873|TAIR:locus:2035859 P Adventitious rooting related oxygenase family protein UniProtKB:B9GJL0|PTN001610903 protein taxon:3694 2018-03-07 GO_Central
Looks better! And the pipeline products are passing and validated, so I will close this.
The metadata/paint.yaml indicates ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_other.gaf.gz is the place to get paint_other.
Doing:
curl -L ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_other.gaf.gz | gzip -dcf | head -n 10
we get a bunch of paint annotations that appear not to have a date. There also seem to be only 13 columns? This broke ontobio parsing. I can fix whatever robustness is needed in ontobio so it does not crash on this, but paint also looks like it needs fixing.
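As a rough idea of the kind of robustness check meant here, the sketch below reports problems on a GAF line instead of raising. It is not ontobio's actual parser or API. The function name and constants are hypothetical, and it assumes tab-separated GAF with at least 15 columns and a YYYYMMDD date in column 14.

```python
import re

# Assumptions: GAF lines are tab-separated, have at least 15 columns
# (GAF 2.x has 17), and carry a YYYYMMDD date in column 14.
GAF_MIN_COLUMNS = 15
DATE_COLUMN = 13  # zero-based index of column 14
DATE_RE = re.compile(r"^\d{8}$")


def check_gaf_line(line):
    """Return a list of problems found in one GAF line (empty if none)."""
    # Header/comment lines start with "!"; blank lines are skipped too.
    if line.startswith("!") or not line.strip():
        return []
    cols = line.rstrip("\n").split("\t")
    problems = []
    if len(cols) < GAF_MIN_COLUMNS:
        problems.append(
            f"expected at least {GAF_MIN_COLUMNS} columns, got {len(cols)}"
        )
    if len(cols) <= DATE_COLUMN or not DATE_RE.match(cols[DATE_COLUMN]):
        problems.append("missing or malformed date (expected YYYYMMDD)")
    return problems
```

A caller could log the returned problems and skip the line, so that a short or dateless row produces a report entry rather than a crash.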