Open johnsolk opened 5 years ago
Ah. It's a problem with the gff3 file itself, as generated by the master version of dammit. An old gff3 file parses just fine.
Command:
annotations = GFF3Parser(filename="A_xenica.trinity_out.Trinity.fasta.dammit.gff3").read()
No error. Output from annotations.head()
Dbxref ID Name Note Parent Target accuracy bitscore database end env_coords phase score seqid source start strand trunc type
0 NaN homology:425062 90820 NaN NaN 90820 1132 1240 + NaN NaN OrthoDB 366 NaN NaN 5.500000e-47 Transcript_100000 LAST 36 + NaN translated_nucleotide_match
1 "Pfam:PF13465.2" homology:53130 zf-H2C2_2 Zinc-finger double domain NaN zf-H2C2_2 1 25 + 0.94 NaN NaN 198 127 201 NaN 1.200000e-03 Transcript_100001 HMMER 126 NaN NaN protein_hmm_match
2 "Pfam:PF13894.2" homology:53126 zf-C2H2_4 C2H2-type zinc finger NaN zf-C2H2_4 2 23 + 0.94 NaN NaN 234 166 237 NaN 4.000000e-03 Transcript_100001 HMMER 168 NaN NaN protein_hmm_match
3 "Pfam:PF12171.4" homology:53128 zf-C2H2_jaz Zinc-finger double-stranded RNA-binding NaN zf-C2H2_jaz 4 23 + 0.93 NaN NaN 231 172 240 NaN 2.600000e-03 Transcript_100001 HMMER 171 NaN NaN protein_hmm_match
4 "Pfam:PF13912.2" homology:53133 zf-C2H2_6 C2H2-type zinc finger NaN zf-C2H2_6 4 24 + 0.96 NaN NaN 234 169 240 NaN 5.800000e-03 Transcript_100001 HMMER 171 NaN NaN protein_hmm_match
Any ideas how to fix the gff3 file after running dammit? Would prefer not to run dammit again on 17 of these.
Here's an example:
curl -L https://osf.io/bz5de/download -o F_diaphanus.trinity_out.Trinity.fasta.dammit.gff3
Has something changed with the way transcripts are sorted? Could this be the issue?
Old version of dammit, with Transcript_100000 appearing first:
ljcohen@js-168-95:/kfish_annotations$ head A_xenica.trinity_out.Trinity.fasta.dammit.gff3
##gff-version 3.2.1
Transcript_100000 LAST translated_nucleotide_match 37 366 5.500000e-47 + . ID=homology:425062;Name=90820;Target=90820 1132 1240 +;database=OrthoDB
Transcript_100001 HMMER protein_hmm_match 127 198 1.200000e-03 . . ID=homology:53130;Name=zf-H2C2_2;Target=zf-H2C2_2 1 25 +;Note=Zinc-finger double domain;accuracy=0.94;env_coords=127 201;Dbxref="Pfam:PF13465.2"
Transcript_100001 HMMER protein_hmm_match 169 234 4.000000e-03 . . ID=homology:53126;Name=zf-C2H2_4;Target=zf-C2H2_4 2 23 +;Note=C2H2-type zinc finger;accuracy=0.94;env_coords=166 237;Dbxref="Pfam:PF13894.2"
Transcript_100001 HMMER protein_hmm_match 172 231 2.600000e-03 . . ID=homology:53128;Name=zf-C2H2_jaz;Target=zf-C2H2_jaz 4 23 +;Note=Zinc-finger double-stranded RNA-binding;accuracy=0.93;env_coords=172 240;Dbxref="Pfam:PF12171.4"
Transcript_100001 HMMER protein_hmm_match 172 234 5.800000e-03 . . ID=homology:53133;Name=zf-C2H2_6;Target=zf-C2H2_6 4 24 +;Note=C2H2-type zinc finger;accuracy=0.96;env_coords=169 240;Dbxref="Pfam:PF13912.2"
Transcript_100001 HMMER protein_hmm_match 172 234 5.800000e-04 . . ID=homology:53124;Name=zf-C2H2;Target=zf-C2H2 3 23 +;Note=Zinc finger, C2H2 type;accuracy=0.96;env_coords=166 234;Dbxref="Pfam:PF00096.22"
Transcript_100001 HMMER protein_hmm_match 211 288 4.000000e+01 . . ID=homology:53131;Name=zf-H2C2_2;Target=zf-H2C2_2 2 18 +;Note=Zinc-finger double domain;accuracy=0.85;env_coords=208 303;Dbxref="Pfam:PF13465.2"
Transcript_100001 HMMER protein_hmm_match 85 117 1.200000e+00 . . ID=homology:53129;Name=zf-H2C2_2;Target=zf-H2C2_2 15 25 +;Note=Zinc-finger double domain;accuracy=0.87;env_coords=43 120;Dbxref="Pfam:PF13465.2"
Transcript_100001 HMMER protein_hmm_match 85 147 5.900000e-02 . . ID=homology:53127;Name=zf-C2H2_jaz;Target=zf-C2H2_jaz 2 22 +;Note=Zinc-finger double-stranded RNA-binding;accuracy=0.94;env_coords=82 147;Dbxref="Pfam:PF12171.4"
dammit installed from master, with Transcript_0 appearing first:
ljcohen@js-168-95:/kfish_annotations$ head gff3/A_xenica.trinity_out.Trinity.fasta.dammit.gff3
##gff-version 3.2.1
Transcript_0 HMMER protein_hmm_match 19 231 5.500000e-26 . . ID=homology:440222;Name=RRM_1;Target=RRM_1 1 70 +;Note=RNA recognition motif. (a.k.a. RRM, RBD, or RNP domain);accuracy=0.99;env_coords=19 231;Dbxref="Pfam:PF00076.18"
Transcript_0 HMMER protein_hmm_match 19 231 6.200000e-15 . . ID=homology:440223;Name=RRM_6;Target=RRM_6 1 66 +;Note=RNA recognition motif (a.k.a. RRM, RBD, or RNP domain);accuracy=0.98;env_coords=19 231;Dbxref="Pfam:PF14259.2"
Transcript_0 HMMER protein_hmm_match 67 237 2.600000e-08 . . ID=homology:440224;Name=RRM_5;Target=RRM_5 3 54 +;Note=RNA recognition motif. (a.k.a. RRM, RBD, or RNP domain);accuracy=0.92;env_coords=61 243;Dbxref="Pfam:PF13893.2"
Transcript_0 LAST translated_nucleotide_match 248 775 3.200000e-62 + . ID=homology:623778;Name=sp|O93235|CIRBA_XENLA;Target=sp|O93235|CIRBA_XENLA 0 162 +;database=sprot
Transcript_0 LAST translated_nucleotide_match 248 775 3.800000e-142 + . ID=homology:480416;Name=ENSORLP00000019321;Target=ENSORLP00000019321 0 185 +;database=OrthoDB
Transcript_0 shmlast.LAST conditional_reciprocal_best_LAST 83 258 1.000000e-104 + . ID=homology:921142;Name=Funhe2EKm033401t1 oid=Funhe5EG005394t1,AUGepir8s158g37t1; aalen=169,23%,complete; type=protein; Name=Cold-inducible RNA-binding protein A (100%P); genegroup=FISH11G_G3892; Dbxref=UniProt:F1R6L3_DANRE,UniProt:CIRBP_HUMAN,OrthoDB6:EOG6ZKKX4;;Target=Funhe2EKm033401t1 oid=Funhe5EG005394t1,AUGepir8s158g37t1; aalen=169,23%,complete; type=protein; Name=Cold-inducible RNA-binding protein A (100%P); genegroup=FISH11G_G3892; Dbxref=UniProt:F1R6L3_DANRE,UniProt:CIRBP_HUMAN,OrthoDB6:EOG6ZKKX4; 0 165 +;database=kfish2rae5g.pub.aa
Transcript_0 shmlast.LAST conditional_reciprocal_best_LAST 83 258 1.100000e-124 + . ID=homology:737874;Name=ref|XP_012730079.1| cold-inducible RNA-binding protein isoform X2 [Fundulus heteroclitus];Target=ref|XP_012730079.1| cold-inducible RNA-binding protein isoform X2 [Fundulus heteroclitus] 0 181 +;database=protein.fa
Transcript_0 shmlast.LAST conditional_reciprocal_best_LAST 83 258 1.100000e-124 + . ID=homology:839596;Name=ref|XP_012730076.1| cold-inducible RNA-binding protein isoform X1 [Fundulus heteroclitus];Target=ref|XP_012730076.1| cold-inducible RNA-binding protein isoform X1 [Fundulus heteroclitus] 0 181 +;database=protein.fa
Transcript_0 shmlast.LAST conditional_reciprocal_best_LAST 83 258 3.100000e-105 + . ID=homology:1648691;Name=ENSFHEP00000001826.1 pep primary_assembly:Fundulus_heteroclitus-3.0.2:KN811434.1:628724:631593:-1 gene:ENSFHEG00000002580.1 transcript:ENSFHET00000013224.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:cirbpa description:cold inducible RNA binding protein a [Source:ZFIN;Acc:ZDB-GENE-050417-329];Target=ENSFHEP00000001826.1 pep primary_assembly:Fundulus_heteroclitus-3.0.2:KN811434.1:628724:631593:-1 gene:ENSFHEG00000002580.1 transcript:ENSFHET00000013224.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:cirbpa description:cold inducible RNA binding protein a [Source:ZFIN;Acc:ZDB-GENE-050417-329] 0 165 +;database=Fundulus_heteroclitus.Fundulus_heteroclitus-3.0.2.pep.all.fa
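In the shmlast lines above, the Name and Target values contain semicolons of their own. Here is a minimal sketch of what happens if column 9 is split on every semicolon. It is not dammit's actual parsing code. The function names are hypothetical, the expected keys are taken from the attribute columns in the annotations.head() output, and the sample line is an abbreviated, tab-joined version of the kfish2rae5g.pub.aa hit.

```python
# Attribute keys seen as dataframe columns in the old, working file.
EXPECTED_KEYS = {"ID", "Name", "Target", "Note", "Parent", "Dbxref",
                 "database", "accuracy", "env_coords", "bitscore", "trunc"}


def naive_attribute_keys(gff3_line):
    """Split column 9 on every ';' and return the key of each piece."""
    attributes = gff3_line.rstrip("\n").split("\t")[8]
    keys = []
    for piece in attributes.split(";"):
        piece = piece.strip()
        if piece:
            keys.append(piece.split("=", 1)[0])
    return keys


def unexpected_keys(gff3_line):
    """Return the keys that a naive split produces but nobody expects."""
    return [k for k in naive_attribute_keys(gff3_line) if k not in EXPECTED_KEYS]


# Abbreviated version of one shmlast line (header shortened for readability).
header = "Funhe2EKm033401t1 oid=Funhe5EG005394t1; aalen=169,23%,complete; type=protein"
sample = "\t".join([
    "Transcript_0", "shmlast.LAST", "conditional_reciprocal_best_LAST",
    "83", "258", "1.000000e-104", "+", ".",
    f"ID=homology:921142;Name={header};Target={header} 0 165 +;database=kfish2rae5g.pub.aa",
])

print(unexpected_keys(sample))
```

The spurious keys here are aalen and type. The second one has the same name as the type column that appears in the dataframe.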
Update: the problem was caused by semicolon-delimited key-value pairs in the custom protein database headers. Those headers are copied into the Name and Target attributes of the gff3, and since GFF3 uses semicolons to separate attributes, the extra semicolons get read as additional key-value pairs. Fixed with a custom script. Note for a future version: rename the sequences in custom protein databases to avoid headaches, OR produce a warning if semicolons are detected in the custom databases.
@camillescott's wizardry fixing the problem: https://gist.github.com/camillescott/51e3663d5c09db67d9899e4bc3c55266
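The two suggestions in that note could look roughly like the sketch below. It is not part of dammit. The function names, the Protein_ prefix, and the placeholder sequences are all assumptions made for illustration, and the headers are abbreviated from the ones in the output above.

```python
import warnings


def check_fasta_headers(lines, reserved=";"):
    """Return the FASTA headers that contain a reserved character, and warn if any do."""
    bad = [ln.rstrip("\n") for ln in lines
           if ln.startswith(">") and any(ch in ln for ch in reserved)]
    if bad:
        warnings.warn(
            f"{len(bad)} FASTA header(s) contain one of {reserved!r}; "
            "this may break GFF3 attribute parsing downstream."
        )
    return bad


def rename_fasta_headers(lines, prefix="Protein_"):
    """Replace every header with a generated ID.

    Returns the new lines plus a mapping from new ID to original header,
    so the original names can be restored later.
    """
    new_lines, mapping = [], {}
    for ln in lines:
        if ln.startswith(">"):
            new_id = f"{prefix}{len(mapping)}"
            mapping[new_id] = ln[1:].rstrip("\n")
            new_lines.append(f">{new_id}\n")
        else:
            new_lines.append(ln)
    return new_lines, mapping


# Placeholder records: headers abbreviated from the thread, sequences made up.
records = [
    ">Funhe2EKm033401t1 oid=Funhe5EG005394t1; aalen=169; type=protein\n",
    "MSDE\n",
    ">sp|O93235|CIRBA_XENLA\n",
    "MKKK\n",
]

flagged = check_fasta_headers(records)
renamed, id_map = rename_fasta_headers(records)
```

Keeping the mapping means the short IDs in the gff3 can be translated back to the full database headers after annotation.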
Thank you!
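For anyone else who wants to repair existing gff3 files instead of rerunning dammit, here is one possible approach. It is a sketch only and is not necessarily what the linked gist does. It assumes the real attribute keys are the ones in KNOWN_KEYS (taken from the dataframe columns above), and that a true attribute boundary is a semicolon immediately followed by one of those keys and an equals sign. Semicolons inside values are then percent-encoded as %3B. Other reserved characters are left untouched, and a header containing something like ";Name=" with no space after the semicolon would still be split incorrectly.

```python
import re

# Assumed set of genuine attribute keys; adjust for your own files.
KNOWN_KEYS = ("ID", "Name", "Target", "Note", "Parent", "Dbxref",
              "database", "accuracy", "env_coords", "bitscore", "trunc")

# A ';' counts as a boundary only when a known key and '=' follow directly.
BOUNDARY = re.compile(r";(?=(?:" + "|".join(KNOWN_KEYS) + r")=)")


def repair_attributes(attributes):
    """Percent-encode semicolons that sit inside attribute values."""
    fields = BOUNDARY.split(attributes)
    return ";".join(field.replace(";", "%3B") for field in fields)


def repair_line(line):
    """Repair column 9 of one GFF3 line; pass other lines through unchanged."""
    if line.startswith("#") or not line.strip():
        return line
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return line
    columns[8] = repair_attributes(columns[8])
    return "\t".join(columns) + "\n"


# Abbreviated attribute string modelled on the kfish2rae5g.pub.aa hit.
broken = ("ID=homology:921142;"
          "Name=Funhe2EKm033401t1 oid=Funhe5EG005394t1; aalen=169; type=protein;;"
          "Target=Funhe2EKm033401t1 oid=Funhe5EG005394t1; aalen=169; type=protein; 0 165 +;"
          "database=kfish2rae5g.pub.aa")
repaired = repair_attributes(broken)
```

Applying repair_line to every line of a file and writing the results to a new file leaves the original untouched, which makes it easy to compare the two before parsing again.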
Something weird is going on with the GFF3Parser. I think something is happening on the df concatenation step? These are gff3 files generated with the master version of dammit, the new shmlast=1.4, and 3 custom amino acid databases.
Installed from master (with these instructions) on a blank Jetstream instance, Ubuntu 18.04.
Command:
Output:
The dataframe looks like this, annotations.head():
pd.__version__
I've tried downgrading pandas versions and downgrading dammit to the conda install version. All give the same weird dataframe output.