pandas sorting, GFF3Parser problem

johnsolk commented 5 years ago

Something weird is going on with the GFF3Parser. I think something is happening on df concatenation step?

These are gff3 generated with master version of dammit, new shmlast=1.4 and 3 custom amino acid databases.

Installation from master with these instructions) on blank Jetstream, Ubuntu 18.04.

Command:

annotations = GFF3Parser(filename="gff3/A_xen_test.gff3").read()

Output:

/opt/miniconda3/envs/dammit_working_gff3/lib/python3.6/site-packages/dammit-1.1-py3.6.egg/dammit/fileio/base.py:79: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat(self, ignore_index=True)

The dataframe looks like this, annotations.head():

  | 0 100 + | 0 1000 + | 0 1002 + | 0 1004 + | 0 1005 + | 0 1006 + | 0 1007 + | 0 1008 + | 0 101 + | ... | score | seqid | source | start | strand | strand_x | strand_y | trunc | type_x | type_y
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5.500000e-26 | Transcript_0 | HMMER | 18 | NaN | NaN | NaN | NaN | protein_hmm_match | NaN
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.200000e-15 | Transcript_0 | HMMER | 18 | NaN | NaN | NaN | NaN | protein_hmm_match | NaN
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.600000e-08 | Transcript_0 | HMMER | 66 | NaN | NaN | NaN | NaN | protein_hmm_match | NaN
NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.200000e-62 | Transcript_0 | LAST | 247 | NaN | + | NaN | NaN | translated_nucleotide_match | NaN
NaN | NaN | NaN | NaN | Na

pd.__version__

'0.23.4'

# dammit
## a tool for easy de novo transcriptome annotation

by Camille Scott

**v1.1**, 2018

usage: dammit [-h] [--debug] [--version] {migrate,databases,annotate} ...

# dammit: a tool for easy de novo transcriptome annotation

optional arguments:
  -h, --help            show this help message and exit
  --debug
  --version             show program's version number and exit

dammit subcommands:
  {migrate,databases,annotate}
    databases           Check for databases and optionally download and
                        prepare them for use. By default, only check their
                        status.
    annotate            The main annotation pipeline. Calculates assembly
                        stats; runs BUSCO; runs LAST against OrthoDB (and
                        optionally uniref90), HMMER against Pfam, Inferal
                        against Rfam, and Conditional Reciprocal Best-hit
                        Blast against user databases; and aggregates all
                        results in a properly formatted GFF3 file.

I've tried downgrading pandas versions, downgrading dammit to conda install version. All give same weird dataframe output.

johnsolk commented 5 years ago

Ah. It's a problem with the gff3 file itself generated with the master version of dammit. An old gff3 file does just fine.

Command:

annotations = GFF3Parser(filename="A_xenica.trinity_out.Trinity.fasta.dammit.gff3").read()

No error. Output from annotations.head()

Dbxref  ID  Name    Note    Parent  Target  accuracy    bitscore    database    end env_coords  phase   score   seqid   source  start   strand  trunc   type
0   NaN homology:425062 90820   NaN NaN 90820 1132 1240 +   NaN NaN OrthoDB 366 NaN NaN 5.500000e-47    Transcript_100000   LAST    36  +   NaN translated_nucleotide_match
1   "Pfam:PF13465.2"    homology:53130  zf-H2C2_2   Zinc-finger double domain   NaN zf-H2C2_2 1 25 +    0.94    NaN NaN 198 127 201 NaN 1.200000e-03    Transcript_100001   HMMER   126 NaN NaN protein_hmm_match
2   "Pfam:PF13894.2"    homology:53126  zf-C2H2_4   C2H2-type zinc finger   NaN zf-C2H2_4 2 23 +    0.94    NaN NaN 234 166 237 NaN 4.000000e-03    Transcript_100001   HMMER   168 NaN NaN protein_hmm_match
3   "Pfam:PF12171.4"    homology:53128  zf-C2H2_jaz Zinc-finger double-stranded RNA-binding NaN zf-C2H2_jaz 4 23 +  0.93    NaN NaN 231 172 240 NaN 2.600000e-03    Transcript_100001   HMMER   171 NaN NaN protein_hmm_match
4   "Pfam:PF13912.2"    homology:53133  zf-C2H2_6   C2H2-type zinc finger   NaN zf-C2H2_6 4 24 +    0.96    NaN NaN 234 169 240 NaN 5.800000e-03    Transcript_100001   HMMER   171 NaN NaN protein_hmm_match

johnsolk commented 5 years ago

Any ideas how to fix the gff3 file after running dammit? Would prefer not to run dammit again on 17 of these.

Here's an example:

curl -L https://osf.io/bz5de/download -o F_diaphanus.trinity_out.Trinity.fasta.dammit.gff3

johnsolk commented 5 years ago

Has something changed with the way transcripts are sorted? Could this be the issue?

Old version of dammit with Transcript_100000 appearing first:

ljcohen@js-168-95:/kfish_annotations$ head A_xenica.trinity_out.Trinity.fasta.dammit.gff3
##gff-version 3.2.1
Transcript_100000   LAST    translated_nucleotide_match 37  366 5.500000e-47    +   .   ID=homology:425062;Name=90820;Target=90820 1132 1240 +;database=OrthoDB
Transcript_100001   HMMER   protein_hmm_match   127 198 1.200000e-03    .   .   ID=homology:53130;Name=zf-H2C2_2;Target=zf-H2C2_2 1 25 +;Note=Zinc-finger double domain;accuracy=0.94;env_coords=127 201;Dbxref="Pfam:PF13465.2"
Transcript_100001   HMMER   protein_hmm_match   169 234 4.000000e-03    .   .   ID=homology:53126;Name=zf-C2H2_4;Target=zf-C2H2_4 2 23 +;Note=C2H2-type zinc finger;accuracy=0.94;env_coords=166 237;Dbxref="Pfam:PF13894.2"
Transcript_100001   HMMER   protein_hmm_match   172 231 2.600000e-03    .   .   ID=homology:53128;Name=zf-C2H2_jaz;Target=zf-C2H2_jaz 4 23 +;Note=Zinc-finger double-stranded RNA-binding;accuracy=0.93;env_coords=172 240;Dbxref="Pfam:PF12171.4"
Transcript_100001   HMMER   protein_hmm_match   172 234 5.800000e-03    .   .   ID=homology:53133;Name=zf-C2H2_6;Target=zf-C2H2_6 4 24 +;Note=C2H2-type zinc finger;accuracy=0.96;env_coords=169 240;Dbxref="Pfam:PF13912.2"
Transcript_100001   HMMER   protein_hmm_match   172 234 5.800000e-04    .   .   ID=homology:53124;Name=zf-C2H2;Target=zf-C2H2 3 23 +;Note=Zinc finger, C2H2 type;accuracy=0.96;env_coords=166 234;Dbxref="Pfam:PF00096.22"
Transcript_100001   HMMER   protein_hmm_match   211 288 4.000000e+01    .   .   ID=homology:53131;Name=zf-H2C2_2;Target=zf-H2C2_2 2 18 +;Note=Zinc-finger double domain;accuracy=0.85;env_coords=208 303;Dbxref="Pfam:PF13465.2"
Transcript_100001   HMMER   protein_hmm_match   85  117 1.200000e+00    .   .   ID=homology:53129;Name=zf-H2C2_2;Target=zf-H2C2_2 15 25 +;Note=Zinc-finger double domain;accuracy=0.87;env_coords=43 120;Dbxref="Pfam:PF13465.2"
Transcript_100001   HMMER   protein_hmm_match   85  147 5.900000e-02    .   .   ID=homology:53127;Name=zf-C2H2_jaz;Target=zf-C2H2_jaz 2 22 +;Note=Zinc-finger double-stranded RNA-binding;accuracy=0.94;env_coords=82 147;Dbxref="Pfam:PF12171.4"

dammit install from master with Transcript_0 appearing first:

ljcohen@js-168-95:/kfish_annotations$ head gff3/A_xenica.trinity_out.Trinity.fasta.dammit.gff3 
##gff-version 3.2.1
Transcript_0    HMMER   protein_hmm_match   19  231 5.500000e-26    .   .   ID=homology:440222;Name=RRM_1;Target=RRM_1 1 70 +;Note=RNA recognition motif. (a.k.a. RRM, RBD, or RNP domain);accuracy=0.99;env_coords=19 231;Dbxref="Pfam:PF00076.18"
Transcript_0    HMMER   protein_hmm_match   19  231 6.200000e-15    .   .   ID=homology:440223;Name=RRM_6;Target=RRM_6 1 66 +;Note=RNA recognition motif (a.k.a. RRM, RBD, or RNP domain);accuracy=0.98;env_coords=19 231;Dbxref="Pfam:PF14259.2"
Transcript_0    HMMER   protein_hmm_match   67  237 2.600000e-08    .   .   ID=homology:440224;Name=RRM_5;Target=RRM_5 3 54 +;Note=RNA recognition motif. (a.k.a. RRM, RBD, or RNP domain);accuracy=0.92;env_coords=61 243;Dbxref="Pfam:PF13893.2"
Transcript_0    LAST    translated_nucleotide_match 248 775 3.200000e-62    +   .   ID=homology:623778;Name=sp|O93235|CIRBA_XENLA;Target=sp|O93235|CIRBA_XENLA 0 162 +;database=sprot
Transcript_0    LAST    translated_nucleotide_match 248 775 3.800000e-142   +   .   ID=homology:480416;Name=ENSORLP00000019321;Target=ENSORLP00000019321 0 185 +;database=OrthoDB
Transcript_0    shmlast.LAST    conditional_reciprocal_best_LAST    83  258 1.000000e-104   +   .   ID=homology:921142;Name=Funhe2EKm033401t1 oid=Funhe5EG005394t1,AUGepir8s158g37t1; aalen=169,23%,complete; type=protein; Name=Cold-inducible RNA-binding protein A (100%P); genegroup=FISH11G_G3892; Dbxref=UniProt:F1R6L3_DANRE,UniProt:CIRBP_HUMAN,OrthoDB6:EOG6ZKKX4;;Target=Funhe2EKm033401t1 oid=Funhe5EG005394t1,AUGepir8s158g37t1; aalen=169,23%,complete; type=protein; Name=Cold-inducible RNA-binding protein A (100%P); genegroup=FISH11G_G3892; Dbxref=UniProt:F1R6L3_DANRE,UniProt:CIRBP_HUMAN,OrthoDB6:EOG6ZKKX4; 0 165 +;database=kfish2rae5g.pub.aa
Transcript_0    shmlast.LAST    conditional_reciprocal_best_LAST    83  258 1.100000e-124   +   .   ID=homology:737874;Name=ref|XP_012730079.1| cold-inducible RNA-binding protein isoform X2 [Fundulus heteroclitus];Target=ref|XP_012730079.1| cold-inducible RNA-binding protein isoform X2 [Fundulus heteroclitus] 0 181 +;database=protein.fa
Transcript_0    shmlast.LAST    conditional_reciprocal_best_LAST    83  258 1.100000e-124   +   .   ID=homology:839596;Name=ref|XP_012730076.1| cold-inducible RNA-binding protein isoform X1 [Fundulus heteroclitus];Target=ref|XP_012730076.1| cold-inducible RNA-binding protein isoform X1 [Fundulus heteroclitus] 0 181 +;database=protein.fa
Transcript_0    shmlast.LAST    conditional_reciprocal_best_LAST    83  258 3.100000e-105   +   .   ID=homology:1648691;Name=ENSFHEP00000001826.1 pep primary_assembly:Fundulus_heteroclitus-3.0.2:KN811434.1:628724:631593:-1 gene:ENSFHEG00000002580.1 transcript:ENSFHET00000013224.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:cirbpa description:cold inducible RNA binding protein a [Source:ZFIN;Acc:ZDB-GENE-050417-329];Target=ENSFHEP00000001826.1 pep primary_assembly:Fundulus_heteroclitus-3.0.2:KN811434.1:628724:631593:-1 gene:ENSFHEG00000002580.1 transcript:ENSFHET00000013224.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:cirbpa description:cold inducible RNA binding protein a [Source:ZFIN;Acc:ZDB-GENE-050417-329] 0 165 +;database=Fundulus_heteroclitus.Fundulus_heteroclitus-3.0.2.pep.all.fa

camillescott commented 5 years ago

Update: problem was caused by existence of semicolon-delimited key-value pairs in custom protein database headers. Fixed with a custom script. Note for future version: perform rename on custom protein databases to avoid headaches, OR produce a warning if semicolons are detected in the custom databases.

johnsolk commented 5 years ago

@camillescott 's wizardry fixing the problem: https://gist.github.com/camillescott/51e3663d5c09db67d9899e4bc3c55266

Thank you!

dib-lab / dammit

pandas sorting, GFF3Parser problem #139