barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
149 stars 21 forks source link

Error with PGAP annotation #262

Closed juliofdiaz closed 3 years ago

juliofdiaz commented 3 years ago

I have annotated a genome using NCBI's PGAP. Then I tried using the resulting gff or gbk from this annotation as input for breseq (v 0.35.5), and I get the following errors:

GBK

---> bowtie2  :: version 2.2.6 [/system/software/linux-x86_64/bowtie2/2.2.6/bin/bowtie2]
---> R        :: version 3.5.2 [/system/software/centos_7/R/3.5.2__intel-2018/bin/R]
+++   NOW PROCESSING Read and reference sequence file input
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Seq id  does not contain any valid alphanumeric characters.
FILE: libbreseq/reference_sequence.h   LINE: 1192
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Backtrace with 10 stack frames.
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x577f78]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x59ee10]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x58b316]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x58bf89]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x58fdb7]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x590285]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x42dbc0]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x40af6b]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fd30232b505]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x41ce47]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

GFF

---> bowtie2  :: version 2.2.6 [/system/software/linux-x86_64/bowtie2/2.2.6/bin/bowtie2]
---> R        :: version 3.5.2 [/system/software/centos_7/R/3.5.2__intel-2018/bin/R]
+++   NOW PROCESSING Read and reference sequence file input
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Attempt to convert empty string
FILE: libbreseq/common.h   LINE: 796
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Backtrace with 9 stack frames.
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x41d1f2]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x43f841]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x58ddc3]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x58ff1c]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x590285]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x42dbc0]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x40af6b]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f27b683c505]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x41ce47]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

I do not get the error if I use the pasta file with the genome. Any clues as to what the problem is? When I use the prokka gff or gbk, I also do not get the error.

jeffreybarrick commented 3 years ago

Can you post the first few lines of the GBF and GFF files? breseq looks for the seq_id there, and it seems to be encountering something unexpected.

juliofdiaz commented 3 years ago

GBK

LOCUS       EP67_contig_1          40698 bp    DNA     linear   BCT 26-APR-2021
DEFINITION  Pseudomonas aeruginosa chromosome, whole genome shotgun sequence.
ACCESSION   
VERSION
KEYWORDS    WGS.
SOURCE      Pseudomonas aeruginosa
  ORGANISM  Pseudomonas aeruginosa
            Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales;
            Pseudomonadaceae; Pseudomonas.
REFERENCE   1  (bases 1 to 40698)
  AUTHORS   Diaz Caballero,J.F.
  CONSRTM   COMBACTE_MAGNET
  TITLE     Direct Submission
  JOURNAL   Submitted (26-APR-2021) Department of Zoology, University of
            Oxford, 11a Mansfield Road, Oxford, Oxfordshire OX2 9DF, UK

GFF

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region  1 40698
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=287
EP67_contig_1   Local   region  1   40698   .   +   .   ID=EP67_contig_1:1..40698;Dbxref=taxon:287;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA
EP67_contig_1   .   gene    1   94  .   +   .   ID=gene-pgaptmp_000001;Name=pgaptmp_000001;gbkey=Gene;gene_biotype=protein_coding;locus_tag=pgaptmp_000001;partial=true;start_range=.,1
EP67_contig_1   Protein Homology    CDS 1   94  .   +   1   ID=cds-pgaptmp_000001;Parent=gene-pgaptmp_000001;Name=extdb:pgaptmp_000001;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_000679427.1;locus_tag=pgaptmp_000001;partial=true;product=QacE family quaternary ammonium compound efflux SMR transporter;protein_id=extdb:pgaptmp_000001;start_range=.,1;transl_table=11
EP67_contig_1   .   gene    88  927 .   +   .   ID=gene-pgaptmp_000002;Name=sul1;gbkey=Gene;gene=sul1;gene_biotype=protein_coding;locus_tag=pgaptmp_000002
EP67_contig_1   Protein Homology    CDS 88  927 .   +   0   ID=cds-pgaptmp_000002;Parent=gene-pgaptmp_000002;Name=extdb:pgaptmp_000002;gbkey=CDS;gene=sul1;inference=COORDINATES: similar to AA sequence:RefSeq:WP_000259031.1;locus_tag=pgaptmp_000002;product=sulfonamide-resistant dihydropteroate synthase Sul1;protein_id=extdb:pgaptmp_000002;transl_table=11
EP67_contig_1   .   gene    1055    1555    .   +   .   ID=gene-pgaptmp_000003;Name=pgaptmp_000003;gbkey=Gene;gene_biotype=protein_coding;locus_tag=pgaptmp_000003
EP67_contig_1   Protein Homology    CDS 1055    1555    .   +   0   ID=cds-pgaptmp_000003;Parent=gene-pgaptmp_000003;Name=extdb:pgaptmp_000003;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_004194535.1;locus_tag=pgaptmp_000003;product=GNAT family N-acetyltransferase;protein_id=extdb:pgaptmp_000003;transl_table=11
jeffreybarrick commented 3 years ago

Ok. It looks like this was fixed/addressed for the GenBank format in #252.

For now, the workaround would be to delete the empty VERSION and ACCESSION lines from the GenBank. This edited GenBank file should work with v0.35.5.

I have a fix in the current repository code for the GenBank problem, but the GFF3 case still fails. After I add a fix for that I will roll a new v0.35.6 version. With that you should be able to use the PGAP-generated files directly in breseq without needing to edit them.

juliofdiaz commented 3 years ago

I removed VERSION and accession, so gsk looks like this:

LOCUS       EP67_contig_1          40698 bp    DNA     linear   BCT 26-APR-2021
DEFINITION  Pseudomonas aeruginosa chromosome, whole genome shotgun sequence.
KEYWORDS    WGS.
SOURCE      Pseudomonas aeruginosa
  ORGANISM  Pseudomonas aeruginosa
            Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales;
            Pseudomonadaceae; Pseudomonas.
REFERENCE   1  (bases 1 to 40698)
  AUTHORS   Diaz Caballero,J.F.

and I get this error:

---> bowtie2  :: version 2.2.6 [/system/software/linux-x86_64/bowtie2/2.2.6/bin/bowtie2]
---> R        :: version 3.5.2 [/system/software/centos_7/R/3.5.2__intel-2018/bin/R]
+++   NOW PROCESSING Read and reference sequence file input
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Could not determine format of reference file: /data/zool-combacte-magnet/zool2417/results/pgap/pb1-ca-ci-pi/EP67/EP67_2.gbk
FILE: reference_sequence.cpp   LINE: 843
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Backtrace with 7 stack frames.
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x577f78]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x5900b2]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x590285]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x42dbc0]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x40af6b]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fd189f73505]
/data/zool-combacte-magnet/zool2417/bin/breseq-0.35.5/bin/breseq() [0x41ce47]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
jeffreybarrick commented 3 years ago

I'm not sure what's going on in that case. When I make a header like that one (but faked onto a different sequence file) it works in v0.35.6.

Can you try this new version with your original GenBank file? https://github.com/barricklab/breseq/releases/tag/v0.35.6

If it doesn't work, I might need you to email me the full GenBank file at the address in the breseq command line output header.

Thanks!

juliofdiaz commented 3 years ago

Thanks for looking into this. I can report that v 0.35.6 fixes the problem.

jeffreybarrick commented 3 years ago

Thanks!