barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
142 stars 21 forks source link

Errors: Parse error at line #### and terminating wit uncaught exception of type std::out_of_range:basic_string Abort trap:6 #239

Closed joshborin closed 4 years ago

joshborin commented 4 years ago

Hello! I'm a breseq noob and haven't been able to figure out what my issue is. My fastq files are a little unique so I'll describe them first:

To save money, I've paired single E coli and phage lambda clones together in each sample and sent these for 150Mb sequencing at MiGS. The plan was to run breseq twice for each sample, aligning once to the bacterial reference REL606 and once to the lambda reference. I made sure to have 50x more phage DNA in the sample tube given that the E coli genome is ~100x larger in order to balance coverage.

When I did the Test_Drive I don't seem to have any issues. When I run my paired end fastq files with the lambda ref genome it seems to properly detect mutations! However I get errors when using the bacterial REL606 ref. I have tried running on consensus and polymorphism modes and the error appears to be the same. I have created and specified blank output folders but this doesn't fix the issue.

I'm not sure what info in the terminal log is most relevant. I get 'parse error at line ...' at many many positions. However the code stops running at NOW PROCESSING Output:

+++ NOW PROCESSING Output Creating merged genome diff evidence file... Predicting mutations from evidence... libc++abi.dylib: terminating with uncaught exception of type std::out_of_range: basic_string Abort trap: 6 (base) meyer-144-46:Sample1-bacteria meyerlab$

Any advice you have on how to troubleshoot would be greatly appreciated!!

joshborin commented 4 years ago

Dropping the full Terminal Logs for breseq runs in consensus and polymorphism modes log_consensus_error.txt [log_poly_error.txt] (https://github.com/barricklab/breseq/files/4684182/log_poly_error.txt)

jeffreybarrick commented 4 years ago

Hi Josh, It looks like you are using an old version of breseq (v0.30.0). We have fixed some bugs that look like this since then. Can you see if the same error occurs when you use v0.35.1?

dannagifford commented 4 years ago

Not directly related to the error, but on your wanting to call mutations both in the phage and the bacterial genomes: I believe breseq can handle multiple reference genomes? You can separately specify REL606 and the phage genome by passing two -r arguments to breseq, e.g. breseq -r REL606.gbk -r phage.gbk reads1.fastq.gz reads2.gz

Then you only need run breseq once per genome. Either way you do it, you also need to hope there isn't much in the way of homology between your phage genome and any prophages in REL606, otherwise you can get wonky things that look like SNPs at some polymorphic frequency. (I wonder actually if specifying both genomes at once might actually help with that issue, as any homologous sequences should hopefully then match better to where they actually belong... sorry for all the edits, whomever is subscribed to the github issues.)

Hope it helps...

Danna

jeffreybarrick commented 4 years ago

That's all good advice @dannagifford.

Running breseq once with both references should minimize any cross-talk in terms of inadvertent incorrect mapping between prophages and phage, but there is still that possibility if their sequences are similar. In the most extreme case in which they are very similar, breseq won't be able to call mutations in either. You can recognize that this will be the case b/c they will show up in "red" versus in "blue" on the coverage graphs (as regions where reads ambiguously map to multiple locations: E. coli genome and phage in this case).

dannagifford commented 4 years ago

@jeffreybarrick I've wondered if the mate pair info would be able to help resolve issues like that, as if the first mate maps to the phage, and the second to the prophage or elsewhere in the chromosome, one could filter those first mates? IIRC breseq doesn't take mate pairing into account, though I have found such info useful in a previous genomics pipeline I've used for Pseudomonas (breseq is just way to easy to use though...)

jeffreybarrick commented 4 years ago

@dannagifford: It could certainly help disambiguate things in this case and others.

You are correct that breseq does not take paired-reads information into account when mapping. We have been slow to try implementing this b/c we expect it will take quite a bit of effort to be sure it doesn't mess up the analysis of split-read mapping, but it's on the list:

https://github.com/barricklab/breseq/issues/214

A halfway solution that is available in the meantime is to map the reads on your own and then supply them via the --aligned-sam option.

joshborin commented 4 years ago

Thank you both for the useful feedback!!! I updated breseq, bowtie2, TBB, etc. and now it works great! facepalm

On the topic of using 2 reference genomes...I anticipated this might be an issue. For some of my paired bacteria and phage, the phage is known to have recombined with a latent prophage in REL606. When I align an evolved clone of REL606 with only -r REL606 the recombined mutations show up as 'predicted' in the E coli genome but they are recombinase/excisionase. When I run the evolved REL606 strain with -r REL606 and -r lambda, the 'predicted mutations' table shows combined E coli and lambda mutations--essentially the same mutations but annotated for the lambda genome instead of E coli.

Jeff, I'm not sure what you mean by looking at the red and blue in the coverage graphs. It seems like, because parts of REL606 are homologous with lambda, this attempt to cut sequencing costs won't work :/

jeffreybarrick commented 4 years ago

Can you post the coverage graph (linked from summary page) for the lambda genome from the -r REL606 and -r lambda run? The one that looks like this one:

https://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/REL8593A_output/evidence/REL606.overview.png

If it is calling mutations in the phage when you run with both references, then the reads it is using for that must be aligning better to phage lambda than they are to the similar prophage, so it may be possible to disentangle things...

joshborin commented 4 years ago

Sure!

I've attached the summaries from the bacteria/phage pair aligned:

  1. -r REL606
  2. -r lambda (the NCBI lambda)
  3. -r cI26 (the lambda ancestor) 4a. -r REL606 -r lambda mapped to REL606 4b. -r REL606 -r lambda mapped to lambda

The pair of strains in this sample are a 30-day coevolved REL606 and a 28-day phage coevolved from lambda strain cI26. This phage recombined with a latent prophage in REL606 in Justin Meyer's 2012 coevol experiment, so, of all my samples, this pair would be the most difficult to disentangle.

1-REL606-align_coverage 2-lambda-align_coverage 3-cI26-align-coverage 4a-REL606-lambda-alignedto606 4b-REL606-lambda-alignedtolambda

jeffreybarrick commented 4 years ago

Oh. That looks pretty good–like it is able to map the reads unambiguously to the phage genome. The small reddish parts are regions where a read maps there and somewhere else equally. The bluish colors mean the reads only map there. If you know the missing phage regions are necessary for phage replication and can't really be deleted, then you can also probably safely infer that they are the areas that recombined with the prophage => because the reads that would map there are not shifted to map best to the REL606 genome. You might be able to see increased coverage of those areas if you created a graph that zoomed in on the prophage by using a command like this (from a breseq run directory):

breseq bam2seq REL606:start-end

(Change REL606 to whatever is in the seq_id column of the results and start/end to the region you want to "magnify".)

joshborin commented 4 years ago

I haven't been able to get that command to run correctly. When I can get it to run, it says that a bunch of processes are 'ALREADY COMPLETE'. I greatly appreciate all the time and effort you've devoted to helping me solve this issue but it doesn't seem like pairing bacteria with phage in attempt to cut sequencing costs will work--because REL606 and my strains of lambda share genes due to a recombination event :/

jeffreybarrick commented 4 years ago

Whoops! The command should be this if you want to give it another try.


breseq bam2cov REL606:start-end