barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0

Multi-sample calls? #145

Closed cjfields closed 6 years ago

cjfields commented 6 years ago

We've been using breseq for calling individual samples from a longitudinal study, but have run into issues when comparing results from multiple samples. This seems to partly be a threshold effect, where low-freq alleles are called in some samples but missed in others even though read evidence is present.

We can go back and recall frequencies using something like samtools pileup + bcftools, but we would like to be as consistent as possible when running related samples like this. Is there a way that we could either run breseq with multiple samples to 'rescue' regions that have consistent evidence across samples (somewhat like GATK's multi-sample calls)? Alternatively, can we point breseq to focus calls from specific regions to get allele frequencies?

jeffreybarrick commented 6 years ago

The way we've accomplished this before is to use the --user-evidence-gd option as follows:

(1) Run breseq on each sample separately

(2) Merge the output.gd files from all related runs to create a list of mutations called in any sample:

gdtools merge -o merged.gd input1.gd input2.gd ...

Be sure to use the -e flag so that merge operates on evidence (2-letter codes) instead of on mutations (3-letter codes).

(3) Re-run breseq on each sample, adding --user-evidence-gd merged.gd to the command-line options.

Now the output will include evidence for every mutation ever seen in any one sample (reads for/against, frequency, etc.) regardless of whether it passes the normal requirements for prediction in the one sample being analyzed.
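Put together, the three steps above can be sketched as a shell loop. This is a dry-run sketch, not a definitive recipe: the sample names, read file names, and the per-run output.gd path are all assumptions, and the run() wrapper only records and prints each command instead of executing it.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the two-pass multi-sample workflow described above.
# Sample names, read file names, and output paths are hypothetical.
set -euo pipefail

CMDS=""
run() { CMDS+="$*"$'\n'; echo "would run: $*"; }  # swap body for: "$@" to execute

SAMPLES="sampleA sampleB sampleC"
REFERENCE="reference.gbk"

# (1) Run breseq on each sample separately (-p = polymorphism mode).
for s in $SAMPLES; do
  run breseq -p -o "$s" -r "$REFERENCE" "${s}_R1.fastq.gz" "${s}_R2.fastq.gz"
done

# (2) Merge evidence (2-letter codes, hence -e) from all per-sample runs.
GD_FILES=""
for s in $SAMPLES; do GD_FILES+=" $s/output/output.gd"; done
run gdtools merge -e -o merged.gd $GD_FILES

# (3) Re-run each sample with the merged evidence so every site or junction
# seen in ANY sample is evaluated in ALL of them.
for s in $SAMPLES; do
  run breseq -p -o "${s}_rerun" -r "$REFERENCE" \
      --user-evidence-gd merged.gd "${s}_R1.fastq.gz" "${s}_R2.fastq.gz"
done
```

Replacing the run() body with "$@" turns the dry run into the real pipeline once the paths match your layout.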

Let me know if this doesn't work or you notice anything odd when trying it.

cjfields commented 6 years ago

@jeffreybarrick thanks, I'll give that a try

cjfields commented 6 years ago

As noted in #144 I'm getting an error at the breseq step using v.0.31.1. I should note, the original per-sample calls were made using an older version of breseq, so I may retry this using v.0.32.0.

EDIT: just to add, the errors are consistent across samples, so maybe there is redundancy introduced by the evidence information in the merging step?

Merging step:

[cjfields@compute-0-1 2018-01-20-gdtools-merge]$ gdtools merge -e -o all-merged.gd *.gd
In test mode. Program data path: /home/groups/hpcbio/apps/breseq/breseq-0.31.1-Linux-x86_64/share/breseq
================================================================================
breseq 0.31.1  revision 0142dbc81ef1   http://barricklab.org/breseq

Active Developers: Barrick JE, Deatherage DE
Contact:           <jeffrey.e.barrick@gmail.com>

breseq is free software; you can redistribute it and/or modify it under the
terms the GNU General Public License as published by the Free Software
Foundation; either version 2, or (at your option) any later version.

Copyright (c) 2008-2010 Michigan State University
Copyright (c) 2011-2017 The University of Texas at Austin

If you use breseq in your research, please cite:

  Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations
  in laboratory-evolved microbes from next-generation sequencing
  data using breseq. Methods Mol. Biol. 1151: 165–188.

If you use structural variation (junction) predictions, please cite:

  Barrick, J.E., Colburn, G., Deatherage D.E., Traverse, C.C.,
  Strand, M.D., Borges, J.J., Knoester, D.B., Reba, A., Meyer, A.G.
  (2014) Identifying structural variation in haploid microbial genomes
  from short-read resequencing data using breseq. BMC Genomics 15:1039.
================================================================================

*** Begin UNION ***

    Preserving: Evidence (2-letter codes)

    Reading input GD files

        G1841MIC.gd
        G311MIC.gd
        G413MIC.gd
        G923MIC.gd
        GBCComposite1.gd
        GBCComposite2.gd
        LangsampleDay18.gd
        LangsampleDay9.gd
        P1877MIC.gd
        P209MIC.gd
        P314MIC.gd
        P941MIC.gd
        R1823MIC.gd
        R510MIC.gd
        R611MIC.gd
        R914MIC.gd

    Assigning unique IDs

    Writing output GD file: all-merged.gd

*** End UNION ***

Independent breseq runs (run on a cluster using SLURM); the data are paired-end.

breseq -j $SLURM_CPUS_PER_TASK -p -o $SAMPLENAME -n $SAMPLENAME \
    -r $REFERENCE \
    --brief-html-output \
    --user-evidence-gd all-merged.gd \
    $READ1 $READ2

cjfields commented 6 years ago

I'm retrying the above using gdtools union to remove duplicates instead of gdtools merge, primarily to see if this addresses the problem.

jeffreybarrick commented 6 years ago

Erk... In the newest version MERGE and UNION are the same thing (UNION) despite what the main gdtools help says.

You are getting the errors after you MERGE/UNION your initial GD files? Do you know which entries are being marked as duplicates?

cjfields commented 6 years ago

That explains why the output for the two commands is exactly the same 😄

Yes, there appear to be several (see below for the full error). These were generated by merging the GD files from our original per-sample breseq runs from this summer, which used v0.28.1. Is it worth testing the latest release from scratch?

EDIT: I should add, this seems to occur in 05_alignment_correction. Full text for the error log is attached below

run-log.txt

>>> Error(s) in GenomeDiff format. FILE:  <<<
>>ERROR: Attempt to add duplicate of this existing entry from line NA:
JC<tab>134<tab>.<tab>NZ_JRYM01000001<tab>221437<tab>-1<tab>NZ_JRYM01000050<tab>80<tab>-1<tab>0<tab>alignment_overlap=56<tab>coverage_minus=68<tab>coverage_plus=36<tab>flanking_left=250<tab>flanking_right=80<tab>key=NZ_JRYM01000001__221437__-1__NZ_JRYM01000050__136__-1__56____250__80__0__0<tab>max_left=188<tab>max_left_minus=188<tab>max_left_plus=188<tab>max_min_left=78<tab>max_min_left_minus=78<tab>max_min_left_plus=0<tab>max_min_right=80<tab>max_min_right_minus=80<tab>max_min_right_plus=80<tab>max_pos_hash_score=340<tab>max_right=80<tab>max_right_minus=80<tab>max_right_plus=80<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=74<tab>side_1_continuation=0<tab>side_1_overlap=56<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=104
Add a 'unique' tag to one if this is intentional.
>>LINE:     0
JC<tab>164<tab>.<tab>NZ_JRYM01000001<tab>221437<tab>-1<tab>NZ_JRYM01000050<tab>80<tab>-1<tab>0<tab>alignment_overlap=56<tab>coverage_minus=0<tab>coverage_plus=0<tab>flanking_left=250<tab>flanking_right=250<tab>key=NZ_JRYM01000001__221437__-1__NZ_JRYM01000050__136__-1__56____250__250__0__0__UD<tab>max_left=0<tab>max_left_minus=0<tab>max_left_plus=0<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=0<tab>max_min_right_minus=0<tab>max_min_right_plus=0<tab>max_pos_hash_score=340<tab>max_right=0<tab>max_right_minus=0<tab>max_right_plus=0<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=0<tab>reject=COVERAGE_EVENNESS_SKEW<tab>side_1_continuation=0<tab>side_1_overlap=56<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=0<tab>user_defined=1
>>ERROR: Attempt to add duplicate of this existing entry from line NA:
JC<tab>118<tab>.<tab>NZ_JRYM01000011<tab>105<tab>-1<tab>NZ_JRYM01000052<tab>1123<tab>-1<tab>-54<tab>alignment_overlap=-54<tab>coverage_minus=10<tab>coverage_plus=9<tab>flanking_left=250<tab>flanking_right=250<tab>key=NZ_JRYM01000011__105__-1__NZ_JRYM01000052__1123__-1__-54__GGAAGGTGCGAACAAGTCCCTGATATGAGATCATGTTTGTCATCTGGAGCCATA__250__250__0__0__UD<tab>max_left=189<tab>max_left_minus=189<tab>max_left_plus=189<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=20<tab>max_min_right_minus=20<tab>max_min_right_plus=20<tab>max_pos_hash_score=344<tab>max_right=20<tab>max_right_minus=20<tab>max_right_plus=20<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=13<tab>side_1_continuation=0<tab>side_1_overlap=0<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=19<tab>unique_read_sequence=GGAAGGTGCGAACAAGTCCCTGATATGAGATCATGTTTGTCATCTGGAGCCATA<tab>user_defined=1
Add a 'unique' tag to one if this is intentional.
>>LINE:     0
JC<tab>131<tab>.<tab>NZ_JRYM01000011<tab>105<tab>-1<tab>NZ_JRYM01000052<tab>1123<tab>-1<tab>-54<tab>alignment_overlap=-54<tab>coverage_minus=1<tab>coverage_plus=10<tab>flanking_left=105<tab>flanking_right=250<tab>key=NZ_JRYM01000011__105__-1__NZ_JRYM01000052__1123__-1__-54__GGAAGGTGCGAACAAGTCCCTGATATGAGATCATGTTTGTCATCTGGAGCCATA__105__250__0__0<tab>max_left=84<tab>max_left_minus=71<tab>max_left_plus=84<tab>max_min_left=84<tab>max_min_left_minus=71<tab>max_min_left_plus=84<tab>max_min_right=83<tab>max_min_right_minus=0<tab>max_min_right_plus=83<tab>max_pos_hash_score=344<tab>max_right=90<tab>max_right_minus=84<tab>max_right_plus=90<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=11<tab>side_1_continuation=0<tab>side_1_overlap=0<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=11<tab>unique_read_sequence=GGAAGGTGCGAACAAGTCCCTGATATGAGATCATGTTTGTCATCTGGAGCCATA
>>ERROR: Attempt to add duplicate of this existing entry from line NA:
JC<tab>117<tab>.<tab>NZ_JRYM01000022<tab>40638<tab>1<tab>NZ_JRYM01000037<tab>35329<tab>-1<tab>0<tab>alignment_overlap=19<tab>coverage_minus=77<tab>coverage_plus=129<tab>flanking_left=136<tab>flanking_right=250<tab>key=NZ_JRYM01000022__40638__1__NZ_JRYM01000037__35348__-1__19____136__250__0__0<tab>max_left=115<tab>max_left_minus=115<tab>max_left_plus=115<tab>max_min_left=115<tab>max_min_left_minus=114<tab>max_min_left_plus=115<tab>max_min_right=114<tab>max_min_right_minus=114<tab>max_min_right_plus=113<tab>max_pos_hash_score=414<tab>max_right=225<tab>max_right_minus=222<tab>max_right_plus=225<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=137<tab>side_1_continuation=0<tab>side_1_overlap=19<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=206
Add a 'unique' tag to one if this is intentional.
>>LINE:     0
JC<tab>162<tab>.<tab>NZ_JRYM01000022<tab>40638<tab>1<tab>NZ_JRYM01000037<tab>35329<tab>-1<tab>0<tab>alignment_overlap=19<tab>coverage_minus=0<tab>coverage_plus=0<tab>flanking_left=250<tab>flanking_right=250<tab>key=NZ_JRYM01000022__40638__1__NZ_JRYM01000037__35348__-1__19____250__250__0__0__UD<tab>max_left=0<tab>max_left_minus=0<tab>max_left_plus=0<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=0<tab>max_min_right_minus=0<tab>max_min_right_plus=0<tab>max_pos_hash_score=414<tab>max_right=0<tab>max_right_minus=0<tab>max_right_plus=0<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=0<tab>reject=COVERAGE_EVENNESS_SKEW<tab>side_1_continuation=0<tab>side_1_overlap=19<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=0<tab>user_defined=1
>>ERROR: Attempt to add duplicate of this existing entry from line NA:
JC<tab>121<tab>.<tab>NZ_JRYM01000037<tab>249<tab>-1<tab>NZ_JRYM01000087<tab>2796<tab>1<tab>0<tab>alignment_overlap=0<tab>coverage_minus=107<tab>coverage_plus=84<tab>flanking_left=249<tab>flanking_right=175<tab>key=NZ_JRYM01000037__249__-1__NZ_JRYM01000087__2796__1__0____249__175__1__0<tab>max_left=221<tab>max_left_minus=220<tab>max_left_plus=221<tab>max_min_left=124<tab>max_min_left_minus=119<tab>max_min_left_plus=124<tab>max_min_right=125<tab>max_min_right_minus=125<tab>max_min_right_plus=122<tab>max_pos_hash_score=452<tab>max_right=158<tab>max_right_minus=158<tab>max_right_plus=158<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=134<tab>side_1_continuation=0<tab>side_1_overlap=0<tab>side_1_redundant=1<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=191
Add a 'unique' tag to one if this is intentional.
>>LINE:     0
JC<tab>161<tab>.<tab>NZ_JRYM01000037<tab>249<tab>-1<tab>NZ_JRYM01000087<tab>2796<tab>1<tab>0<tab>alignment_overlap=0<tab>coverage_minus=0<tab>coverage_plus=0<tab>flanking_left=250<tab>flanking_right=250<tab>key=NZ_JRYM01000037__249__-1__NZ_JRYM01000087__2796__1__0____250__250__1__0__UD<tab>max_left=0<tab>max_left_minus=0<tab>max_left_plus=0<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=0<tab>max_min_right_minus=0<tab>max_min_right_plus=0<tab>max_pos_hash_score=450<tab>max_right=0<tab>max_right_minus=0<tab>max_right_plus=0<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=0<tab>reject=COVERAGE_EVENNESS_SKEW<tab>side_1_continuation=0<tab>side_1_overlap=0<tab>side_1_redundant=1<tab>side_2_continuation=1<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=0<tab>user_defined=1
>>ERROR: Attempt to add duplicate of this existing entry from line NA:
JC<tab>109<tab>.<tab>NZ_JRYM01000040<tab>54771<tab>1<tab>NZ_JRYM01000062<tab>1<tab>1<tab>-54<tab>alignment_overlap=-54<tab>coverage_minus=153<tab>coverage_plus=171<tab>flanking_left=225<tab>flanking_right=250<tab>key=NZ_JRYM01000040__54771__1__NZ_JRYM01000062__1__1__-54__ACCCCAGAACTTACTTATGCTGATTCCGGTGCGAAAATTGTTAATAAAGGTACT__225__250__0__0<tab>max_left=190<tab>max_left_minus=186<tab>max_left_plus=190<tab>max_min_left=96<tab>max_min_left_minus=96<tab>max_min_left_plus=90<tab>max_min_right=98<tab>max_min_right_minus=98<tab>max_min_right_plus=98<tab>max_pos_hash_score=344<tab>max_right=189<tab>max_right_minus=189<tab>max_right_plus=188<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=209<tab>side_1_continuation=0<tab>side_1_overlap=0<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=324<tab>unique_read_sequence=ACCCCAGAACTTACTTATGCTGATTCCGGTGCGAAAATTGTTAATAAAGGTACT
Add a 'unique' tag to one if this is intentional.
>>LINE:     0
JC<tab>158<tab>.<tab>NZ_JRYM01000040<tab>54771<tab>1<tab>NZ_JRYM01000062<tab>1<tab>1<tab>-54<tab>alignment_overlap=-54<tab>coverage_minus=0<tab>coverage_plus=0<tab>flanking_left=250<tab>flanking_right=250<tab>key=NZ_JRYM01000040__54771__1__NZ_JRYM01000062__1__1__-54__ACCCCAGAACTTACTTATGCTGATTCCGGTGCGAAAATTGTTAATAAAGGTACT__250__250__0__0__UD<tab>max_left=0<tab>max_left_minus=0<tab>max_left_plus=0<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=0<tab>max_min_right_minus=0<tab>max_min_right_plus=0<tab>max_pos_hash_score=344<tab>max_right=0<tab>max_right_minus=0<tab>max_right_plus=0<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=0<tab>reject=COVERAGE_EVENNESS_SKEW<tab>side_1_continuation=0<tab>side_1_overlap=0<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=0<tab>unique_read_sequence=ACCCCAGAACTTACTTATGCTGATTCCGGTGCGAAAATTGTTAATAAAGGTACT<tab>user_defined=1
>>ERROR: Attempt to add duplicate of this existing entry from line NA:
JC<tab>105<tab>.<tab>NZ_JRYM01000040<tab>54772<tab>1<tab>NZ_JRYM01000075<tab>21<tab>1<tab>0<tab>alignment_overlap=20<tab>coverage_minus=136<tab>coverage_plus=138<tab>flanking_left=204<tab>flanking_right=250<tab>key=NZ_JRYM01000040__54772__1__NZ_JRYM01000075__1__1__20____204__250__0__0<tab>max_left=198<tab>max_left_minus=198<tab>max_left_plus=196<tab>max_min_left=112<tab>max_min_left_minus=112<tab>max_min_left_plus=112<tab>max_min_right=115<tab>max_min_right_minus=114<tab>max_min_right_plus=115<tab>max_pos_hash_score=412<tab>max_right=224<tab>max_right_minus=223<tab>max_right_plus=224<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=205<tab>side_1_continuation=0<tab>side_1_overlap=20<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=274
Add a 'unique' tag to one if this is intentional.
>>LINE:     0
JC<tab>160<tab>.<tab>NZ_JRYM01000040<tab>54772<tab>1<tab>NZ_JRYM01000075<tab>21<tab>1<tab>0<tab>alignment_overlap=20<tab>coverage_minus=0<tab>coverage_plus=0<tab>flanking_left=250<tab>flanking_right=250<tab>key=NZ_JRYM01000040__54772__1__NZ_JRYM01000075__1__1__20____250__250__0__0__UD<tab>max_left=0<tab>max_left_minus=0<tab>max_left_plus=0<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=0<tab>max_min_right_minus=0<tab>max_min_right_plus=0<tab>max_pos_hash_score=412<tab>max_right=0<tab>max_right_minus=0<tab>max_right_plus=0<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=0<tab>reject=COVERAGE_EVENNESS_SKEW<tab>side_1_continuation=0<tab>side_1_overlap=20<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=0<tab>user_defined=1
>>ERROR: Attempt to add duplicate of this existing entry from line NA:
JC<tab>132<tab>.<tab>NZ_JRYM01000048<tab>79<tab>-1<tab>NZ_JRYM01000088<tab>23<tab>1<tab>-72<tab>alignment_overlap=-72<tab>coverage_minus=52<tab>coverage_plus=83<tab>flanking_left=79<tab>flanking_right=250<tab>key=NZ_JRYM01000048__79__-1__NZ_JRYM01000088__23__1__-72__ATTTTGACCGATTGAGGTTTCCTATAGGTATTCATTCAAATATATCTCAGTTAGGAGTACTACTATTGTGAG__79__250__0__0<tab>max_left=79<tab>max_left_minus=78<tab>max_left_plus=79<tab>max_min_left=79<tab>max_min_left_minus=78<tab>max_min_left_plus=79<tab>max_min_right=79<tab>max_min_right_minus=61<tab>max_min_right_plus=79<tab>max_pos_hash_score=308<tab>max_right=170<tab>max_right_minus=167<tab>max_right_plus=170<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=90<tab>side_1_continuation=0<tab>side_1_overlap=0<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=135<tab>unique_read_sequence=ATTTTGACCGATTGAGGTTTCCTATAGGTATTCATTCAAATATATCTCAGTTAGGAGTACTACTATTGTGAG
Add a 'unique' tag to one if this is intentional.
>>LINE:     0
JC<tab>166<tab>.<tab>NZ_JRYM01000048<tab>79<tab>-1<tab>NZ_JRYM01000088<tab>23<tab>1<tab>-72<tab>alignment_overlap=-72<tab>coverage_minus=0<tab>coverage_plus=0<tab>flanking_left=250<tab>flanking_right=250<tab>key=NZ_JRYM01000048__79__-1__NZ_JRYM01000088__23__1__-72__ATTTTGACCGATTGAGGTTTCCTATAGGTATTCATTCAAATATATCTCAGTTAGGAGTACTACTATTGTGAG__250__250__0__0__UD<tab>max_left=0<tab>max_left_minus=0<tab>max_left_plus=0<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=0<tab>max_min_right_minus=0<tab>max_min_right_plus=0<tab>max_pos_hash_score=308<tab>max_right=0<tab>max_right_minus=0<tab>max_right_plus=0<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=0<tab>reject=COVERAGE_EVENNESS_SKEW<tab>side_1_continuation=0<tab>side_1_overlap=0<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=0<tab>unique_read_sequence=ATTTTGACCGATTGAGGTTTCCTATAGGTATTCATTCAAATATATCTCAGTTAGGAGTACTACTATTGTGAG<tab>user_defined=1
jeffreybarrick commented 6 years ago

There's definitely something up with parsing the GD file output by UNION -- those do not seem to be duplicate entries. Can you email or post all-merged.gd so that I can use it as input to debug?

cjfields commented 6 years ago

@jeffreybarrick I'm checking in with the PI first to make sure (this isn't my data).

cjfields commented 6 years ago

Hi @jeffreybarrick I checked and the PI was fine with sharing to debug this. I shared a Box folder using your UT email, which contains the original GD calls and the merged run (using the above gdtools merge call) along with a description on how the original analysis was performed. My suspicion is that the fragmented nature of the assembly is partly to blame, but this unfortunately represents the closest reference available.

Let me know if you need anything else!

jeffreybarrick commented 6 years ago

Got it and will look into things. Thanks!

cjfields commented 6 years ago

A quick addition: since there were only a handful, I tried removing the entries reported above as duplicates from the original evidence files and rerunning breseq. breseq does get past that step but then fails again during the last stage (08_mutation_identification) with a very similar error. I'll upload an example log to the Box folder.
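The manual cleanup described here can be sketched with awk: keep only the first JC line for each junction coordinate key. The column layout (fields 4–10 holding both sides' seq/pos/strand plus the overlap) is inferred from the JC error lines shown above, and the file names are examples.

```shell
#!/usr/bin/env bash
# Sketch: drop later duplicates of JC evidence entries from a GenomeDiff
# file, keying on the junction coordinates. Column positions are inferred
# from the JC error lines above; merged.gd / deduped.gd are example names.
set -euo pipefail

dedupe_jc() {
  awk -F'\t' '
    $1 != "JC" { print; next }                         # pass non-JC lines through
    { key = $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10 }  # both sides + overlap
    !seen[key]++                                       # print first occurrence only
  ' "$1"
}

# Example: dedupe_jc merged.gd > deduped.gd
```

Running something like this over the merged file before the second breseq pass would mimic the manual removal step, while keeping the first copy of each junction.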

cjfields commented 6 years ago

@jeffreybarrick let me know if you need the FASTQ files for testing. Thanks!

jeffreybarrick commented 6 years ago

@cjfields It looks like I do need a set of FASTQ files to reproduce the problem. Thanks!

cjfields commented 6 years ago

@jeffreybarrick I added the trimmed FASTQ to the same Box account. Thanks!

jeffreybarrick commented 6 years ago

@cjfields Thanks! - This helped me fix two rare cases that weren't covered. Both were related to having the reference in many contigs. The current version in the repo was able to complete runs on two of the FASTQ datasets that I checked, so I think it's good to go.

cjfields commented 6 years ago

@jeffreybarrick I'll give it a try and will report back. Thanks tremendously!

cjfields commented 6 years ago

@jeffreybarrick looks like that fixed the primary issue.

There are a few examples where a variant or two is missing, but it captures the vast majority, which should be fine for this study. We are also noticing that a negative quality score is reported for some calls (which confounds GATK and Picard downstream), but we can screen for those (they should all be 0). I can report that as a separate issue.
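The downstream screen can be sketched with awk. This assumes the calls were exported to an uncompressed plain-text VCF (e.g. via gdtools GD2VCF); it keeps header lines and any record whose QUAL (column 6) is missing or non-negative, and the file names are examples.

```shell
#!/usr/bin/env bash
# Sketch: drop VCF records with a negative QUAL before handing the file
# to GATK/Picard. Assumes an uncompressed VCF; file names are examples.
set -euo pipefail

drop_negative_qual() {
  awk -F'\t' '/^#/ || $6 == "." || $6 + 0 >= 0' "$1"
}

# Example: drop_negative_qual calls.vcf > calls.filtered.vcf
```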

bensprung commented 1 year ago

I think there might be a minor display bug associated with gdtools merge: I'm following the steps outlined here, and I'm finding that if you do gdtools merge -e -o merged.gd a.gd b.gd c.gd d.gd, then only b.gd, c.gd, and d.gd are listed under "Reading input GD files". The same thing happens when using *.gd. I'm pretty sure this is just a display bug; for instance, touch 000.gd; gdtools merge -e -o merged.gd *.gd gives an error about 000.gd, so the first file is clearly being read.

bensprung commented 1 year ago

Also just to clarify, when doing this process the second run using the merged .gd file must be done with the -p (--polymorphism-prediction) flag set?

jeffreybarrick commented 1 year ago

Yes, that's a display error: it doesn't print 000.gd, but the file does get merged.

Yes, I would use -p for the second run, though it shouldn't actually matter, because any evidence item in the --user-evidence-gd file will be reported regardless of how many reads support it.

bensprung commented 1 year ago

Thanks!