barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
149 stars 21 forks source link

Information on 'unique tag' for duplicate entries in genome diff error #144

Closed jsa-aerial closed 3 years ago

jsa-aerial commented 6 years ago

I can't seem to find much of any information on what the 'magic' tag is, that's indicated in the error message:

Duplicate entries in Genome Diff:

Add a 'unique' tag to one if this is intentional. Even reading the code doesn't really offer much in the way of any information on what this is. Also there isn't much information on what/how this happens as the genome diff file is generated internally with no indication of the chain from user input to generation of the error condition. Thanks for any information.
jeffreybarrick commented 6 years ago

If you are seeing this "Duplicate entries in Genome Diff" error with a GD file directly output from breseq then it is due to a bug. breseq should never output two identical entries. If you are still experiencing this, please update to the current version of breseq and see if the problem persists.

The 'unique' tag is an undocumented way of dealing with some complex manual annotation cases. When manually annotating a set of genomes in a phylogeny and properly counting them when it is sometimes necessary to add the exact same GD entry twice – because a mutation occurs, then it gets reverted, and it occurs again. You can set "unique=1", "unique=2", etc. to make breseq accept this kind of manual annotation. We tend to use the phylogeny_id tag instead of this now. You can see some examples in LTEE-Ecoli genome diff files. If it's useful to you, then I can elaborate on things and try to document this.

Cheers!

cjfields commented 6 years ago

I'm seeing this as well, but in relation to #145, using breseq v0.31.1 (I'll try v0.32.0 to see if this persists).

...
>>> Error(s) in GenomeDiff format. FILE:  <<<
>>ERROR: Attempt to add duplicate of this existing entry from line NA:
JC<tab>132<tab>.<tab>NZ_JRYM01000001<tab>221437<tab>-1<tab>NZ_JRYM01000050<tab>80<tab>-1<tab>0<tab>alignment_overlap=56<tab>coverage_minus=79<tab>coverage_plus=40<tab>flanking_left=250<tab>flanking_right=80<tab>key=NZ_JRYM01000001__221437__-1__NZ_JRYM01000050__136__-1__56____250__80__0__0<tab>max_left=188<tab>max_left_minus=185<tab>max_left_plus=188<tab>max_min_left=76<tab>max_min_left_minus=76<tab>max_min_left_plus=0<tab>max_min_right=80<tab>max_min_right_minus=80<tab>max_min_right_plus=79<tab>max_pos_hash_score=340<tab>max_right=80<tab>max_right_minus=80<tab>max_right_plus=79<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=84<tab>side_1_continuation=0<tab>side_1_overlap=56<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=119
Add a 'unique' tag to one if this is intentional.
>>LINE:     0
JC<tab>171<tab>.<tab>NZ_JRYM01000001<tab>221437<tab>-1<tab>NZ_JRYM01000050<tab>80<tab>-1<tab>0<tab>alignment_overlap=56<tab>coverage_minus=0<tab>coverage_plus=0<tab>flanking_left=250<tab>flanking_right=250<tab>key=NZ_JRYM01000001__221437__-1__NZ_JRYM01000050__136__-1__56____250__250__0__0__UD<tab>max_left=0<tab>max_left_minus=0<tab>max_left_plus=0<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=0<tab>max_min_right_minus=0<tab>max_min_right_plus=0<tab>max_pos_hash_score=340<tab>max_right=0<tab>max_right_minus=0<tab>max_right_plus=0<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=0<tab>reject=COVERAGE_EVENNESS_SKEW<tab>side_1_continuation=0<tab>side_1_overlap=56<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=0<tab>user_defined=1
ajlopatkin commented 5 years ago

Seeing this same error on breseq 0.33.2. Running about 200 paired-end samples results in about 40 that failed at the last step with this error. Tried running each sample individually as well, with no luck. I also tried version 0.33.0 and 0.33.1 to see if this was a regression, but all versions seem to have the same issue.


Predicting mutations from evidence...
  Preparing junctions...
  Predicting mobile element insertions...
  Predicting large deletions...
  Predicting small indels and substitutions from junctions...
  Predicting small indels and substitutions from alignments...
  Making final adjustments to mutations...
  Writing final GD file...

>>> Error(s) in GenomeDiff format. FILE:  <<<

>>ERROR: Attempt to add duplicate of this existing entry:
MOB<tab>12<tab>68,70<tab>U00096<tab>2864700<tab>IS1<tab>-1<tab>9<tab>before=15
Add a 'unique' tag to one if this is intentional.
>>ON LINE:     0
MOB<tab>13<tab>56,61<tab>U00096<tab>2864700<tab>IS1<tab>-1<tab>9<tab>before=15

>>ERROR: Attempt to add duplicate of this existing entry:
MOB<tab>13<tab>56,61<tab>U00096<tab>2864700<tab>IS1<tab>-1<tab>9<tab>before=15
Add a 'unique' tag to one if this is intentional.
>>ON LINE:     0
MOB<tab>14<tab>55,60<tab>U00096<tab>2864700<tab>IS1<tab>-1<tab>9<tab>before=15

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Fatal formatting error in GD file being written: Results/output/output.gd
FILE: genome_diff.cpp   LINE: 730
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Backtrace with 5 stack frames.
0   breseq                              0x00000001016d00b5 _ZN6breseq16my_error_handlerEbbbPKcS1_iRKNSt3__112basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEE + 1205
1   breseq                              0x00000001017a379d _ZN6breseq11cGenomeDiff5writeERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE + 3069
2   breseq                              0x00000001016eca15 _Z21breseq_default_actioniPPc + 24341
3   breseq                              0x00000001016f610e main + 1662
4   libdyld.dylib                       0x00007fff5ba15ed9 start + 1
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!```
jeffreybarrick commented 5 years ago

Could you give me some more info on your run and share an example dataset for debugging? It seems like you are using the --user-evidence-gd option? You can contact me at the email address in the breseq header to arrange sharing input files.

ajlopatkin commented 5 years ago

Sure thing, will send info for the files. This was actually just on the initial run, using breseq -t -r inputs/MG1655_orig.gb -o Results inputs/1_HT3CLBCX2.1.TTGGAATG_CTCTCTAT.unmapped.1.fastq.gz.

jeffreybarrick commented 5 years ago

The most recent Duplicate Mutation error is fixed in v0.34.0. See: https://github.com/barricklab/breseq/issues/203#issuecomment-531443108

mjohnson11 commented 4 years ago

Hi Jeff! I'm having this same issue with v0.35.1. This is my first foray into breseq so I could be making any number of other mistakes, but I am trying to rerun my 539 samples with a merged gd file like: gdtools merge -e -o ../../../Output/WGS/breseq_output/merged_gds.gd samp1/output/output.gd samp2/output/output.gd and then breseq -p --brief-html-output --user-evidence-gd ${OUTD}breseq_output/merged_gds.gd etc. where etc. are the same polymorphism options I used in the initial run without --user-evidence-gd (which worked fine as far as I can tell).

Then when running that command, for only some samples, I get this gd file duplicate error either when writing output.gd or when writing jc_evidence.gd.

>>> Error(s) in GenomeDiff format. FILE:  <<<

>>ERROR: Attempt to add duplicate of this existing entry:
JC<tab>156<tab>.<tab>chrXII<tab>460366<tab>-1<tab>chrXII<tab>460367<tab>1<tab>-4<tab>alignment_overlap=-4<tab>coverage_minus=1<tab>coverage_plus=3<tab>flanking_left=151<tab>flanking_right=151<tab>key=chrXII__460366__-1__chrXII__460367__1__-4__TTTT__151__151__1__0<tab>max_left=96<tab>max_left_minus=54<tab>max_left_plus=96<tab>max_min_left=54<tab>max_min_left_minus=54<tab>max_min_left_plus=44<tab>max_min_right=65<tab>max_min_right_minus=0<tab>max_min_right_plus=65<tab>max_pos_hash_score=206<tab>max_right=101<tab>max_right_minus=90<tab>max_right_plus=101<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=4<tab>side_1_continuation=29<tab>side_1_overlap=0<tab>side_1_redundant=1<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=4<tab>unique_read_sequence=TTTT
Add a 'unique' tag to one if this is intentional.
>>ON LINE:     0
JC<tab>754<tab>.<tab>chrXII<tab>460366<tab>-1<tab>chrXII<tab>460367<tab>1<tab>-4<tab>alignment_overlap=-4<tab>coverage_minus=0<tab>coverage_plus=0<tab>flanking_left=151<tab>flanking_right=151<tab>key=chrXII__460366__-1__chrXII__460367__1__-4__TTTT__151__151__0__0__UD<tab>max_left=0<tab>max_left_minus=0<tab>max_left_plus=0<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=0<tab>max_min_right_minus=0<tab>max_min_right_plus=0<tab>max_pos_hash_score=206<tab>max_right=0<tab>max_right_minus=0<tab>max_right_plus=0<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=0<tab>reject=COVERAGE_EVENNESS_SKEW<tab>side_1_continuation=29<tab>side_1_overlap=0<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=0<tab>unique_read_sequence=TTTT<tab>user_defined=1

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Fatal formatting error in GD file being written: ../../../Output/WGS/breseq_output/rerun/G70_P1C04/05_alignment_correction/jc_evidence.gd
FILE: genome_diff.cpp   LINE: 746
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
breseq(+0x1b176) [0x557a5f8cc176]
breseq(+0xc3740) [0x557a5f974740]
breseq(+0x1a53db) [0x557a5fa563db]
breseq(+0x4ad91) [0x557a5f8fbd91]
breseq(+0xc652) [0x557a5f8bd652]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2ac4623aa3d5]
breseq(+0x1a629) [0x557a5f8cb629]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Creating merged genome diff evidence file...
Predicting mutations from evidence...
  Preparing junctions...
  Predicting mobile element insertions...
  Predicting large deletions...
  Predicting small indels and substitutions from junctions...
  Predicting small indels and substitutions from alignments...
  Making final adjustments to mutations...
  Writing final GD file...

>>> Error(s) in GenomeDiff format. FILE:  <<<

>>ERROR: Attempt to add duplicate of this existing entry:
DEL<tab>18060<tab>237561<tab>chrIV<tab>369282<tab>4<tab>frequency=2.35701906e-01
Add a 'unique' tag to one if this is intentional.
>>ON LINE:     0
DEL<tab>18061<tab>136577,136578,136579,136580<tab>chrIV<tab>369282<tab>4<tab>frequency=1

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Fatal formatting error in GD file being written: ../../../Output/WGS/breseq_output/rerun/G70_P1C09/output/output.gd
FILE: genome_diff.cpp   LINE: 746
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
breseq(+0x1b176) [0x55588e4d6176]
breseq(+0xc3740) [0x55588e57e740]
breseq(+0x4c5ae) [0x55588e5075ae]
breseq(+0xc652) [0x55588e4c7652]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2b509858e3d5]
breseq(+0x1a629) [0x55588e4d5629]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
jeffreybarrick commented 4 years ago

This seems like it is probably a new issue that manifests in the same way. Can you share the input files and command so that I can reproduce the problem? Email me at the link in the breseq header if you'd like for me to create a shared folder for uploading large files.

jeffreybarrick commented 4 years ago

I was able to reproduce the error, and this should now be fixed in the repository version. As a bonus, I fixed a related memory leak so memory usage shouldn't balloon at the output phase when there are many JC evidence items.

Let me know if you want me to make a temporary source or binary download for you to use before this gets included in a new released version.

Also, I see that you are analyzing yeast data that has a lot of polymorphisms. In general, the yeast genome and having a lot of polymorphisms challenges some of the analysis assumptions made by breseq, so I hope you can get usable output. One problem specific to yeast is the repetitive, but not exactly repetitive telomeric regions. Things might run more smoothly if you can mask these (as N's in the reference file).

mjohnson11 commented 4 years ago

Thanks! I will look into using the repository version soon! And thanks for the tips on yeast workarounds - I'm currently running GATK too and want to compare the output with breseq, depending on how it looks I might end up doing that masking - let me know if you know of anyone who has done that before, I might want to ask them about which specific regions they masked (alternately I would just look up where these repetitive regions stop and start, but if someone is ahead of me on that I'd take the shortcut!).