Closed jsa-aerial closed 3 years ago
If you are seeing this "Duplicate entries in Genome Diff" error with a GD file directly output from breseq then it is due to a bug. breseq should never output two identical entries. If you are still experiencing this, please update to the current version of breseq and see if the problem persists.
The 'unique' tag is an undocumented way of dealing with some complex manual annotation cases. When manually annotating a set of genomes in a phylogeny and properly counting them when it is sometimes necessary to add the exact same GD entry twice – because a mutation occurs, then it gets reverted, and it occurs again. You can set "unique=1", "unique=2", etc. to make breseq accept this kind of manual annotation. We tend to use the phylogeny_id tag instead of this now. You can see some examples in LTEE-Ecoli genome diff files. If it's useful to you, then I can elaborate on things and try to document this.
Cheers!
I'm seeing this as well, but in relation to #145, using breseq v0.31.1 (I'll try v0.32.0 to see if this persists).
...
>>> Error(s) in GenomeDiff format. FILE: <<<
>>ERROR: Attempt to add duplicate of this existing entry from line NA:
JC<tab>132<tab>.<tab>NZ_JRYM01000001<tab>221437<tab>-1<tab>NZ_JRYM01000050<tab>80<tab>-1<tab>0<tab>alignment_overlap=56<tab>coverage_minus=79<tab>coverage_plus=40<tab>flanking_left=250<tab>flanking_right=80<tab>key=NZ_JRYM01000001__221437__-1__NZ_JRYM01000050__136__-1__56____250__80__0__0<tab>max_left=188<tab>max_left_minus=185<tab>max_left_plus=188<tab>max_min_left=76<tab>max_min_left_minus=76<tab>max_min_left_plus=0<tab>max_min_right=80<tab>max_min_right_minus=80<tab>max_min_right_plus=79<tab>max_pos_hash_score=340<tab>max_right=80<tab>max_right_minus=80<tab>max_right_plus=79<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=84<tab>side_1_continuation=0<tab>side_1_overlap=56<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=119
Add a 'unique' tag to one if this is intentional.
>>LINE: 0
JC<tab>171<tab>.<tab>NZ_JRYM01000001<tab>221437<tab>-1<tab>NZ_JRYM01000050<tab>80<tab>-1<tab>0<tab>alignment_overlap=56<tab>coverage_minus=0<tab>coverage_plus=0<tab>flanking_left=250<tab>flanking_right=250<tab>key=NZ_JRYM01000001__221437__-1__NZ_JRYM01000050__136__-1__56____250__250__0__0__UD<tab>max_left=0<tab>max_left_minus=0<tab>max_left_plus=0<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=0<tab>max_min_right_minus=0<tab>max_min_right_plus=0<tab>max_pos_hash_score=340<tab>max_right=0<tab>max_right_minus=0<tab>max_right_plus=0<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=0<tab>reject=COVERAGE_EVENNESS_SKEW<tab>side_1_continuation=0<tab>side_1_overlap=56<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=0<tab>user_defined=1
Seeing this same error on breseq 0.33.2. Running about 200 paired-end samples results in about 40 that failed at the last step with this error. Tried running each sample individually as well, with no luck. I also tried version 0.33.0 and 0.33.1 to see if this was a regression, but all versions seem to have the same issue.
Predicting mutations from evidence...
Preparing junctions...
Predicting mobile element insertions...
Predicting large deletions...
Predicting small indels and substitutions from junctions...
Predicting small indels and substitutions from alignments...
Making final adjustments to mutations...
Writing final GD file...
>>> Error(s) in GenomeDiff format. FILE: <<<
>>ERROR: Attempt to add duplicate of this existing entry:
MOB<tab>12<tab>68,70<tab>U00096<tab>2864700<tab>IS1<tab>-1<tab>9<tab>before=15
Add a 'unique' tag to one if this is intentional.
>>ON LINE: 0
MOB<tab>13<tab>56,61<tab>U00096<tab>2864700<tab>IS1<tab>-1<tab>9<tab>before=15
>>ERROR: Attempt to add duplicate of this existing entry:
MOB<tab>13<tab>56,61<tab>U00096<tab>2864700<tab>IS1<tab>-1<tab>9<tab>before=15
Add a 'unique' tag to one if this is intentional.
>>ON LINE: 0
MOB<tab>14<tab>55,60<tab>U00096<tab>2864700<tab>IS1<tab>-1<tab>9<tab>before=15
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Fatal formatting error in GD file being written: Results/output/output.gd
FILE: genome_diff.cpp LINE: 730
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Backtrace with 5 stack frames.
0 breseq 0x00000001016d00b5 _ZN6breseq16my_error_handlerEbbbPKcS1_iRKNSt3__112basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEE + 1205
1 breseq 0x00000001017a379d _ZN6breseq11cGenomeDiff5writeERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEE + 3069
2 breseq 0x00000001016eca15 _Z21breseq_default_actioniPPc + 24341
3 breseq 0x00000001016f610e main + 1662
4 libdyld.dylib 0x00007fff5ba15ed9 start + 1
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!```
Could you give me some more info on your run and share an example dataset for debugging? It seems like you are using the --user-evidence-gd
option? You can contact me at the email address in the breseq header to arrange sharing input files.
Sure thing, will send info for the files. This was actually just on the initial run, using breseq -t -r inputs/MG1655_orig.gb -o Results inputs/1_HT3CLBCX2.1.TTGGAATG_CTCTCTAT.unmapped.1.fastq.gz
.
The most recent Duplicate Mutation error is fixed in v0.34.0. See: https://github.com/barricklab/breseq/issues/203#issuecomment-531443108
Hi Jeff! I'm having this same issue with v0.35.1. This is my first foray into breseq so I could be making any number of other mistakes, but I am trying to rerun my 539 samples with a merged gd file like:
gdtools merge -e -o ../../../Output/WGS/breseq_output/merged_gds.gd samp1/output/output.gd samp2/output/output.gd
and then
breseq -p --brief-html-output --user-evidence-gd ${OUTD}breseq_output/merged_gds.gd etc.
where etc. are the same polymorphism options I used in the initial run without --user-evidence-gd (which worked fine as far as I can tell).
Then when running that command, for only some samples, I get this gd file duplicate error either when writing output.gd or when writing jc_evidence.gd.
>>> Error(s) in GenomeDiff format. FILE: <<<
>>ERROR: Attempt to add duplicate of this existing entry:
JC<tab>156<tab>.<tab>chrXII<tab>460366<tab>-1<tab>chrXII<tab>460367<tab>1<tab>-4<tab>alignment_overlap=-4<tab>coverage_minus=1<tab>coverage_plus=3<tab>flanking_left=151<tab>flanking_right=151<tab>key=chrXII__460366__-1__chrXII__460367__1__-4__TTTT__151__151__1__0<tab>max_left=96<tab>max_left_minus=54<tab>max_left_plus=96<tab>max_min_left=54<tab>max_min_left_minus=54<tab>max_min_left_plus=44<tab>max_min_right=65<tab>max_min_right_minus=0<tab>max_min_right_plus=65<tab>max_pos_hash_score=206<tab>max_right=101<tab>max_right_minus=90<tab>max_right_plus=101<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=4<tab>side_1_continuation=29<tab>side_1_overlap=0<tab>side_1_redundant=1<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=4<tab>unique_read_sequence=TTTT
Add a 'unique' tag to one if this is intentional.
>>ON LINE: 0
JC<tab>754<tab>.<tab>chrXII<tab>460366<tab>-1<tab>chrXII<tab>460367<tab>1<tab>-4<tab>alignment_overlap=-4<tab>coverage_minus=0<tab>coverage_plus=0<tab>flanking_left=151<tab>flanking_right=151<tab>key=chrXII__460366__-1__chrXII__460367__1__-4__TTTT__151__151__0__0__UD<tab>max_left=0<tab>max_left_minus=0<tab>max_left_plus=0<tab>max_min_left=0<tab>max_min_left_minus=0<tab>max_min_left_plus=0<tab>max_min_right=0<tab>max_min_right_minus=0<tab>max_min_right_plus=0<tab>max_pos_hash_score=206<tab>max_right=0<tab>max_right_minus=0<tab>max_right_plus=0<tab>neg_log10_pos_hash_p_value=NT<tab>pos_hash_score=0<tab>reject=COVERAGE_EVENNESS_SKEW<tab>side_1_continuation=29<tab>side_1_overlap=0<tab>side_1_redundant=0<tab>side_2_continuation=0<tab>side_2_overlap=0<tab>side_2_redundant=0<tab>total_non_overlap_reads=0<tab>unique_read_sequence=TTTT<tab>user_defined=1
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Fatal formatting error in GD file being written: ../../../Output/WGS/breseq_output/rerun/G70_P1C04/05_alignment_correction/jc_evidence.gd
FILE: genome_diff.cpp LINE: 746
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
breseq(+0x1b176) [0x557a5f8cc176]
breseq(+0xc3740) [0x557a5f974740]
breseq(+0x1a53db) [0x557a5fa563db]
breseq(+0x4ad91) [0x557a5f8fbd91]
breseq(+0xc652) [0x557a5f8bd652]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2ac4623aa3d5]
breseq(+0x1a629) [0x557a5f8cb629]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Creating merged genome diff evidence file...
Predicting mutations from evidence...
Preparing junctions...
Predicting mobile element insertions...
Predicting large deletions...
Predicting small indels and substitutions from junctions...
Predicting small indels and substitutions from alignments...
Making final adjustments to mutations...
Writing final GD file...
>>> Error(s) in GenomeDiff format. FILE: <<<
>>ERROR: Attempt to add duplicate of this existing entry:
DEL<tab>18060<tab>237561<tab>chrIV<tab>369282<tab>4<tab>frequency=2.35701906e-01
Add a 'unique' tag to one if this is intentional.
>>ON LINE: 0
DEL<tab>18061<tab>136577,136578,136579,136580<tab>chrIV<tab>369282<tab>4<tab>frequency=1
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Fatal formatting error in GD file being written: ../../../Output/WGS/breseq_output/rerun/G70_P1C09/output/output.gd
FILE: genome_diff.cpp LINE: 746
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> STACK TRACE <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
breseq(+0x1b176) [0x55588e4d6176]
breseq(+0xc3740) [0x55588e57e740]
breseq(+0x4c5ae) [0x55588e5075ae]
breseq(+0xc652) [0x55588e4c7652]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2b509858e3d5]
breseq(+0x1a629) [0x55588e4d5629]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
This seems like it is probably a new issue that manifests in the same way. Can you share the input files and command so that I can reproduce the problem? Email me at the link in the breseq header if you'd like for me to create a shared folder for uploading large files.
I was able to reproduce the error, and this should now be fixed in the repository version. As a bonus, I fixed a related memory leak so memory usage shouldn't balloon at the output phase when there are many JC evidence items.
Let me know if you want me to make a temporary source or binary download for you to use before this gets included in a new released version.
Also, I see that you are analyzing yeast data that has a lot of polymorphisms. In general, the yeast genome and having a lot of polymorphisms challenges some of the analysis assumptions made by breseq, so I hope you can get usable output. One problem specific to yeast is the repetitive, but not exactly repetitive telomeric regions. Things might run more smoothly if you can mask these (as N's in the reference file).
Thanks! I will look into using the repository version soon! And thanks for the tips on yeast workarounds - I'm currently running GATK too and want to compare the output with breseq, depending on how it looks I might end up doing that masking - let me know if you know of anyone who has done that before, I might want to ask them about which specific regions they masked (alternately I would just look up where these repetitive regions stop and start, but if someone is ahead of me on that I'd take the shortcut!).
I can't seem to find much of any information on what the 'magic' tag is, that's indicated in the error message:
Duplicate entries in Genome Diff: