biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

markdup segmentation fault #393

Closed. ZhifeiLuo closed this issue 4 years ago.

ZhifeiLuo commented 5 years ago

sambamba 0.6.8 from bioconda

I ran markdup on a cluster and it produced a segmentation fault, leaving behind a core.* temporary file. The error is consistent across all the machines I tried, with different CPUs (both Xeon and AMD Opteron). Any thoughts?

rikrdo89 commented 5 years ago

I got the same error, "Segmentation fault (core dumped)", using sambamba 0.7.0 installed with conda and running on an LSF cluster. :/

pjotrp commented 5 years ago

Conda builds with a recent ldc. That is the problem at this point.

pjotrp commented 4 years ago

A heads up.

For now, build sambamba with an older ldc, like the binary released on GitHub.

We decided to replace the original BAM reader with a new one I wrote almost two years ago and have been testing since. Rather than fix the GC-related issue, we are going to switch to the new BAM reader, which is simpler and therefore (hopefully) easier to maintain. @NickRoz1, who worked for me as a Google Summer of Code student on column-based BAMs, is doing that work.

One important difference is that the worker threads are no longer part of the reader itself. My theory is that performance should be similar ;)

vanottee commented 4 years ago

Same segmentation fault issue here; I had installed it on an HPC system with bioconda. Is there a way to know when the Conda build has been updated? Alternatively, if I build from the 0.7.0 source code, how do I actually execute a tool like markdup? Thanks!

pjotrp commented 4 years ago

I am working on fixing this bug. Track progress here https://thebird.nl/blog/work/rotate.html

pjotrp commented 4 years ago

@sjackman we have a new release of sambamba that fixes a long-standing bug with the D runtime and should now compile with all versions of ldc. See also https://travis-ci.org/biod/sambamba

sjackman commented 4 years ago

Thanks for the heads up, Pjotr. Once you've tagged a new release, would you like to open a PR to bump the version of Sambamba in Brewsci/bio? You need to change only two lines, and you can do it from the GitHub web interface if you like. See https://github.com/brewsci/homebrew-bio/blob/c5b38cfea1eff4b18ae19d9dded00f990769de84/Formula/sambamba.rb#L6-L7

pjotrp commented 4 years ago

@sjackman feels dirty to edit through the web interface. But hopefully it works :)

sjackman commented 4 years ago

Hehe. Thanks!

pmagwene commented 3 years ago

I hate to revive a closed issue but I'm running into the segmentation fault issue with markdup in 0.8.0 as installed from conda.

My build:

sambamba 0.8.0
 by Artem Tarasov and Pjotr Prins (C) 2012-2020
    LDC 1.24.0 / DMD v2.094.1 / LLVM11.0.0 / bootstrap LDC - the LLVM D compiler (1.24.0)

System info:

pjotrp commented 3 years ago

Does this happen reproducibly? Can you share your BAM file in that case? Hard to fix stuff if we don't get reproducible errors.

pmagwene commented 3 years ago

An interesting twist on this. The file linked below reproducibly generates segmentation faults on my system when running single-threaded (or with whatever the default thread count is; sambamba markdup -h doesn't say), but it completes OK when running with -t <= 10 (the i9-9900k is 8 cores / 16 threads).

https://drive.google.com/file/d/1owSZqrrWkrfbsEdf81iLrlByyOmfU4hs/view?usp=sharing

pjotrp commented 3 years ago

@pmagwene, thanks for the test file. This is a new issue and not related to the earlier segfaults. I cannot reproduce it on an AMD Ryzen 7 3700X 8-core processor or on an Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz. You may want to try the static binary I'll release in a bit at https://github.com/biod/sambamba/releases. If that fails, I suggest opening a new issue for the i9-9900k.

Emma1997WTT commented 2 years ago

ah maybe this is due to a high number of duplicates. I increased the --hash-table-size=4194304 and then it runs through without a segfault

Hello, I solved my segfault with --hash-table-size=4194304! But I don't quite understand the meaning of this parameter; can you explain it to me? Thank you so much!

NickRoz1 commented 2 years ago

Hi. This parameter sets the size of a hashmap. https://en.m.wikipedia.org/wiki/Hash_table https://github.com/biod/sambamba/blob/8a4102f9b2f95799dd0f432e6db15c57df75f537/sambamba/markdup.d#L318

If you have 10 key-value pairs but the hashmap has only 9 cells, then some key will be mapped to the same cell as another key (a collision). The same can happen with only 9 keys if some of them hash to the same cell.

In the case of sambamba, I think it moves colliding values to a separate overflow list, and then uses temp files to store these values if the list fills up.

Maybe this functionality wasn't well tested, or it didn't account for a dataset of your size or with your number of duplicates ( https://github.com/biod/sambamba/blob/8a4102f9b2f95799dd0f432e6db15c57df75f537/sambamba/markdup.d#L467 ).
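The collision behavior described above can be sketched in a few lines of Python. This is a generic illustration of a fixed-size hash table with chaining, not sambamba's actual D implementation; the class and method names are hypothetical:

```python
# Minimal fixed-size hash table with chaining, to illustrate why a table
# that is smaller than the number of keys forces collisions.
# Generic sketch only; sambamba's markdup hash table works differently in detail.

class FixedHashTable:
    def __init__(self, size):
        self.size = size
        # each cell holds a chain (list) of entries that hashed to it
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        cell = hash(key) % self.size  # two keys can map to the same cell
        self.buckets[cell].append((key, value))

    def collisions(self):
        # count entries that share a cell with at least one other entry
        return sum(len(b) - 1 for b in self.buckets if len(b) > 1)

# 10 key-value pairs in a 9-cell table: at least one collision is forced
# by the pigeonhole principle, which is the situation described above.
table = FixedHashTable(9)
for i in range(10):
    table.put(f"read_{i}", i)
print(table.collisions() >= 1)  # True
```

This is why raising --hash-table-size helps: a larger table spreads the keys over more cells, so fewer entries spill into the overflow handling that the linked markdup.d code implements.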


Emma1997WTT commented 2 years ago


Thanks so much for your quick and professional reply,I get it now! Have a nice day!