biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

markdup segmentation fault #393

Closed: ZhifeiLuo closed this issue 4 years ago

ZhifeiLuo commented 5 years ago

sambamba 0.6.8 from bioconda

I ran markdup on a cluster and it produced a segmentation fault, leaving a core.* temporary file behind. The error is consistent across machines with different CPUs (both Xeon and AMD Opteron). Any thoughts?

steffenheyne commented 5 years ago

Probably a similar issue here. I'm also on conda v0.6.8.

For test purposes I have a very small BAM, and the crash seems related to the number of threads used. The same small BAM works fine with -t 1..8, but with -t 9 it segfaults. Maybe some threads get zero data!?

$sambamba markdup --remove-duplicates -t 9 in.bam out.bam

sambamba 0.6.8 by Artem Tarasov and Pjotr Prins (C) 2012-2018
    LDC 1.13.0 / DMD v2.083.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)

finding positions of the duplicate reads in the file...
  sorted 324798 end pairs
     and 324 single ends (among them 0 unmatched pairs)
  collecting indices of duplicate reads...   done in 57 ms
  found 361424 duplicates
collected list of positions in 0 min 0 sec
Segmentation fault (core dumped)
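To narrow down where the crash starts, a thread-count sweep can help. A minimal sketch, assuming `sambamba` is on PATH and `in.bam` is a small test file (both placeholders):

```shell
# Sweep markdup over increasing thread counts to see at which -t value
# the segfault first appears; in.bam is a placeholder for a small test BAM.
for t in $(seq 1 16); do
  if sambamba markdup --remove-duplicates -t "$t" in.bam "out.t${t}.bam" >/dev/null 2>&1; then
    echo "t=$t ok"
  else
    echo "t=$t failed (exit $?)"
  fi
done
```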

steffenheyne commented 5 years ago

Ah, maybe this is due to a high number of duplicates. I increased --hash-table-size to 4194304 and then it runs through without a segfault.

steffenheyne commented 5 years ago

So actually the segfaults seem more random. I changed parameters and sometimes it crashes, sometimes it runs through... so far it is unpredictable what causes this.

Could this be a bioconda version/build issue? The underlying OS is Ubuntu 18.04.

steffenheyne commented 5 years ago

Here is a dump: catchsegv.1.txt
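For reference, a dump like this can be captured by wrapping the run in glibc's catchsegv, which prints a backtrace and register dump on SIGSEGV. A sketch; the file names are placeholders:

```shell
# catchsegv (part of glibc) runs a command and, on SIGSEGV, prints a
# backtrace and register dump; in.bam/out.bam are placeholders.
catchsegv sambamba markdup --remove-duplicates -t 9 in.bam out.bam > catchsegv.1.txt 2>&1 || true
echo "run finished; any trace is in catchsegv.1.txt"
```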

pjotrp commented 5 years ago

Thanks for the dump. I am happy to chase the problem. You can see it is happening in the garbage collector. Can you try the latest binary release and see if this does the same? https://github.com/biod/sambamba/releases

Also, what hardware are you on? Some Xeons have known threading problems.

steffenheyne commented 5 years ago

Different cloud servers, but the last one was an Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz.

dump from cloned master:

catchsegv.2.txt

I compiled with make debug, but the trace looks less informative than the one from the conda version.

pjotrp commented 5 years ago

I think that is one of the unreliable Xeons for hyperthreading. Sambamba is one of the rare tools which brings it out by utilizing all cores. See also #335. Check the list by Intel.

steffenheyne commented 5 years ago

Mhmm, that's bad. So using fewer cores could be more stable?

I also get a segfault with sambamba depth ....

pjotrp commented 5 years ago

One option is to turn off hyperthreading. Sadly that is hard in most HPC environments. Alternatively don't use those machines or rerun until it gets through. You can blame Intel.
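On Linux kernels since 4.19, the SMT (hyperthreading) state can be checked and toggled through sysfs without a BIOS change. A sketch; changing the setting requires root, and HPC environments may not allow it:

```shell
# Report whether SMT (hyperthreading) is currently active; the sysfs knob
# exists on Linux >= 4.19. Toggling it requires root.
if [ -r /sys/devices/system/cpu/smt/active ]; then
  echo "SMT active: $(cat /sys/devices/system/cpu/smt/active)"  # 1 = on, 0 = off
else
  echo "SMT sysfs interface not available on this kernel"
fi
# To turn SMT off at runtime (as root):
#   echo off > /sys/devices/system/cpu/smt/control
```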

steffenheyne commented 5 years ago

yeah thanks!

Btw, I downgraded to v0.6.6 via conda and this seems more stable; so far I don't get any segfaults with the same data.

Does this version use a different underlying library or something like that?

pjotrp commented 5 years ago

It is possible. What version of LLVM and LDC is it using? The underlying tools are evolving fast. Sambamba itself did not essentially change markdup between those versions.

steffenheyne commented 5 years ago

It looks like the 0.6.6 conda version uses your pre-built packages from GitHub (https://github.com/biod/sambamba/releases/download/v0.6.6/sambamba_v0.6.6_linux.tar.bz2),

whereas 0.6.8 is built from scratch on the conda server, as given here: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/sambamba/build.sh

pjotrp commented 5 years ago

Can you try running the pre-built version for 0.6.9?

steffenheyne commented 5 years ago

Unfortunately this doesn't help: the pre-built 0.6.9 also segfaults, and so does a locally built 0.7.0-pre1.

What I observed: with 1-8 threads I rarely (never?) got segfaults with any version, and strangely also not with the local build at 9 threads; higher thread counts segfault really often (always?).

steffenheyne commented 5 years ago

Yeah, with your manually downloaded (not via conda) pre-built 0.6.6 it also (so far) never crashes, whatever the number of threads.
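Since the pre-built 0.6.6 binary seems stable, it can be fetched directly from the release page (the URL is the one the 0.6.6 conda package repackages). A sketch that degrades gracefully when offline:

```shell
# Fetch the pre-built 0.6.6 release tarball directly from GitHub and
# unpack it; fall back to a message when the download is not possible.
url=https://github.com/biod/sambamba/releases/download/v0.6.6/sambamba_v0.6.6_linux.tar.bz2
if curl -fsSL -o sambamba_v0.6.6_linux.tar.bz2 "$url" 2>/dev/null; then
  tar xjf sambamba_v0.6.6_linux.tar.bz2
  echo "unpacked sambamba_v0.6.6"
else
  echo "download failed; fetch sambamba_v0.6.6_linux.tar.bz2 manually"
fi
```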

RichardCorbett commented 5 years ago

Hi folks, I'm using sambamba-0.6.9-linux-static on CentOS 6 and 7 and getting what looks like a similar error.

I ran 40 different samples and only 10 got this error, but I tried this command on a few machines and they all segfaulted.

Do you have any suggestions?

thanks, Richard

/projects/COLO829SomaticTitration/sambamba-0.6.9-linux-static markdup --tmpdir /projects/COLO829SomaticTitration /projects/COLO829SomaticTitration/normals/A36973_5_lanes_dupsFlagged.bam.spec.bam.p1.bam /projects/COLO829SomaticTitration/normals/A36973_5_lanes_dupsFlagged.bam.spec.bam.p1.bam.dupmark.bam

sambamba 0.6.9 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.14.0 / DMD v2.084.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)

finding positions of the duplicate reads in the file...
  sorted 147834456 end pairs
     and 4860622 single ends (among them 1137 unmatched pairs)
  collecting indices of duplicate reads...   done in 35457 ms
  found 2953009 duplicates
collected list of positions in 26 min 49 sec
marking duplicates...
Segmentation fault (core dumped)

pjotrp commented 5 years ago

@RichardCorbett what is the CPU you are on? Does using an older binary of sambamba work?

RichardCorbett commented 5 years ago

CPU tested: Intel Xeon E5-2650 @ 2.20 GHz

I just had some filesystem problems overnight, so retesting is going to be a challenge. In case it is relevant: I noticed that I was testing on a BWA aln 0.5.7 BAM file, which includes non-zero mapping qualities for unaligned reads.

pjotrp commented 5 years ago

Yeah, this may be another Xeon with problems. It will be interesting to see if older versions of sambamba+llvm+ldc show the same problem.

@steffenheyne can you confirm you are still good with 0.6.6 binary? And both of you, is there any way I could access one of these machines so I can debug the problem? If 0.6.6 works I could use that same build chain to compile a new binary and see if that keeps going.

pjotrp commented 5 years ago

sambamba 0.6.6 was built with

    LDC 0.17.1
    using DMD v2.068.2
    using LLVM 3.8.0

pjotrp commented 5 years ago

Another thing we could try is building sambamba with the GNU D compiler. That would fine-tune the diagnostics, telling us whether the problem lies with LLVM or the D runtime.

RichardCorbett commented 5 years ago

A second test run overnight duplicated the error on a machine with this cpu: 144 x Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz

pjotrp commented 5 years ago

Great. I am starting to suspect the D runtime. I should try hitting one of our machines badly.

dpryan79 commented 5 years ago

I'll try rebuilding 0.6.6 in bioconda with more recent pinnings to see if (A) it still works without segfaulting and (B) it can still be combined with other recent packages in an environment. That will allow us (@steffenheyne and me) to get around the current issue. I agree that this is likely a D runtime issue.

pjotrp commented 5 years ago

Thanks. I am too busy to sort it out now, but should have time by Easter.

sirselim commented 5 years ago

Just adding my observation of this issue as well. Processors: 2x Intel Xeon Gold 5118 (12 core, based on Skylake).

Using conda Sambamba (0.6.8 and now 0.6.9) in a Snakemake workflow. Getting segfaults at the markdup stage.

I set the markdup rule to call 0.6.6 from conda and things seem to be processing well.

pjotrp commented 5 years ago

Thanks. This is very annoying, good thing we have 0.6.6.

pjotrp commented 5 years ago

I am on to this now. Can someone send me a BAM file that is proven to fail? I can run it on different architectures next week and see if we can make it work. I suspect D's garbage collector goes out of sync with the threadpool we are using. If that is not it, it may be a rogue malloc/free. But I need something that will consistently segfault on one of our machines. Worst case I can try one of them cloud machines too ;)

dpryan79 commented 5 years ago

I don't know that I've seen any that produce this 100% of the time. It seems like a race condition, since if I rerun things enough times eventually it will work. But I can try sending you something tomorrow unless someone else replies with a better file beforehand.

BTW, thanks for looking into this!

pjotrp commented 5 years ago

Thanks for helping out! If you need a place to upload I can set something up next week.

dpryan79 commented 5 years ago

Not an issue, we have a standard file hosting server set up.

pjotrp commented 5 years ago

@dpryan79 ready. You can send details to pjotr dot public98 at thebird dot nl.

dpryan79 commented 5 years ago

Here's a test file: http://unishare.ie-freiburg.mpg.de:2915/fop/jsdllLo3/SRR2096206.bam

Note that I can only reproduce it on some servers and not others. 0.6.6 doesn't segfault on any of the servers we have.

pjotrp commented 5 years ago

OK, got the BAM file. I'll report on proceedings.

pjotrp commented 5 years ago

Segfault on first try on a Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz with

    ./sambamba-0.6.9-linux-static markdup SRR2096206.bam test.bam
Thread 39 "sambamba-0.6.9-" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 10549]
0x00000000005ef6e4 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv 

I need to build a debug version now.

pjotrp commented 5 years ago

gdb -ex 'handle SIGUSR1 SIGUSR2 nostop noprint' --args ./sambamba-0.6.9-linux-static markdup SRR2096206.bam test.bam

Running in the debugger gives

error reading variable: access outside bounds of object referenced via synthetic pointer

#2  0x0000000000613feb in rt_finalizeFromGC ()
#3  0x000000000062755e in _D2gc4impl12conservativeQw3Gcx5sweepMFNbZm ()
#4  0x00000000006251c3 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm ()
#5  0x0000000000626687 in _D2gc4impl12conservativeQw3Gcx8bigAllocMFNbmKmkxC8TypeInfoZPv ()
#6  0x0000000000623984 in _D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQEgQEgQEeQEp10mallocTimelS_DQFiQFiQFgQFr10numMallocslTmTkTmTxQCzZQFcMFNbKmKkKmKxQDsZQDl ()
#7  0x00000000006259f3 in _DThn16_2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#8  0x000000000060bdb0 in gc_qalloc ()
#9  0x0000000000608be2 in _D4core6memory2GC6qallocFNaNbmkxC8TypeInfoZSQBqQBo8BlkInfo_ ()
#10 0x00000000006135cf in _d_newarrayU ()
#11 0x00000000004c47e5 in _D3std5array__T14arrayAllocImplVbi0TAhTmZQBaFNaNbmZQp (_param_0=140731428614032)
    at /gnu/store/qzcgba5bflcgxsinabfpfy7j5nddvj4f-ldc-1.14.0/include/d/std/array.d:950
#12 _D3std5array__T18uninitializedArrayTAhTmZQBaFNaNbNemZQr (_param_0=140731428614032)
    at /gnu/store/qzcgba5bflcgxsinabfpfy7j5nddvj4f-ldc-1.14.0/include/d/std/array.d:819
#13 _D3bio3std3hts3bam9readrange__T12BamReadRangeSQBsQBrQBqQBpQBo14withoutOffsetsZQBv8allocateMFNaNbNfmZAh (this=<optimized out>, size=<optimized out>)
    at /export/export/local/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:178
#14 _D3bio3std3hts3bam9readrange__T12BamReadRangeSQBsQBrQBqQBpQBo14withoutOffsetsZQBv8readNextMFZv (this=0x7ffe87e7c500)
    at /export/export/local/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:166
#15 0x00000000004eab28 in _D3std11parallelism8TaskPool__T3mapS3bioQBm3hts3bam11multireader17adjustTagsInRangeZ__TQCeTSQDm5range__T11ZipShortestVEQEn8typecons__T4FlagVAyaa18_616c6c4b6e6f776e53616d654c656e677468ZQByi0TSQHh9algorithm9iteration__T9MapResultSQHkQIvQHjQHiQHh__T13multiBamReadsTSQItQKeQIsQIr9readrange__T12BamReadRangeSQKhQLsQKgQKfQBo14withoutOffsetsZQBvZQDpFQDdmZ9__lambda3TSQNuQKi__TQKfVQJvi0TQEnTSQOsQLg__T6RepeatTmZQkZQLqZQHlTSQPxQMl__TQBfTCQPdQQoQPc5utils15samheadermerger15SamHeaderMergerZQDiTQDwZQOtZQRsMFQPpmmZ3Map7fillBufMFASQTwQPj__T5TupleTSQTfQUqQTeQTdQTc__T12MultiBamReadTSQUnQVyQUmQUl4read7BamReadZQBqTQGpZQCyZQDn () at /export/export/local/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:101
#16 0x00000000004eb433 in _D3std11parallelism__T3runTDFASQBd8typecons__T5TupleTS3bioQCe3hts3bam11multireader__T12MultiBamReadTSQBvQDyQBuQBt4read7BamReadZQBqTCQDaQFdQCz5utils15samheadermerger15SamHeaderMergerZQFgZQGbTQGfZQGqFQGpKQGrZQGv (
    fpOrDelegate=<error reading variable: access outside bounds of object referenced via synthetic pointer>, _param_1=...)
    at /gnu/store/qzcgba5bflcgxsinabfpfy7j5nddvj4f-ldc-1.14.0/include/d/std/parallelism.d:764
#17 _D3std11parallelism__T4TaskSQBaQz3runTDFASQBo8typecons__T5TupleTS3bioQCp3hts3bam11multireader__T12MultiBamReadTSQBvQEjQBuQBt4read7BamReadZQBqTCQDaQFoQCz5utils15samheadermerger15SamHeaderMergerZQFgZQGbTQGfZQHb4implFPvZv (myTask=0x7ffff7f006b8 "")
    at /gnu/store/qzcgba5bflcgxsinabfpfy7j5nddvj4f-ldc-1.14.0/include/d/std/parallelism.d:452
#18 0x00000000005ef2c1 in _D3std11parallelism8TaskPool5doJobMFPSQBkQBj12AbstractTaskZv ()
#19 0x00000000005ef40b in std.parallelism.TaskPool.startWorkLoop() ()
#20 0x000000000060976c in thread_entryPoint ()
#21 0x00000000005c5cb9 in start_thread ()
#22 0x000000000068e3db in clone ()

pjotrp commented 5 years ago

A rebuild of sambamba ran without crash the first time with

 LDC 1.10.0 / DMD v2.080.1 / LLVM6.0.1 / bootstrap LDC - the LLVM D compiler (0.17.4)

pjotrp commented 5 years ago

I put out a new binary release which you can use for testing. As a next step I'll need to build with a more recent compiler (again). I'll try to compile with the new DIP1000 as in #399.

pjotrp commented 5 years ago

A build with the latest version of LDC 1.16.0-beta consistently segfaults:

  LDC 1.16.0 / DMD v2.086.0 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)

The stack trace differs from the earlier one with

0x00000000005dd9c4 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()
(gdb) bt                                                                                                                 
#0  0x00000000005dd9c4 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()       
#1  0x00000000004ef65e in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQD
xQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf10yieldForceMFNcNdNeZQCz (this=0x2828282828282828)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/parallelism.d:605
#2  0x000000000060155b in rt_finalizeFromGC ()
#3  0x0000000000615d49 in _D2gc4impl12conservativeQw3Gcx5sweepMFNbZm ()
#4  0x0000000000612c15 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm ()
#5  0x0000000000614681 in _D2gc4impl12conservativeQw3Gcx8bigAllocMFNbmKmkxC8TypeInfoZPv ()
#6  0x00000000006110de in _D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQEgQEgQEeQEp10mallocT
imelS_DQFiQFiQFgQFr10numMallocslTmTkTmTxQCzZQFcMFNbKmKkKmKxQDsZQDl ()
#7  0x0000000000613663 in _DThn16_2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#8  0x00000000005f8b20 in gc_qalloc ()
#9  0x00000000005f5822 in _D4core6memory2GC6qallocFNaNbmkxC8TypeInfoZSQBqQBo8BlkInfo_ ()
#10 0x0000000000600b3f in _d_newarrayU ()
#11 0x00000000004c9435 in _D3std5array__T14arrayAllocImplVbi0TAhTmZQBaFNaNbmZQp (_param_0=140731236642144)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/array.d:953
#12 _D3std5array__T18uninitializedArrayTAhTmZQBaFNaNbNemZQr (_param_0=140731236642144)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/array.d:822
#13 _D3bio3std3hts3bam9readrange__T12BamReadRangeSQBsQBrQBqQBpQBo14withoutOffsetsZQBv8allocateMFNaNbNfmZAh (this=<optimized out>, size=<optimized out>)
    at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:178
#14 _D3bio3std3hts3bam9readrange__T12BamReadRangeSQBsQBrQBqQBpQBo14withoutOffsetsZQBv8readNextMFZv (this=0x7ffe8a36c900)
    at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:166

which looks like bgzf decompress and the GC sweep are interacting...

pjotrp commented 5 years ago

#13 at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/readrange.d:178

points at

            _alloc_buffer = uninitializedArray!(ubyte[])(max(size, 65536));

which triggers a GC sweep and segfaults.

The stack trace is informative and it looks like we have invalid pointers on the stack.

pjotrp commented 5 years ago

The debug version of sambamba renders

0x000000000070e814 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()
(gdb) bt
#0  0x000000000070e814 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()
#1  0x000000000059a2a5 in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQDxQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf10yieldForceMFNcNdNeZQCz (this=0x2828282828282828)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/parallelism.d:605
#2  0x0000000000599b0b in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQDxQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf6__dtorMFNfZv (this=0x2828282828282828)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/parallelism.d:747
#3  0x000000000074064c in object.TypeInfo_Struct.destroy(void*) const ()
#4  0x000000000074a1f7 in rt.lifetime.finalize_array(void*, ulong, const(TypeInfo_Struct)) ()
#5  0x000000000074b456 in rt.lifetime.finalize_array2(void*, ulong) ()
#6  0x000000000074b74a in rt_finalizeFromGC ()
#7  0x000000000076c798 in _D2gc4impl12conservativeQw3Gcx5sweepMFNbZm ()
#8  0x0000000000767ff0 in _D2gc4impl12conservativeQw3Gcx11fullcollectMFNbbZm ()
#9  0x0000000000769f1c in _D2gc4impl12conservativeQw3Gcx8bigAllocMFNbmKmkxC8TypeInfoZPv ()
#10 0x0000000000765123 in _D2gc4impl12conservativeQw3Gcx5allocMFNbmKmkxC8TypeInfoZPv ()
#11 0x0000000000765059 in _D2gc4impl12conservativeQw14ConservativeGC12mallocNoSyncMFNbmkKmxC8TypeInfoZPv ()
#12 0x0000000000764fa2 in _D2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCeQCeQCcQCnQBs12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQEgQEgQEeQEp10mallocTimelS_DQFiQFiQFgQFr10numMallocslTmTkTmTxQCzZQFcMFNbKmKkKmKxQDsZQDl ()
#13 0x00000000007651e6 in _D2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#14 0x0000000000768aa8 in _DThn16_2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkxC8TypeInfoZS4core6memory8BlkInfo_ ()
#15 0x000000000073d3a1 in gc_qalloc ()
#16 0x00000000007374ca in _D4core6memory2GC6qallocFNaNbmkxC8TypeInfoZSQBqQBo8BlkInfo_ ()
#17 0x00000000007499d7 in _D2rt8lifetime12__arrayAllocFNaNbmxC8TypeInfoxQlZS4core6memory8BlkInfo_ ()
#18 0x000000000074ab37 in _d_newarrayU ()
#19 0x000000000074abed in _d_newarrayT ()
#20 0x000000000051b07e in _D7contrib6undead6stream14BufferedStream6__ctorMFCQBwQBrQBn6StreammZCQCpQCkQCgQCc (this=0x7ffe8a37f380, source=0x7ffe8a30ce40,
    bufferSize=134217728) at /home/wrk/izip/git/opensource/D/sambamba/BioD/contrib/undead/stream.d:1628
#21 0x000000000054f772 in bio.std.hts.bam.reader.BamReader.getNativeEndianSourceStream() (this=0x7ffe8a312d00)
    at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/bam/reader.d:517
...

where the last line refers to another allocation

    return new BufferedStream(file, _buffer_size);

pjotrp commented 5 years ago

This may be what I am looking for:

#0  0x000000000070e814 in _D3std11parallelism8TaskPool16tryDeleteExecuteMFPSQBwQBv12AbstractTaskZv ()
#1  0x000000000059a2a5 in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQDxQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf10yieldForceMFNcNdNeZQCz (this=0x2026261b0d812821)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/parallelism.d:605
#2  0x0000000000599b0b in _D3std11parallelism__T4TaskS_D3bio4core4bgzf5block19decompressBgzfBlockFSQBrQBqQBoQBm9BgzfBlockCQCoQCn5utils7memoize__T5CacheTQCcTSQDxQDwQDuQDs21DecompressedBgzfBlockZQBwZQBpTQDzTQDgZQGf6__dtorMFNfZv (this=0x2026261b0d812821)
    at /gnu/store/lfj9sx1c98nj65vw8gmvz31sh3q8qhm6-ldc-1.16.0-beta2/include/d/std/parallelism.d:747
#3  0x000000000074064c in object.TypeInfo_Struct.destroy(void*) const ()
#4  0x000000000074a1f7 in rt.lifetime.finalize_array(void*, ulong, const(TypeInfo_Struct)) ()
#5  0x000000000074b456 in rt.lifetime.finalize_array2(void*, ulong) ()
#6  0x000000000074b74a in rt_finalizeFromGC ()

pjotrp commented 5 years ago

Just now, for the first time, sambamba ran without crashing :). The problem is an out-of-order execution.

pjotrp commented 5 years ago

I think the problem is with scopedTask, where a task is added to the threadpool using the stack rather than the heap. In particular this section

https://github.com/biod/BioD/blob/master/bio/core/bgzf/inputstream.d#L391

where a task gets created and pushed onto a roundbuf. When the garbage collector kicks in after reading the BAM file, it wants to destroy this object, but the object is in an inconsistent state (maybe the thread already got cleaned up, or it tries to clean up twice). I managed to prevent the segfault by disabling the garbage collector, but obviously that won't do.

The roundbuf is probably used to keep a task connected with the bgzf unpacking buffer. I am not sure why this is necessary. Also, I am not convinced a threadpool is that much of a benefit for bgzf unpacking, as the single-threaded routine I wrote last year is blazingly fast. I need to figure out what the best approach is...

pjotrp commented 5 years ago

Adding this code to the destructor which segfaults:

struct DecompressedBgzfBlock {
    ~this() {
      stderr.writeln("destroy DecompressedBgzfBlock ",start_offset,":",end_offset," ",decompressed_data.sizeof);
    };
    ulong start_offset;
    ulong end_offset;
    ubyte[] decompressed_data;
}

Running a decompress block typically reads

destroy DecompressedBgzfBlock 4945:5091 16
destroy DecompressedBgzfBlock 4616:4945 16
destroy DecompressedBgzfBlock 4287:4616 16
destroy DecompressedBgzfBlock 3958:4287 16
destroy DecompressedBgzfBlock 3629:3958 16
destroy DecompressedBgzfBlock 3300:3629 16
destroy DecompressedBgzfBlock 2971:3300 16
destroy DecompressedBgzfBlock 2642:2971 16
destroy DecompressedBgzfBlock 2313:2642 16
destroy DecompressedBgzfBlock 1984:2313 16
destroy DecompressedBgzfBlock 1655:1984 16
destroy DecompressedBgzfBlock 1326:1655 16
destroy DecompressedBgzfBlock 997:1326 16
destroy DecompressedBgzfBlock 668:997 16
destroy DecompressedBgzfBlock 339:668 16
destroy DecompressedBgzfBlock 0:339 16

but before a segfault we get

destroy DecompressedBgzfBlock 3184080310709005360:3467820302580068384 16
destroy DecompressedBgzfBlock 3184080310725782576:3467820302580068384 16
destroy DecompressedBgzfBlock 3467820298285101088:3467820319995076652 16
destroy DecompressedBgzfBlock 3184361785685716016:3539877896617996576 16
destroy DecompressedBgzfBlock 2318280822927401004:3184080310725782576 16
destroy DecompressedBgzfBlock 3539878068416698400:2318281922439028780 16
destroy DecompressedBgzfBlock 3467820302580068640:2318280822927401004 16
destroy DecompressedBgzfBlock 2318280822927401004:3184361785702493232 16
destroy DecompressedBgzfBlock 3467820298285101088:2318280822927401004 16
destroy DecompressedBgzfBlock 3467820298285101088:2318281922439028780 16
destroy DecompressedBgzfBlock 2318281922439028780:2318281922439360048 16
destroy DecompressedBgzfBlock 2318286380681145392:3184080310725782576 16
destroy DecompressedBgzfBlock 3467820298285101344:2318281922439028780 16
destroy DecompressedBgzfBlock 3539877892323029024:2318281922439094316 16
Program exited with code -11              

After making sure the start_offset is set to 0 it looks like these blocks have become invalid and the garbage collector still tries to clean them up.

pjotrp commented 5 years ago

Artem wrote:

Create a new task and put it on the roundbuffer using some magic

My conclusion is that that 'magic' no longer works. Creating a thread on the stack and moving it to the roundbuffer on the heap confuses the garbage collector, which is kinda unsurprising. With markdup I can't disable the GC cleanup, so it needs some surgery.

ekg commented 5 years ago

We are having the same problem. Is the only solution at present to downgrade to 0.6.6?

pjotrp commented 5 years ago

It is one of these things that take a few days of work to fix. It is on my list :/

ekg commented 5 years ago

It might save some users a bit of time if you made a clearly marked 0.6.6-stable binary release and dropped it at the top of the release page.

I know how much of a pain it can be to track down stuff like this. We just had a big problem due to the Spectre/Meltdown patches changing the way that multithreaded interleaved system calls work. I wonder if this problem is related.

pjotrp commented 5 years ago

Actually the latest binary release 0.7.0 works.

https://github.com/biod/sambamba/releases/tag/v0.7.0

The problem is with later versions of the D compiler.