biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Merge sometimes hangs and never merges #429

Closed: calizilla closed this issue 4 years ago

calizilla commented 4 years ago

The problem: I am trying to merge 3367 contig-level BAMs, recalibrated with GATK BQSR, into a final BAM per sample. I have used this method often, across multiple datasets and clusters. Recently I encountered a dataset where none of the 48 samples (~30-60X) would merge with sambamba: the job would run indefinitely, using very little CPU, and never produce any output or error messages. I resorted to merging these BAMs with GATK GatherBamFiles, which worked perfectly. I prefer to use sambamba because it is a) faster and b) produces smaller BAM files (on the order of 100 GB smaller for the current dataset).

For the current dataset, I have merged 79 of 80 samples successfully with sambamba within a parallel job. They were all processed with exactly the same pipeline and are of similar coverage (60X). One of the samples would not merge; see the stack trace below. I resubmitted this sample as a single task and the same thing occurred: the job runs but does nothing.
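For reference, the per-sample merge step described here looks roughly like the sketch below: a minimal Python driver around `sambamba merge`, with `gatk GatherBamFiles` as the fallback. The file naming, thread count, input ordering, and exact flag spellings are assumptions for illustration, not the commands actually used.

```python
#!/usr/bin/env python3
"""Rough sketch of the per-sample merge step described above.

Assumes one recalibrated BAM per contig-level interval, named
<sample>.<contig>.recal.bam. Paths, thread count and flag spellings
are illustrative only; check them against the local tool versions.
"""
import glob
import subprocess


def merge_sample(sample: str, threads: int = 8, use_gatk: bool = False) -> None:
    # ~3367 contig-level BAMs per sample in the dataset described above.
    # NOTE: lexical sort is a placeholder; GatherBamFiles expects the inputs
    # in reference-dictionary order.
    inputs = sorted(glob.glob(f"{sample}.*.recal.bam"))
    output = f"{sample}.merged.bam"
    if use_gatk:
        # Fallback that completed reliably: GATK/Picard GatherBamFiles
        cmd = ["gatk", "GatherBamFiles", "-O", output]
        for bam in inputs:
            cmd += ["-I", bam]
    else:
        # Preferred path: sambamba merge (faster, smaller output BAM)
        cmd = ["sambamba", "merge", "-t", str(threads), output] + inputs
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    merge_sample("sample01")
```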

The process has 25 threads, but all except the first are waiting on a condition variable, e.g.:

Thread 25 (LWP 92301):
#0  0x00000000005c8d8b in pthread_cond_wait ()
#1  0x0000000000609de0 in core.sync.condition.Condition.wait() ()
#2  0x00000000005ef7c6 in _D3std11parallelism8TaskPool3popMFZPSQBjQBi12AbstractTask ()
#3  0x00000000005ef70f in std.parallelism.TaskPool.startWorkLoop() ()
#4  0x000000000060a87f in thread_entryPoint ()
#5  0x00000000005c5f39 in start_thread ()
#6  0x000000000068fb3b in clone ()

The first thread has this stack trace:

#0  0x00000000004a7d72 in _D3std9algorithm9searching__T4findVAyaa6_61203d3d2062TSQCbQCa9iterationT9MapResultSQDf10functionalT8unaryFunVQDaa28_7475706c6528615b305d2e6964656e7469666965722c20615b315d29VQFma1_61ZQDfTASQHi8typeconsT5TupleTS3bioQIj3hts3sam6header9mixin326PgLineTmZQByZQHfTSQKeQCw__TQCqTQJkTmZQDaZQKaFNaNbNiNfQJnMQBoZQJv (needle=..., haystack=...) at /gnu/store/gknikj4p96npmaf76wl91l39s94cslpp-ldc-1.10.0/include/d/std/typecons.d:763
#1  _D3std9algorithm9searchingT7canFindZTQmTSQBrQBq9iterationT9MapResultSQCv10functionalT8unaryFunVAyaa28_7475706c6528615b305d2e6964656e7469666965722c20615b315d29VQCma1_61ZQDfTASQGy8typeconsT5TupleTS3bioQHz3hts3sam6header9mixin326PgLineTmZQByZQHfTSQJuQCw__TQCqTQGkTmZQDaZQJqFNaNbNiNfQJnMQBoZb (needle=..., haystack=...) at /gnu/store/gknikj4p96npmaf76wl91l39s94cslpp-ldc-1.10.0/include/d/std/algorithm/searching.d:2504
#2  _D3bio3std3hts5utils15samheadermerger15SamHeaderMerger19mergeProgramRecordsMFZ9lambda1MFNaNbNiNfSQDp8typeconsT5TupleTSQEqQEpQEo3sam6header9__mixin326PgLineTmZQBwZb (a=...) at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/utils/samheadermerger.d:272
#3  _D3std9algorithm7sorting__T9partitionS_D3bioQBq3hts5utils15samheadermerger15SamHeaderMerger19mergeProgramRecordsMFZ9lambda1MFNaNbNiNfSQFe8typeconsT5TupleTSQEpQGeQEo3sam6header9__mixin326PgLineTmZQBwZbVEQHwQHv8mutation12SwapStrategyi0TAQEaZQIiMFNaNbNiNfQsZQv (r=...) at /gnu/store/gknikj4p96npmaf76wl91l39s94cslpp-ldc-1.10.0/include/d/std/algorithm/sorting.d:460
#4  0x00000000004a6a0f in bio.std.hts.utils.samheadermerger.SamHeaderMerger.mergeProgramRecords() (this=0x1512d9e4c600) at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/utils/samheadermerger.d:271
#5  0x000000000046e08b in _D3bio3std3hts5utils15samheadermerger15SamHeaderMerger6__ctorMFACQClQCkQCj3sam6header9SamHeaderbZCQDsQDrQDqQDpQDmQCy (this=0x1512d9e4c600, headers=..., validate_headers=) at /home/wrk/izip/git/opensource/D/sambamba/BioD/bio/std/hts/utils/samheadermerger.d:83
#6  0x000000000046d086 in sambamba.merge.merge_main(immutable(char)[][]) (args=...) at /home/wrk/izip/git/opensource/D/sambamba/sambamba/merge.d:347
#7  0x0000000000614160 in rt.dmain2._d_run_main(int, char**, extern(C) int(char[][]) function).runAll() ()
#8  0x0000000000614056 in _d_run_main ()
#9  0x0000000000633790 in __libc_start_main ()
#10 0x0000000000400c9a in _start ()

It was doing stuff (the stack trace changed every now and then), but as to what it was actually doing I don't know, as no output was being written.

Thanks for your time, Cali

pjotrp commented 4 years ago

@calizilla you are certainly pushing the tools! I don't know what you can share, but if I can have a go at reproducing the problem I can try debugging it. Otherwise it is virtually impossible to debug multithreaded programs. See for example https://thebird.nl/blog/work/rotate.html#org6ad44f1

calizilla commented 4 years ago

@pjotrp I am trying to get the most out of the cluster :-) Unfortunately it is human clinical data, so I can't share it. If I come across a shareable sample that is subject to the same issue, I will share it. I find it most bizarre that the issue occurs for some samples and not others, and that coverage and data quality have nothing to do with it.

If I run GATK ApplyBQSR over fewer, evenly sized scattered intervals (e.g. 96 is a good size), sambamba will happily merge. But the issue then is that sambamba merges the split BAMs without discarding the read records that are replicated in two abutting intervals: reads that overlap the junction between split intervals are printed by GATK into each interval's BAM. Sorry, I realise this is separate from the original ticket, but it would help me solve the problem. If I scatter the intervals without splitting contigs, to avoid the replicated-reads issue, the level of parallelisation possible goes down and walltime goes up, so it would be great if sambamba could have an option to discard these reads. Note that I am using the term 'replicate' read rather than 'duplicate' read, to avoid confusion with PCR/optical duplicates. Thanks for your time
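As a stopgap until such an option exists, the junction replicates could in principle be removed after the merge, since the repeated records are identical alignments of the same read at the same coordinate. Below is a rough pysam sketch of that idea; the record signature, file names, and the assumption of a coordinate-sorted input are illustrative only, and none of this is existing sambamba functionality.

```python
#!/usr/bin/env python3
"""Drop read records replicated across abutting scatter intervals.

Sketch only: assumes a coordinate-sorted merged BAM in which the
replicated records are identical alignments (same name, flag,
position and CIGAR), so the copies share a coordinate and can be
de-replicated with a per-position signature set.
"""
import pysam


def drop_junction_replicates(in_path: str, out_path: str) -> None:
    with pysam.AlignmentFile(in_path, "rb") as src, \
         pysam.AlignmentFile(out_path, "wb", template=src) as dst:
        current_pos = None
        seen = set()
        for read in src:
            key = (read.query_name, read.flag, read.reference_id,
                   read.reference_start, read.cigarstring)
            pos = (read.reference_id, read.reference_start)
            if pos != current_pos:
                # New coordinate: earlier signatures can no longer collide.
                current_pos = pos
                seen = set()
            if key in seen:
                continue  # second copy emitted for the neighbouring interval
            seen.add(key)
            dst.write(read)


if __name__ == "__main__":
    drop_junction_replicates("sample01.merged.bam", "sample01.norep.bam")
```

One caveat with this sketch: the block of unmapped reads at the end of a coordinate-sorted BAM all share the same (unset) position, so the signature set grows over that block rather than being reset per coordinate.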

pjotrp commented 4 years ago

Yes, do send me something when it is replicated.