GATB / bcalm

compacted de Bruijn graph construction in low memory
MIT License

Segfault #31

Closed Ritu-Kundu closed 5 years ago

Ritu-Kundu commented 6 years ago

The tool aborted with a segmentation fault. It was run on the high-coverage dataset "SRR1298980_1.fastq.gz" (paired-end with SRR1298980_2.fastq.gz) from the 1000 Genomes project. The details are below.

Command: ./bcalm -in SRR1298980 -kmer-size 41 -abundance-min 4

Back trace (using gdb):

#0 0x0000000000881db5 in gatb::core::tools::dp::IDispatcher::IteratorCommand<gatb::core::bank::Sequence, void gatb::core::debruijn::impl::bglue<64ul>(gatb::core::tools::storage::impl::Storage*, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, int, int, bool)::{lambda(gatb::core::bank::Sequence const&)#1}>::execute() ()

#1 0x0000000000792d35 in gatb::core::tools::dp::impl::Dispatcher::mainloop(void*) ()

#2 0x00007ffff7bc16ba in start_thread (arg=0x7fffd2639700) at pthread_create.c:333

#3 0x00007ffff6c3841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Output (from a previous run on the same file with the same parameters using nohup): output.txt attached

Many thanks & best wishes, Ritu

rchikhi commented 6 years ago

Hi Ritu, thanks for reporting this bug, I'm on it. Did it crash on the same machine on a smaller dataset?

Ritu-Kundu commented 6 years ago

Yes, the same machine. This dataset has worked with other parameters, such as k=51, abundance=4; k=41, abundance=2; etc.

Ritu-Kundu commented 6 years ago

Hi Rayan,

I gave it both _1 and _2 as input.

Best wishes, Ritu

On Mon, Mar 5, 2018 at 1:34 PM, Rayan Chikhi notifications@github.com wrote:

Hi Ritu, I ran Bcalm on the file SRR1298980_1.fastq, k=41, abundance min 4, and it gave me a different number of distinct k-mers than your output. Can you confirm that you gave it only the _1 file, or maybe both? Here are my filesizes (uncompressed):

$ du -B1 ../*.fastq
60621176832  ../SRR1298980_1.fastq
60621176832  ../SRR1298980_2.fastq


-- Best wishes, Ritu

rchikhi commented 6 years ago

Thanks (I actually deleted the original comment because I figured that I'd also test with _1 and _2). In fact, the test just finished and it completed without a crash. The same number of distinct k-mers was found as the one you report in the output, so we indeed have the same data.

1) Which version of Bcalm are you using? Is it 2.2.0?
2) Is the crash reproducible on your side (re-running with exactly the same parameters)?
3) Could you try with fewer threads (e.g. 24)?

Ritu-Kundu commented 6 years ago

Hi Rayan,

Apologies for the late response. I did try it several times; each time it ended with a segfault. I could not try restricting the number of threads, as I am not sure how to do that (I do not have su privileges on the machine).

Attaching the output of the recent run.

bcalm_segfault_SRR1298980_k41_a4.txt

The backtrace in gdb is as follows:

#0 0x0000000000881db5 in gatb::core::tools::dp::IDispatcher::IteratorCommand<gatb::core::bank::Sequence, void gatb::core::debruijn::impl::bglue<64ul>(gatb::core::tools::storage::impl::Storage*, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, int, int, bool)::{lambda(gatb::core::bank::Sequence const&)#1}>::execute() ()

#1 0x0000000000792d35 in gatb::core::tools::dp::impl::Dispatcher::mainloop(void*) ()

#2 0x00007ffff7bc16ba in start_thread (arg=0x7fffe2930700) at pthread_create.c:333

#3 0x00007ffff6c3841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

rchikhi commented 6 years ago

Hi Ritu, thanks for following up. To limit the number of threads, just run bcalm with -nb-cores 10 (for example, to limit it to 10). But given that the problem happened several times for you, I am not too hopeful that this tip will solve it.. EDIT: the original message contained a mistake in the parameter name; it's -nb-cores, not -nb-threads.

Ritu-Kundu commented 6 years ago

Hi Rayan,

It says: Unknown parameter '-nb-threads'

I was running it with -nb-cores 12 and then serially; both types of runs resulted in a segfault. On the other hand, the tool runs perfectly on the same files with other parameters.

rchikhi commented 6 years ago

Hi Ritu,

Oops, it was indeed -nb-cores. Hmm.. so this isn't a threading problem, and I still could not reproduce it.. One last thing: can we make sure we have the exact same files?

$ md5sum *.gz
1afcb859413c50817e5bb35f2ba54214  SRR1298980_1.fastq.gz
23536204b36c4ddc09c8c589d6697f38  SRR1298980_2.fastq.gz

Ritu-Kundu commented 6 years ago

Hi Rayan,

My files seem to be different.

d6b2931eb5db6543ab4697e12ce67062  SRR1298980_1.fastq.gz
c56db0039e79bf62940feadc899e076d  SRR1298980_2.fastq.gz

Let me download them again and do a re-run. I will update you with the results.

Malfoy commented 6 years ago

Hi, just a quick remark: I have pushed a correction into bcalm that removes a possible segfault in some rare cases. It could possibly solve the problem you are facing.

Antoine Limasset

Ritu-Kundu commented 6 years ago

Hi Antoine,

Thanks for the update. I will try the patched version with the same files first, and if it doesn't work, I will try the tool on re-downloaded files. I will be back with an update.

Ritu-Kundu commented 6 years ago

Hi Rayan and Antoine,

The issue still persists; I have used the updated version of the tool. To summarise the complete procedure, these are the commands that were run:

Downloaded the files using:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/000/SRR1298980/SRR1298980_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR129/000/SRR1298980/SRR1298980_2.fastq.gz

Their md5sums are:

d6b2931eb5db6543ab4697e12ce67062  SRR1298980_1.fastq.gz
c56db0039e79bf62940feadc899e076d  SRR1298980_2.fastq.gz

To run the tool on these paired-end files, created the SRR1298980 list file using:

ls -1 *.fastq.gz > SRR1298980

Finally, ran the tool with:

./tools/bcalm/build/bcalm -in SRR1298980 -kmer-size 41 -abundance-min 4

It ended with a segmentation fault. The output is attached.

serial.txt

rchikhi commented 6 years ago

Dear Ritu, I tried again on a different, fresh machine. It also didn't crash. In light of another user having issues on specific systems, I'd like to ask you:

1) Could you please try compiling bcalm on a different machine and attempt to reproduce the bug there?
2) What is the output of the cmake .. command (it will show the compiler version), as well as of uname -ar?

Best, Rayan

rchikhi commented 6 years ago

By the way, I found out why we don't have the same md5 values. fastq-dump from SRA changes the file header compared to downloading it directly from ftp:

@SRR1298980.1 1/1

vs

@SRR1298980.1 1 length=250

Ritu-Kundu commented 6 years ago

Hi Rayan,

Apologies for the late reply. I tried running the tool on a different machine (running Arch Linux) with the same files and parameters. It worked perfectly fine.

1. Output of uname -ar:

   Linux ----- 4.10.0-38-generic #42~16.04.1-Ubuntu SMP Tue Oct 10 16:32:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

2. The result of cmake .. is attached: cmake_result.txt

Thanks for explaining the reason for the different md5sums of the files. As it is difficult to reproduce the problem, I am closing the issue. Nevertheless, thanks for your responses.

rchikhi commented 6 years ago

Hmm, glad to hear it worked on a different machine. I'll keep an eye out in case that bug appears again. For my own reference: this is a segfault that occurs right after "UF MPHF constructed".

leoisl commented 5 years ago

Hello there!

I am porting KisSplice to use BCALM2; it seems a lot faster, thanks a lot for this implementation! However, in a large test (16 read files) I hit exactly the same bug. I'm willing to debug this, as BCALM2 can make KisSplice's pipeline much faster. I'll first try a few ideas about what I think the bug could be, since it seems to be specific to some machines, and keep you updated. I guess this probably won't work out, in which case I can compile and run with debug flags and maybe provide you a tmate session. Tell me if you're interested!

Thanks! Regards.

rchikhi commented 5 years ago

Hi Leandro,

Yes that'd be great. Can you reproduce the problem somewhat consistently?

Rayan

leoisl commented 5 years ago

Hello Rayan,

Yeah, I can, but it takes time. I'd like to find an MWE that reproduces the bug, since my input (16 read files) and Ritu's input (2 heavy read files) are heavy and take some time to test. Running the executable with debug flags will probably take a very long time... We'll see!

rchikhi commented 5 years ago

Usually running with gdb and debug flags isn't that bad, just a small slowdown compared to running in release mode. But it'd be quite nice to get a backtrace with line numbers.. Indeed, the best would be an MWE obtained by downsampling the dataset.

leoisl commented 5 years ago

Well, I'm on it, but the slowdown seems significant...

[DSK: Pass 1/2, Step 2: counting kmers   ]  26.1 %   elapsed: 6111 min 42 sec   remaining: 17321 min 22 sec   cpu: 1899.8 %   mem: [6370, 6370, 6370] MB

However, I launched gdb on the whole dataset (16 read files). I was really hoping to reproduce the bug with a downsample in the meantime and relaunch gdb on that... weirdly enough, bcalm worked with 1, 2, 4, 8, 12 and even 15 read files. Only with all 16 read files does it crash. I don't think it is the number of read files, as I think this bug is the same as Ritu's, who had 2 read files. Bcalm also worked when run on only the last read file, so it is probably not due to unexpected read bases/format in the input.

This is the command line: bcalm -in all_read_filenames -kmer-size 41 -abundance-min 2 -nb-cores 6 -out bcalm_out. Also tried with -no-mphf, no success. This is the output:

...
Sum of CPU times for bucket compactions: 5761.0 secs
BCALM total wallclock (excl kmer counting): 1607.6 secs
Maximum number of kmers in a subgraph: 443898
Performance of compaction step:

                 Wallclock time spent in parallel section : 860.9 secs
             Best theoretical speedup in parallel section : 56.6x
Best theor. speedup in parallel section using 1 threads : 1.0x
             Sum of longest bucket compaction for each sb : 156.4 secs
                       Sum of best scheduling for each sb : 5014.9 secs
Done with all compactions                       09:48:45     memory [current, maxRSS]: [3614, 7925] MB
bglue_algo params, prefix:bcalm_out.unitigs.fa k:41 threads:6
Starting bglue with 6 threads                   09:48:45     memory [current, maxRSS]: [3222, 7925] MB
number of sequences to be glued: 203231274        09:48:45     memory [current, maxRSS]: [3222, 7925] MB
71210596 marked kmers, 192803850 unmarked kmers        09:51:06     memory [current, maxRSS]: [3766, 7925] MB
created vector of hashes, size approx 543 MB)        09:51:06     memory [current, maxRSS]: [3766, 7925] MB
pass 1/3, 35604286 unique hashes written to disk, size 271 MB        09:51:13     memory [current, maxRSS]: [3222, 7925] MB
71221960 marked kmers, 192803850 unmarked kmers        09:53:39     memory [current, maxRSS]: [3766, 7925] MB
created vector of hashes, size approx 543 MB)        09:53:39     memory [current, maxRSS]: [3766, 7925] MB
pass 2/3, 35610015 unique hashes written to disk, size 271 MB        09:53:47     memory [current, maxRSS]: [3222, 7925] MB
71226142 marked kmers, 192803850 unmarked kmers        09:56:07     memory [current, maxRSS]: [3766, 7925] MB
created vector of hashes, size approx 543 MB)        09:56:07     memory [current, maxRSS]: [3766, 7925] MB
pass 3/3, 35612115 unique hashes written to disk, size 271 MB        09:56:12     memory [current, maxRSS]: [3222, 7925] MB
loaded all unique UF elements (106826416) into a single vector of size 815 MB        09:56:13     memory [current, maxRSS]: [4037, 7925] MB
[Building BooPHF]  100  %   elapsed:   0 min 32 sec   remaining:   0 min 0  sec
Bitarray       503178624  bits (100.00 %)   (array + ranks )
final hash             0  bits (0.00 %) (nb in final hash 0)
UF MPHF constructed (59 MB)                     09:56:45     memory [current, maxRSS]: [3222, 7925] MB
Command terminated by signal 11

Such a cryptic bug... any idea of what it could be and what I should try?

Otherwise, let's just wait for gdb to finish and then we'll know exactly what it is...

rchikhi commented 5 years ago

Ouch.. so we will need to wait around 10 days until we get a stack trace with line numbers.. I'm ready to wait for that. I recall looking for that bug earlier but not immediately finding it.

leoisl commented 5 years ago

It is going faster than expected I guess, already >50% of DSK pass 2:

[DSK: Pass 2/2, Step 1: partitioning     ]  58   %   elapsed: 2088 min 26 sec   remaining: 1512 min 18 sec   cpu: 1998.7 %
[DSK: Pass 2/2, Step 1: partitioning     ]  59   %   elapsed: 2225 min 3  sec   remaining: 1546 min 13 sec   cpu: 1998.7 %   mem: [2254, 2254, 6868] MB

If it helps you, I have the complete output and error log of a previous run without debug flags:

Error:

[Approximating frequencies of minimizers ]  100  %   elapsed:   2 min 46 sec   remaining:   0 min 0  sec   cpu: 100.0 %   mem: [  30,   30,   30] MB
[DSK: nb solid kmers found : 1016313566  ]  134  %   elapsed: 363 min 51 sec   remaining:   0 min 0  sec   cpu: 273.5 %   mem: [3849, 8838, 8838] MB
[Iterating DSK partitions                ]  100  %   elapsed:  17 min 15 sec   remaining:   0 min 0  sec
[Building BooPHF]  100  %   elapsed:   0 min 17 sec   remaining:   0 min 0  sec

Output: out.txt

Otherwise, let's just wait!

Thanks!

rchikhi commented 5 years ago

Thanks. The output is the same as before: it crashed right after the MPHF is constructed.

rchikhi commented 5 years ago

Update: Alex Di Genova found a smaller dataset where the crash also occurs:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/002/ERR2173372/ERR2173372_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR217/002/ERR2173372/ERR2173372_2.fastq.gz

It's going to be easier to get a stack trace.

leoisl commented 5 years ago

Ok, great! Got confused by your edit :p ... are you debugging it or can I do it?

rchikhi commented 5 years ago

I'll debug. If you have a stacktrace on your run (the one that took a long time) please still send it, to confirm that it is in fact the same bug

leoisl commented 5 years ago

OK, but I don't really know how much time it will take, since there seems to be a minor bug with the progress log:

[DSK: Pass 2/2, Step 2: counting kmers   ]  123  %   elapsed: 9382 min 58 sec   remaining:   0 min 0  sec   cpu: 1518.2 %   mem: [7465, 7468, 7473] MB

rchikhi commented 5 years ago

Yeah, it may exceed 100%, as the total number of k-mers is impossible to estimate accurately at first (and our estimate is quite inaccurate :)
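
As a trivial illustration of that arithmetic (the up-front estimate below is a made-up number; the counted figure is the "nb solid kmers found" value from the log earlier in this thread):

#include <cstdio>

int main() {
    // Hypothetical up-front estimate of the total k-mer count; the counted
    // value is the "nb solid kmers found: 1016313566" line from the log above.
    const double estimated_total = 760e6;
    const double counted         = 1016313566.0;
    std::printf("progress: %.0f %%\n", 100.0 * counted / estimated_total);   // ~134 %
    return 0;
}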

rchikhi commented 5 years ago

I've got a stack trace!

[..]
#5  0x00005555564b85cb in gatb::core::debruijn::impl::bglue<64ul>(gatb::core::tools::storage::impl::Storage*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, int, bool)::{lambda(gatb::core::bank::Sequence const&)#1}::operator()(gatb::core::bank::Sequence const&) const (this=0x5555599a4a60, sequence=...)
    at /home/rayan/bcalm/gatb-core/gatb-core/src/gatb/bcalm2/bglue_algo.cpp:821
(gdb) p v2
$5 = 4294967295 [ie. -1]
[..]

This is an MPHF query reporting that the queried k-mer is absent, even though it's not supposed to be absent. I will investigate further.
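
For readers following along, here is a minimal standalone C++ sketch of that failure mode. ToyUnionFind and mphf_lookup are illustrative stand-ins, not the gatb-core classes: the point is only that an unchecked "absent" answer from the MPHF collapses to (uint32_t)-1 = 4294967295, and using that value as an index into the union-find array is an out-of-bounds access.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative stand-in for a union-find over MPHF indices (not the real
// gatb-core unionFind class): parent() reads data[id] with no bounds check,
// much like the atomic load at the top of the crash backtrace.
struct ToyUnionFind {
    std::vector<std::atomic<std::uint64_t>> data;
    explicit ToyUnionFind(std::size_t n) : data(n) {
        for (std::size_t i = 0; i < n; ++i) data[i] = i;
    }
    std::uint64_t parent(std::uint32_t id) { return data[id].load(); }
};

// Hypothetical MPHF query: a key that is not in the index yields a
// "not found" sentinel, which truncated to 32 bits is 4294967295.
std::uint32_t mphf_lookup(bool key_is_present) {
    return key_is_present ? 42u : static_cast<std::uint32_t>(-1);
}

int main() {
    ToyUnionFind uf(1000);
    std::uint32_t v2 = mphf_lookup(/*key_is_present=*/false);   // v2 == 4294967295
    if (v2 >= uf.data.size()) {
        // This is the guard that is effectively missing in the crashing path:
        // without it, uf.parent(v2) dereferences far outside the array.
        std::fprintf(stderr, "MPHF reports the k-mer as absent (v2 = %u)\n", v2);
        return 1;
    }
    std::printf("parent = %llu\n", (unsigned long long) uf.parent(v2));
    return 0;
}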

leoisl commented 5 years ago

Cool!

I will check whether it is the same issue once I get my stack trace. Looking forward to the fix!!

leoisl commented 5 years ago

Ha! Got the stack trace... it seems to be the same error:

Thread 72528 "bcalm" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff33fd700 (LWP 45205)]
0x000055555641a8c9 in std::__atomic_base<unsigned long>::load (__m=std::memory_order_seq_cst, this=0x8007210fa008) at /usr/include/c++/7/bits/atomic_base.h:396
396             return __atomic_load_n(&_M_i, __m);
(gdb) bt
#0  0x000055555641a8c9 in std::__atomic_base<unsigned long>::load (__m=std::memory_order_seq_cst, this=0x8007210fa008) at /usr/include/c++/7/bits/atomic_base.h:396
#1  std::__atomic_base<unsigned long>::operator unsigned long (this=0x8007210fa008) at /usr/include/c++/7/bits/atomic_base.h:259
#2  0x000055555640d57b in unionFind::parent (this=0x7fffffff9590, id=4294967295)
    at /newdisk/leandro/kissplice_3_0_0/kissplice/bcalm/gatb-core/gatb-core/src/gatb/bcalm2/unionFind.hpp:106
#3  0x000055555640d0d0 in unionFind::find (this=0x7fffffff9590, id=4294967295)
    at /newdisk/leandro/kissplice_3_0_0/kissplice/bcalm/gatb-core/gatb-core/src/gatb/bcalm2/unionFind.hpp:35
#4  0x000055555640d223 in unionFind::union_ (this=0x7fffffff9590, id1=4294967295, id2=4294967295)
    at /newdisk/leandro/kissplice_3_0_0/kissplice/bcalm/gatb-core/gatb-core/src/gatb/bcalm2/unionFind.hpp:61
#5  0x000055555649b7d6 in void gatb::core::debruijn::impl::bglue<64ul>(gatb::core::tools::storage::impl::Storage*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, int, bool)::{lambda(gatb::core::bank::Sequence const&)#1}::operator()(gatb::core::bank::Sequence const&) const (
    __closure=0x555557340d80, sequence=...) at /newdisk/leandro/kissplice_3_0_0/kissplice/bcalm/gatb-core/gatb-core/src/gatb/bcalm2/bglue_algo.cpp:821
#6  0x00005555564b606a in gatb::core::tools::dp::IDispatcher::IteratorCommand<gatb::core::bank::Sequence, void gatb::core::debruijn::impl::bglue<64ul>(gatb::core::tools::storage::impl::Storage*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, int, bool)::{lambda(gatb::core::bank::Sequence const&)#1}>::execute() (this=0x55555a66f3d0) at /newdisk/leandro/kissplice_3_0_0/kissplice/bcalm/gatb-core/gatb-core/src/gatb/tools/designpattern/api/ICommand.hpp:324
#7  0x00005555563b6921 in gatb::core::tools::dp::impl::CommandStartSynchro::execute (this=0x55555a81bbd0)
    at /newdisk/leandro/kissplice_3_0_0/kissplice/bcalm/gatb-core/gatb-core/src/gatb/tools/designpattern/impl/Command.cpp:52
#8  0x00005555563b63a4 in gatb::core::tools::dp::impl::Dispatcher::mainloop (data=0x55555a81bc00)
    at /newdisk/leandro/kissplice_3_0_0/kissplice/bcalm/gatb-core/gatb-core/src/gatb/tools/designpattern/impl/Command.cpp:219
#9  0x00007ffff7f8efa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#10 0x00007ffff758388f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) frame 5
#5  0x000055555649b7d6 in void gatb::core::debruijn::impl::bglue<64ul>(gatb::core::tools::storage::impl::Storage*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, int, bool)::{lambda(gatb::core::bank::Sequence const&)#1}::operator()(gatb::core::bank::Sequence const&) const (
    __closure=0x555557340d80, sequence=...) at /newdisk/leandro/kissplice_3_0_0/kissplice/bcalm/gatb-core/gatb-core/src/gatb/bcalm2/bglue_algo.cpp:821
821             ufkmers.union_(v1,v2);
(gdb) p v1
$1 = 4294967295
(gdb) p v2
$2 = 4294967295

rchikhi commented 5 years ago

Excellent, thanks for the confirmation!

rchikhi commented 5 years ago

Found the root cause: it's a k-mer whose hash value is 0, so quite rare. I'll find a way to fix it.
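
For readers wondering how a hash value of 0 can break anything: one plausible mechanism, given here purely as an assumption for illustration and not as BCALM's actual code, is a table that reuses 0 as an "empty slot" sentinel. A k-mer whose hash is legitimately 0 is then indistinguishable from an empty slot and silently disappears, which would explain the later "absent" MPHF answer for a k-mer that should be present.

#include <cstdint>
#include <iostream>
#include <vector>

// Assumption for illustration only (not BCALM's actual data structure):
// a flat table that marks empty slots with 0.
constexpr std::uint64_t EMPTY = 0;   // hypothetical sentinel value

void insert(std::vector<std::uint64_t>& slots, std::uint64_t hash) {
    for (auto& slot : slots) {
        if (slot == EMPTY) { slot = hash; return; }   // storing hash == 0 is a silent no-op
    }
}

bool contains(const std::vector<std::uint64_t>& slots, std::uint64_t hash) {
    for (std::uint64_t slot : slots) {
        if (slot != EMPTY && slot == hash) return true;   // an EMPTY slot can never match
    }
    return false;
}

int main() {
    std::vector<std::uint64_t> slots(8, EMPTY);
    insert(slots, 0);   // the rare k-mer whose hash value is 0
    std::cout << std::boolalpha << contains(slots, 0) << "\n";   // prints false: the k-mer was lost
    return 0;
}

Whatever the exact mechanism in bglue, the symptom matches the traces above: a k-mer that should be known comes back as absent.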

leoisl commented 5 years ago

oh, pardon my ignorance, but that leaves me even more confused...

From what I see in the paper, there seems to be no problem with k-mers hashing to 0 in any of the internal hash functions of the MPHF. I was wondering whether the problem was that the MPHF hashes to [1, N] (as stated in the paper), so a k-mer hashing to 0 could be an issue... but I think the [1, N] interval is just nicer to present in the paper, and the real MPHF implementation hashes to [0, N-1]... And I could indeed confirm that graph.nodeMPHFIndex() can return 0 without any issue...

Don't worry about explaining the bug to me if a comprehensive explanation is far more detailed and complicated; just point me to your commit when you solve it and I will try to understand it from the diff!

thanks a bunch!

rchikhi commented 5 years ago

Sorry @leoisl, I didn't want to confuse you. My previous message was just to keep you updated.. you didn't need to do anything. I've committed a tentative fix, could you please test it? (no need to compile in debug mode)

leoisl commented 5 years ago

Hello Rayan!

Yeah, I understood that your previous message was just to keep me updated, but I also got interested in the cause of the bug :p ... and I understand your message better now that I've seen your commits!

Testing, thanks!

leoisl commented 5 years ago

Hello Rayan!

It works, thanks a lot!

For me, the issue is solved!

Thanks again!

leoisl commented 5 years ago

Hello, I can't close this issue, but it seems solved!