Kingsford-Group / splitsbt

GNU General Public License v3.0

Segfault mid-build for certain sample sets #8

Open Phelimb opened 6 years ago

Phelimb commented 6 years ago

I've been benchmarking SSBT (commit ebd4e0bfecc966225e0298f8d25f7ef9c6c57422) and found that it errors on some sets of samples, but not on supersets of the same samples.

e.g. here is the tail of the build log for 300 microbial samples.

At node: /ssd0/benchmark/results/split-split-rec/batches/300/blooms/ERR084147.sim.bf.bv
Saving BF to /ssd0/benchmark/results/split-split-rec/batches/300/blooms/ERR1519879_union.sim.bf.bv
Saved with size: 456905777 456905777
Inserting leaf /ssd0/benchmark/results/split-split-rec/batches/300/blooms/ERR1541864.sim.bf.bv ...
At node: /ssd0/benchmark/results/split-split-rec/batches/300/blooms/DRR018016_union.sim.bf.bv
Load Size: 456905777 456905777
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

However, a set of 400 samples, which includes the same 300 as above plus 100 new samples, builds successfully.

The same samples built with SBT, or with SSBT but with different Bloom filter parameters, also build successfully.

The same is also observed with 800 samples (so it's not a one-off at N=300):

Inserting leaf /ssd0/benchmark/results/split-split-rec/batches/800/blooms/ERR039622.sim.bf.bv ...
At node: /ssd0/benchmark/results/split-split-rec/batches/800/blooms/DRR013329_union.sim.bf.bv
Load Size: 869917422 869917422
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
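
For reference, the terminate message is an uncaught std::bad_alloc, i.e. a failed allocation, rather than an out-of-bounds access. A minimal stand-alone sketch of that failure mode (hypothetical code, not the splitsbt loader): reading a .bf.bv file into a freshly allocated buffer throws std::bad_alloc when the process cannot obtain that much memory.

```cpp
// Hypothetical reproduction of the failure mode above, not the splitsbt
// loader itself: read a .bf.bv file into a freshly allocated buffer and
// report how large the allocation was if it fails.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <new>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <filter.bf.bv>\n";
        return 1;
    }
    std::ifstream in(argv[1], std::ios::binary | std::ios::ate);
    if (!in) {
        std::cerr << "cannot open " << argv[1] << "\n";
        return 1;
    }
    const std::uint64_t size = static_cast<std::uint64_t>(in.tellg());
    in.seekg(0);
    try {
        // If the process is near its memory limit (ulimit, cgroup, or the
        // node's physical RAM), this allocation throws std::bad_alloc --
        // the same exception seen in the build log -- rather than segfaulting.
        std::vector<char> buf(size);
        in.read(buf.data(), static_cast<std::streamsize>(size));
        std::cout << "loaded " << size << " bytes from " << argv[1] << "\n";
    } catch (const std::bad_alloc&) {
        std::cerr << "std::bad_alloc while allocating " << size
                  << " bytes for " << argv[1] << "\n";
        return 1;
    }
    return 0;
}
```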
Bradsol commented 6 years ago

Can you send me the entire log file?

Phelimb commented 6 years ago

300_samples_stderr.txt

Phelimb commented 6 years ago

Here's a successful output for contrast: 400_samples_stderr.txt

Bradsol commented 6 years ago

I've seen something like this before in my own runs under a multi-threaded Slurm job manager. It occurred when only some of the threads had access to the disk in question (a larger flaw in the cluster I was working on). The fact that this appears to be fairly stochastic seems more in line with a hardware problem than a code issue.

My two questions: 1) If you rerun the same set that failed, does it fail at exactly the same place? 2) Are you using a job manager to process these jobs, and do the failed jobs run on particular nodes?

If rerunning it on a different node works, or there is a systematic problem on the hardware side, let me know. I'm going to look through the code on the assumption that it's a coding issue until I hear back.
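
One way to make question 1 actionable is a small diagnostic wrapper around whatever call loads a node's bit vector (the names below are illustrative, not splitsbt's API), so a rerun reports which file failed and how much physical memory the node still had at that moment:

```cpp
// Hypothetical diagnostic wrapper, not splitsbt's own code: wrap the load
// call so a failure reports the file and the node's remaining free memory.
#include <iostream>
#include <new>
#include <string>
#include <unistd.h>   // sysconf; _SC_AVPHYS_PAGES is a Linux/glibc extension

template <typename LoadFn>
void load_with_report(const std::string& bv_file, LoadFn&& load) {
    try {
        load(bv_file);
    } catch (const std::bad_alloc&) {
        const long pages = sysconf(_SC_AVPHYS_PAGES);
        const long page_size = sysconf(_SC_PAGESIZE);
        std::cerr << "std::bad_alloc while loading " << bv_file
                  << "; available physical memory: "
                  << (pages > 0 && page_size > 0 ? pages * page_size : -1)
                  << " bytes" << std::endl;
        throw;  // rethrow so the build still aborts, but with context
    }
}
```

Called as `load_with_report(bv_file, [&](const std::string& f) { /* existing load call */ });`, a rerun that fails on the same file with plenty of free memory would point at the code; failures on varying files with free memory near zero would point at the node.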