algbio / themisto

Space-efficient pseudoalignment with a colored de Bruijn graph
GNU General Public License v2.0
50 stars 4 forks source link

stochastic bug causes themisto build to hang indefinitely (themisto_linux-v3.2.1) #40

Open rohanmaddamsetti opened 1 month ago

rohanmaddamsetti commented 1 month ago

Hello,

This bug only happens in ~1-10 out of ~5000 themisto build runs. I am running themisto on ~4500 genomes, calling themisto build on each of these genomes separately, using HPC to schedule them in parallel. Sometimes I see ~5 runs hang indefinitely, most recently I saw 1 run hang indefinitely. If I re-run a failed run, themisto finishes normally. So this bug does not seem to be caused by the specific data, and it occurs in ~ 0.02% of themisto build runs.

Pure speculation on my part, but perhaps caused by some kind of rare race condition?

Here is how I am calling themisto on HPC:

sbatch -p scavenger --mem=2G --cpus-per-task=4 --wrap="themisto build -k 31 -i ../results/themisto_replicon_references/GCF_017165095.1_ASM1716509v1_genomic/GCF_017165095.1_ASM1716509v1_genomic.txt --index-prefix ../results/themisto_replicon_indices/GCF_017165095.1_ASM1716509v1_genomic --temp-dir ../results/themisto_replicon_indices/temp --mem-gigas 2 --n-threads 4 --file-colors"

9 hours later, this is what the log file looked like: failed-slurm-11363085.out.txt

When I rerun this command, the run finishes in 8 seconds, here is the log file: rerun-slurm-11372781.out.txt

iosfwd commented 1 month ago

Have you tried to allocate more memory to your batch job? 2G might not be cutting it, since there could be some memory overhead on top of what is passed as an option to Themisto, which in your case is "--mem-gigas 2".

rohanmaddamsetti commented 3 weeks ago

Yes, I have tried this, allocating 6GB per job. I see another stochastic failure (1 out of ~5000 runs), where themisto hangs indefinitely. Here is the output (hung for 6 hours):

22.6940 Mon Jul 1 13:15:20 2024 Themisto-v3.2.1 22.8120 Mon Jul 1 13:15:20 2024 Maximum k-mer length (size of the de Bruijn graph node labels): 32 1031.4840 Mon Jul 1 13:15:21 2024 Build configuration: Sequence file = ../results/themisto_replicon_references/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.txt Index de Bruijn graph output file = ../results/themisto_replicon_indices/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.tdbg Index coloring output file = ../results/themisto_replicon_indices/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.tcolors Temporary directory = ../results/themisto_replicon_indices/temp k = 31 Reverse complements = true Number of threads = 4 Memory gigabytes = 6 Manual colors = false Sequence colors = false File colors = true Load DBG = false Handling of non-ACGT characters = delete Coloring structure type: sdsl-hybrid Verbosity = normal 1036.5070 Mon Jul 1 13:15:21 2024 Starting 1038.7190 Mon Jul 1 13:15:21 2024 Running GGCAT Allocator initialized: mem: 6 GiB chunks: 24576 log2: 18 Started phase: reads bucketing prev stats: Temp buckets files size: 2.01 MiB Finished phase: reads bucketing. phase duration: 1.10s gtime: 1.39s Started phase: kmers merge prev stats: Total color subsets: 3 Finished phase: kmers merge. phase duration: 1.11s gtime: 2.50s Started phase: hashes sorting prev stats: Finished phase: hashes sorting. phase duration: 16.56s gtime: 19.06s Started phase: links compaction prev stats: Completed compaction with 30 iters Finished phase: links compaction. phase duration: 612.27ms gtime: 19.67s Started phase: reads reorganization prev stats: Finished phase: reads reorganization. phase duration: 64.54ms gtime: 19.73s Started phase: unitigs building prev stats: Finished phase: unitigs building. phase duration: 215.84ms gtime: 19.95s Started phase: maximal unitigs links building [step 1] prev stats: Finished phase: maximal unitigs links building [step 1]. phase duration: 7.69ms gtime: 19.96s Started phase: maximal unitigs links building [step 2] prev stats: Finished phase: maximal unitigs links building [step 2]. phase duration: 8.58ms gtime: 19.97s Started phase: maximal unitigs links building [step 3] prev stats: Finished phase: maximal unitigs links building [step 3]. phase duration: 209.29ms gtime: 20.18s Compacted De Bruijn graph construction completed. TOTAL TIME: 20.18s Final stats: phase: reads bucketing => 1.10s phase: kmers merge => 1.11s phase: hashes sorting => 16.56s phase: links compaction => 612.27ms phase: reads reorganization => 64.54ms phase: unitigs building => 215.84ms phase: maximal unitigs links building [step 1] => 7.69ms phase: maximal unitigs links building [step 2] => 8.58ms phase: maximal unitigs links building [step 3] => 209.29ms 25856.8550 Mon Jul 1 13:15:46 2024 Building SBWT 25856.9900 Mon Jul 1 13:15:46 2024 Running KMC counter


Stage 1: 100% Stage 2: 0%

(end of log, themisto got stuck here)

I then re-ran the failed job, with exactly the same parameters (6GB of memory), and it finished in 10 seconds or so:

28.1510 Mon Jul 1 20:20:21 2024 Themisto-v3.2.1 28.3550 Mon Jul 1 20:20:21 2024 Maximum k-mer length (size of the de Bruijn graph node labels): 32 83.0810 Mon Jul 1 20:20:21 2024 Build configuration: Sequence file = ../results/themisto_replicon_references/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.txt Index de Bruijn graph output file = ../results/themisto_replicon_indices/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.tdbg Index coloring output file = ../results/themisto_replicon_indices/GCF_013122055.1_ASM1312205v1_genomic/GCF_013122055.1_ASM1312205v1_genomic.tcolors Temporary directory = ../results/themisto_replicon_indices/temp k = 31 Reverse complements = true Number of threads = 4 Memory gigabytes = 6 Manual colors = false Sequence colors = false File colors = true Load DBG = false Handling of non-ACGT characters = delete Coloring structure type: sdsl-hybrid Verbosity = normal 84.0220 Mon Jul 1 20:20:21 2024 Starting 84.4360 Mon Jul 1 20:20:21 2024 Running GGCAT Allocator initialized: mem: 6 GiB chunks: 24576 log2: 18 Started phase: reads bucketing prev stats: Temp buckets files size: 2.01 MiB Finished phase: reads bucketing. phase duration: 915.31ms gtime: 918.78ms Started phase: kmers merge prev stats: Total color subsets: 3 Finished phase: kmers merge. phase duration: 535.29ms gtime: 1.45s Started phase: hashes sorting prev stats: Finished phase: hashes sorting. phase duration: 641.58ms gtime: 2.10s Started phase: links compaction prev stats: Completed compaction with 30 iters Finished phase: links compaction. phase duration: 615.56ms gtime: 2.71s Started phase: reads reorganization prev stats: Finished phase: reads reorganization. phase duration: 118.70ms gtime: 2.83s Started phase: unitigs building prev stats: Finished phase: unitigs building. phase duration: 36.97ms gtime: 2.87s Started phase: maximal unitigs links building [step 1] prev stats: Finished phase: maximal unitigs links building [step 1]. phase duration: 14.54ms gtime: 2.88s Started phase: maximal unitigs links building [step 2] prev stats: Finished phase: maximal unitigs links building [step 2]. phase duration: 16.37ms gtime: 2.90s Started phase: maximal unitigs links building [step 3] prev stats: Finished phase: maximal unitigs links building [step 3]. phase duration: 136.07ms gtime: 3.03s Compacted De Bruijn graph construction completed. TOTAL TIME: 3.03s Final stats: phase: reads bucketing => 915.31ms phase: kmers merge => 535.29ms phase: hashes sorting => 641.58ms phase: links compaction => 615.56ms phase: reads reorganization => 118.70ms phase: unitigs building => 36.97ms phase: maximal unitigs links building [step 1] => 14.54ms phase: maximal unitigs links building [step 2] => 16.37ms phase: maximal unitigs links building [step 3] => 136.07ms 3338.1230 Mon Jul 1 20:20:25 2024 Building SBWT 3338.2360 Mon Jul 1 20:20:25 2024 Running KMC counter


Stage 1: 100% Stage 2: 100%

6634.8530 Mon Jul 1 20:20:28 2024 Sorting KMC database in1: 100% 7524.0370 Mon Jul 1 20:20:29 2024 Writing nodes and dummies to disk 7608.0230 Mon Jul 1 20:20:29 2024 Uncompressing KMC database to disk 9071.2700 Mon Jul 1 20:20:30 2024 Streaming 9843.0130 Mon Jul 1 20:20:31 2024 Sorting dummies on disk 9938.6730 Mon Jul 1 20:20:31 2024 Merging sorted streams 10418.3910 Mon Jul 1 20:20:32 2024 Building SBWT structure 10443.9400 Mon Jul 1 20:20:32 2024 Building de Bruijn Graph finished (3283326 k-mers) 10444.0200 Mon Jul 1 20:20:32 2024 Building color structure 10444.0380 Mon Jul 1 20:20:32 2024 Building the color mapping Started phase: unitigs reorganization prev stats: Finished phase: unitigs reorganization. phase duration: 122.26ms gtime: 123.27ms Started phase: colormap reading prev stats: 11124.7280 Mon Jul 1 20:20:33 2024 Serializing color structure 11129.8650 Mon Jul 1 20:20:33 2024 Done

jnalanko commented 3 weeks ago

The part where it is hanging is very disk-heavy. My guess is that there is some hangup in the disk IO of the cluster, and Themisto is unable to recover from that. Unfortunately that part of the code is within the KMC library, so it's hard for me to fix.

In the long term I'm planning to get rid of the KMC dependency altogether, so this should be fixed then.