dib-lab / kProcessor

kProcessor: k-mer processing framework.
https://kprocessor.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Indexing failed with huge number of sequences #38

Open mr-eyes opened 4 years ago

mr-eyes commented 4 years ago

Reproduce the issue

1. Required packages

wget -c https://sra-download.ncbi.nlm.nih.gov/traces/sra51/SRR/010757/SRR11015356 -O SRR11015356.sra
fastq-dump --fasta 0 --split-files SRR11015356.sra

3. Creating cDBG

3.1 cDBG creation with k=75
ls -1 *fasta > list_reads
bcalm -kmer-size 75 -max-memory 12000 -out SRR11015356_k75 -in list_reads
3.2 cDBG unitigs fasta file stats
file                        format  type    num_seqs        sum_len  min_len  avg_len  max_len
SRR11015356_k75.unitigs.fa  FASTA   DNA   11,824,622  1,133,741,131       75     95.9    1,683
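The stats in the table above can be reproduced without an external tool; a minimal plain-Python sketch (any FASTA file works, the unitigs file name is just the one from the bcalm step):

```python
def fasta_stats(path):
    """Return (num_seqs, sum_len, min_len, avg_len, max_len) for a FASTA file."""
    lengths = []
    current = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                # Header starts a new record; flush the previous one.
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
        if current:
            lengths.append(current)
    total = sum(lengths)
    return (len(lengths), total, min(lengths), total / len(lengths), max(lengths))
```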

4. kProcessor indexing

4.1 Generating names file
cat SRR11015356_k75.unitigs.fa | grep ">" | awk -F\, '{print (substr($0,2))"\t"substr($0,2,1)}' > SRR11015356_k75.unitigs.fa.names
4.2 Indexing
import kProcessor as kp

fasta_file = "SRR11015356_k75.unitigs.fa"
names_file = fasta_file + ".names"

kSize, Q, mode, chunk_size = 25, 29, 1, int(1e4)

kf = kp.kDataFrameMQF(kSize, Q, mode)

print("Indexing ...")

ckf = kp.index(kf, {"mode" : mode}, fasta_file , chunk_size, names_file)

print("Serializing the index ...")

ckf.save("idx_k25_SRR11015356_k75.unitigs")
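As a back-of-envelope sanity check on the chosen parameters: with Q = 29 a quotient filter has on the order of 2^29 slots, while the unitigs file above contains roughly sum_len − num_seqs × (k − 1) k-mers. This assumes the MQF slot count is about 2^Q; kProcessor may resize internally, so this is only a rough check, not a diagnosis:

```python
# Back-of-envelope capacity check. Assumes a quotient filter with Q bits
# has about 2**Q slots; kProcessor may resize internally.
Q = 29
k = 25
num_seqs = 11_824_622       # from the unitigs stats table above
sum_len = 1_133_741_131     # from the unitigs stats table above

slots = 2 ** Q
# Each sequence of length L contributes L - k + 1 k-mers.
kmers = sum_len - num_seqs * (k - 1)
print(f"slots:  {slots:,}")   # 536,870,912
print(f"k-mers: {kmers:,}")   # 849,950,203 (distinct count is lower, same order)
```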
4.3 Output
Click to expand the output!

```bash
stack trace:
0 [0x7f3c4f5f7c12] /home/mabuelanin/miniconda3/envs/kspider/lib/python3.7/site-packages/_kProcessor.cpython-37m-x86_64-linux-gnu.so(+0x356c12)
1 [0x7f3c50f32f20] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)
2 [0x7f3c51082b53] /lib/x86_64-linux-gnu/libc.so.6(+0x18eb53)
3 [0x7f3c4f6062ad] std::vector >::vector(std::vector > const&) + 0x7d
4 [0x7f3c4f60282a] kProcessor::index(kDataFrame*, std::map, std::allocator >, int, std::less, std::allocator > >, std::allocator, std::allocator > const, int> > >, std::__cxx11::basic_string, std::allocator >, int, std::__cxx11::basic_string, std::allocator >) + 0x147a
5 [0x7f3c4f5e3266] /home/mabuelanin/miniconda3/envs/kspider/lib/python3.7/site-packages/_kProcessor.cpython-37m-x86_64-linux-gnu.so(+0x342266)
6 [0x5638ef64a6f0] _PyMethodDef_RawFastCallKeywords + 0x1e0
7 [0x5638ef64a891] _PyCFunction_FastCallKeywords + 0x21
8 [0x5638ef6b7fce] _PyEval_EvalFrameDefault + 0x4ede
9 [0x5638ef649cfb] _PyFunction_FastCallKeywords + 0xfb
10 [0x5638ef6b7c59] _PyEval_EvalFrameDefault + 0x4b69
11 [0x5638ef5f8929] _PyEval_EvalCodeWithName + 0x2f9
12 [0x5638ef5f97e4] PyEval_EvalCodeEx + 0x44
13 [0x5638ef5f980c] PyEval_EvalCode + 0x1c
14 [0x5638ef711ac4] python(+0x22fac4)
15 [0x5638ef71bdb1] PyRun_FileExFlags + 0xa1
16 [0x5638ef71bfa3] PyRun_SimpleFileExFlags + 0x1c3
17 [0x5638ef71d0bf] python(+0x23b0bf)
18 [0x5638ef71d1dc] _Py_UnixMain + 0x3c
Segmentation fault (core dumped)
```
mr-eyes commented 4 years ago

A correction to the previous issue description

Reproduce the issue

I suspect the issue lies in Python's memory management and deallocation of the flat_parallel_hashmap holding the k-mer chunks. To be followed up ...
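One way to probe the deallocation hypothesis from the Python side is to watch the process's resident memory between chunks. A minimal sketch (the chunk loop and `process_chunk` callback are hypothetical stand-ins for the real k-mer chunk processing; `resource` is POSIX-only):

```python
import resource

def rss_mb():
    """Peak resident set size of this process, in MB (Linux reports KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def watch_memory(chunks, process_chunk):
    """Run process_chunk on each chunk, reporting peak RSS as we go."""
    before = rss_mb()
    for i, chunk in enumerate(chunks):
        process_chunk(chunk)
        print(f"chunk {i}: peak RSS {rss_mb():.1f} MB")
    print(f"growth: {rss_mb() - before:.1f} MB")
```

If peak RSS keeps climbing by roughly one chunk's worth per iteration, the chunks are likely never freed.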

1. Required packages

wget -c https://sra-download.ncbi.nlm.nih.gov/traces/sra51/SRR/010757/SRR11015356 -O SRR11015356.sra
fastq-dump --fasta 0 --split-files SRR11015356.sra

3. Creating cDBG

3.1 cDBG creation with k=75
ls -1 *fasta > list_reads
bcalm -kmer-size 75 -max-memory 12000 -out SRR11015356_k75 -in list_reads
3.2 cDBG unitigs fasta file stats
file                        format  type    num_seqs        sum_len  min_len  avg_len  max_len
SRR11015356_k75.unitigs.fa  FASTA   DNA   11,824,622  1,133,741,131       75     95.9    1,683

4. kProcessor indexing

4.1 Generating names file
grep ">" SRR11015356_k75.unitigs.fa | cut -c2- |  awk -F' ' '{print $0"\t"$1}' > SRR11015356_k75.unitigs.fa.names
4.2 Indexing
import kProcessor as kp

fasta_file = "SRR11015356_k75.unitigs.fa"
names_file = fasta_file + ".names"

kSize, Q, mode, chunk_size = 25, 29, 1, int(1e4)

kf = kp.kDataFrameMQF(kSize, Q, mode)

print("Indexing ...")

ckf = kp.index(kf, {"mode" : mode}, fasta_file , chunk_size, names_file)

print("Serializing the index ...")

ckf.save("idx_k25_SRR11015356_k75.unitigs")
mr-eyes commented 4 years ago

After some debugging in the C++ source code, it is now certain that the huge size of the hashmap legends is the bottleneck.

The huge number of colors causes slow insertion and inflated memory consumption.
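For context on why the legend grows so large: in a colored index, the legend maps each distinct set of source sequences to one color id, so the number of legend entries is bounded by the number of distinct source-set combinations, not the number of sources. With ~11.8M unitigs as sources, shared k-mers keep producing new combinations. A toy sketch of this bookkeeping (not kProcessor's actual data structure):

```python
# Toy color legend: maps each distinct frozenset of source ids to a color id.
# Illustrates why legend entries can outnumber the sources themselves.
class ColorLegend:
    def __init__(self):
        self.legend = {}  # frozenset(source ids) -> color id

    def color_of(self, sources):
        key = frozenset(sources)
        if key not in self.legend:
            self.legend[key] = len(self.legend)  # mint a new color
        return self.legend[key]

legend = ColorLegend()
# Three sources, but four distinct combinations -> four legend entries.
for combo in [{1}, {2}, {1, 2}, {1, 2, 3}]:
    legend.color_of(combo)
print(len(legend.legend))  # 4
```

With n sources there can be up to 2^n − 1 non-empty combinations, which is why insertion slows down and memory inflates as the color count explodes.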