dib-lab / kProcessor

kProcessor: k-mer processing framework.
https://kprocessor.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Indexing failed with huge number of sequences #38

Open mr-eyes opened 4 years ago

mr-eyes commented 4 years ago

Reproduce the issue

1. Required packages

wget -c https://sra-download.ncbi.nlm.nih.gov/traces/sra51/SRR/010757/SRR11015356 -O SRR11015356.sra
fastq-dump --fasta 0 --split-files SRR11015356.sra

3. Creating cDBG

3.1 cDBG creation with k=75
ls -1 *fasta > list_reads
bcalm -kmer-size 75 -max-memory 12000 -out SRR11015356_k75 -in list_reads
3.2 cDBG unitigs fasta file stats
file                        format  type    num_seqs        sum_len  min_len  avg_len  max_len
SRR11015356_k75.unitigs.fa  FASTA   DNA   11,824,622  1,133,741,131       75     95.9    1,683
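The stats in the table above can be reproduced without an external tool; a minimal plain-Python sketch (any FASTA file works, the unitigs file name is just the one from the bcalm step):

```python
def fasta_stats(path):
    """Return (num_seqs, sum_len, min_len, avg_len, max_len) for a FASTA file."""
    lengths = []
    current = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                # Header starts a new record; flush the previous one.
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
        if current:
            lengths.append(current)
    total = sum(lengths)
    return (len(lengths), total, min(lengths), total / len(lengths), max(lengths))
```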

4. kProcessor indexing

4.1 Generating names file
cat SRR11015356_k75.unitigs.fa | grep ">" | awk -F\, '{print (substr($0,2))"\t"substr($0,2,1)}' > SRR11015356_k75.unitigs.fa.names
4.2 Indexing
import kProcessor as kp

fasta_file = "SRR11015356_k75.unitigs.fa"
names_file = fasta_file + ".names"

kSize, Q, mode, chunk_size = 25, 29, 1, int(1e4)

kf = kp.kDataFrameMQF(kSize, Q, mode)

print("Indexing ...")

ckf = kp.index(kf, {"mode" : mode}, fasta_file , chunk_size, names_file)

print("Serializing the index ...")

ckf.save("idx_k25_SRR11015356_k75.unitigs")
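As a back-of-envelope sanity check on the chosen parameters: with Q = 29 a quotient filter has on the order of 2^29 slots, while the unitigs file above contains roughly sum_len − num_seqs × (k − 1) k-mers. This assumes the MQF slot count is about 2^Q; kProcessor may resize internally, so this is only a rough check, not a diagnosis:

```python
# Back-of-envelope capacity check. Assumes a quotient filter with Q bits
# has about 2**Q slots; kProcessor may resize internally.
Q = 29
k = 25
num_seqs = 11_824_622       # from the unitigs stats table above
sum_len = 1_133_741_131     # from the unitigs stats table above

slots = 2 ** Q
# Each sequence of length L contributes L - k + 1 k-mers.
kmers = sum_len - num_seqs * (k - 1)
print(f"slots:  {slots:,}")   # 536,870,912
print(f"k-mers: {kmers:,}")   # 849,950,203 (distinct count is lower, same order)
```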
4.3 Output
Click to expand the output!

```bash
stack trace:
0 [0x7f3c4f5f7c12] /home/mabuelanin/miniconda3/envs/kspider/lib/python3.7/site-packages/_kProcessor.cpython-37m-x86_64-linux-gnu.so(+0x356c12)
1 [0x7f3c50f32f20] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)
2 [0x7f3c51082b53] /lib/x86_64-linux-gnu/libc.so.6(+0x18eb53)
3 [0x7f3c4f6062ad] std::vector >::vector(std::vector > const&) + 0x7d
4 [0x7f3c4f60282a] kProcessor::index(kDataFrame*, std::map, std::allocator >, int, std::less, std::allocator > >, std::allocator, std::allocator > const, int> > >, std::__cxx11::basic_string, std::allocator >, int, std::__cxx11::basic_string, std::allocator >) + 0x147a
5 [0x7f3c4f5e3266] /home/mabuelanin/miniconda3/envs/kspider/lib/python3.7/site-packages/_kProcessor.cpython-37m-x86_64-linux-gnu.so(+0x342266)
6 [0x5638ef64a6f0] _PyMethodDef_RawFastCallKeywords + 0x1e0
7 [0x5638ef64a891] _PyCFunction_FastCallKeywords + 0x21
8 [0x5638ef6b7fce] _PyEval_EvalFrameDefault + 0x4ede
9 [0x5638ef649cfb] _PyFunction_FastCallKeywords + 0xfb
10 [0x5638ef6b7c59] _PyEval_EvalFrameDefault + 0x4b69
11 [0x5638ef5f8929] _PyEval_EvalCodeWithName + 0x2f9
12 [0x5638ef5f97e4] PyEval_EvalCodeEx + 0x44
13 [0x5638ef5f980c] PyEval_EvalCode + 0x1c
14 [0x5638ef711ac4] python(+0x22fac4)
15 [0x5638ef71bdb1] PyRun_FileExFlags + 0xa1
16 [0x5638ef71bfa3] PyRun_SimpleFileExFlags + 0x1c3
17 [0x5638ef71d0bf] python(+0x23b0bf)
18 [0x5638ef71d1dc] _Py_UnixMain + 0x3c
Segmentation fault (core dumped)
```
mr-eyes commented 4 years ago

A correction to the previous issue description

Reproduce the issue

I suspect the issue lies in Python's memory management and deallocation of the flat_parallel_hashmap holding the k-mer chunks. To be followed up ...
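One way to probe the deallocation hypothesis from the Python side is to watch the process's resident memory between chunks. A minimal sketch (the chunk loop and `process_chunk` callback are hypothetical stand-ins for the real k-mer chunk processing; `resource` is POSIX-only):

```python
import resource

def rss_mb():
    """Peak resident set size of this process, in MB (Linux reports KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def watch_memory(chunks, process_chunk):
    """Run process_chunk on each chunk, reporting peak RSS as we go."""
    before = rss_mb()
    for i, chunk in enumerate(chunks):
        process_chunk(chunk)
        print(f"chunk {i}: peak RSS {rss_mb():.1f} MB")
    print(f"growth: {rss_mb() - before:.1f} MB")
```

If peak RSS keeps climbing by roughly one chunk's worth per iteration, the chunks are likely never freed.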

1. Required packages

wget -c https://sra-download.ncbi.nlm.nih.gov/traces/sra51/SRR/010757/SRR11015356 -O SRR11015356.sra
fastq-dump --fasta 0 --split-files SRR11015356.sra

3. Creating cDBG

3.1 cDBG creation with k=75
ls -1 *fasta > list_reads
bcalm -kmer-size 75 -max-memory 12000 -out SRR11015356_k75 -in list_reads
3.2 cDBG unitigs fasta file stats
file                        format  type    num_seqs        sum_len  min_len  avg_len  max_len
SRR11015356_k75.unitigs.fa  FASTA   DNA   11,824,622  1,133,741,131       75     95.9    1,683

4. kProcessor indexing

4.1 Generating names file
grep ">" SRR11015356_k75.unitigs.fa | cut -c2- |  awk -F' ' '{print $0"\t"$1}' > SRR11015356_k75.unitigs.fa.names
4.2 Indexing
import kProcessor as kp

fasta_file = "SRR11015356_k75.unitigs.fa"
names_file = fasta_file + ".names"

kSize, Q, mode, chunk_size = 25, 29, 1, int(1e4)

kf = kp.kDataFrameMQF(kSize, Q, mode)

print("Indexing ...")

ckf = kp.index(kf, {"mode" : mode}, fasta_file , chunk_size, names_file)

print("Serializing the index ...")

ckf.save("idx_k25_SRR11015356_k75.unitigs")
mr-eyes commented 4 years ago

After some debugging in the C++ source code, it is now certain that the huge size of the hashmap legends is the bottleneck.

The huge number of colors causes slow insertion and inflated memory consumption.
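For context on why the legend grows so large: in a colored index, the legend maps each distinct set of source sequences to one color id, so the number of legend entries is bounded by the number of distinct source-set combinations, not the number of sources. With ~11.8M unitigs as sources, shared k-mers keep producing new combinations. A toy sketch of this bookkeeping (not kProcessor's actual data structure):

```python
# Toy color legend: maps each distinct frozenset of source ids to a color id.
# Illustrates why legend entries can outnumber the sources themselves.
class ColorLegend:
    def __init__(self):
        self.legend = {}  # frozenset(source ids) -> color id

    def color_of(self, sources):
        key = frozenset(sources)
        if key not in self.legend:
            self.legend[key] = len(self.legend)  # mint a new color
        return self.legend[key]

legend = ColorLegend()
# Three sources, but four distinct combinations -> four legend entries.
for combo in [{1}, {2}, {1, 2}, {1, 2, 3}]:
    legend.color_of(combo)
print(len(legend.legend))  # 4
```

With n sources there can be up to 2^n − 1 non-empty combinations, which is why insertion slows down and memory inflates as the color count explodes.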