ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/

Stuck at subclustering Hamming graph #187

Open · versaille opened this issue 5 years ago

versaille commented 5 years ago

Machine: 512GB disk, 750GB RAM, 16 cores
Datasets: 23Gb and 15Gb
Command that I ran for the 23Gb dataset: spades.py -o $OUT_DIR/$name --only-error-correction -s $i -m 500
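
For reference, a sketch of the same invocation with the thread count spelled out; spades.py accepts -t/--threads alongside the -m memory cap, and $OUT_DIR, $name and $i are placeholders from my environment:

spades.py -o $OUT_DIR/$name --only-error-correction -s $i -m 500 -t 16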

and here is the log:

1:10:47.728     3G / 9G    INFO   K-mer Counting           (kmer_data.cpp             : 356)   Arranging kmers in hash map order
  1:26:25.725   123G / 123G  INFO    General                 (main.cpp                  : 148)   Clustering Hamming graph.
  7:56:47.542   123G / 123G  INFO    General                 (main.cpp                  : 155)   Extracting clusters
 11:05:11.910   123G / 406G  INFO    General                 (main.cpp                  : 167)   Clustering done. Total clusters: 5796388586
 11:05:16.870    63G / 406G  INFO   K-mer Counting           (kmer_data.cpp             : 376)   Collecting K-mer information, this takes a while.
 11:12:58.457   242G / 406G  INFO   K-mer Counting           (kmer_data.cpp             : 382)   Processing /var/opt/app/data/exome/anno_0043B1_I102_LG04.fastq
 11:57:58.965   242G / 406G  INFO   K-mer Counting           (kmer_data.cpp             : 389)   Collection done, postprocessing.
 12:10:15.501   242G / 406G  INFO   K-mer Counting           (kmer_data.cpp             : 403)   There are 8031291532 kmers in total. Among them 7595306182 (94.5714%) are singletons.
 12:10:15.501   242G / 406G  INFO    General                 (main.cpp                  : 173)   Subclustering Hamming graph

The program was stuck at the "Subclustering Hamming graph" step for a week, so I checked the memory and CPU usage (sketched below). Memory usage is around what the log file shows, ~250G, and CPU usage is 15xx%, meaning all 16 cores are in use. Checking the job with ps, its status is 'R', meaning it is still running.
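
A sketch of the checks described above, assuming the PID of the error-correction process is known (<PID> is a placeholder):

ps -o pid,stat,pcpu,rss,etime -p <PID>   # stat 'R' = running; rss = resident memory in kB
top -b -n 1 -p <PID>                     # %CPU around 1600 means all 16 cores are busy
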
I then killed the job and re-ran the command below on a smaller (15Gb) dataset: spades.py -o $OUT_DIR/$name --only-error-correction -s $i -m 730

And this time the log looks like this:

0:49:25.626     2G / 9G    INFO   K-mer Index Building     (kmer_index_builder.hpp    : 336)   Index built. Total 2414323486 bytes occupied (3.70969 bits per kmer).
  0:49:25.626     2G / 9G    INFO   K-mer Counting           (kmer_data.cpp             : 356)   Arranging kmers in hash map order
  1:00:04.144    80G / 80G   INFO    General                 (main.cpp                  : 148)   Clustering Hamming graph.
  6:01:36.962    80G / 80G   INFO    General                 (main.cpp                  : 155)   Extracting clusters
  7:49:23.697    80G / 269G  INFO    General                 (main.cpp                  : 167)   Clustering done. Total clusters: 3950980107
  7:49:26.725    41G / 269G  INFO   K-mer Counting           (kmer_data.cpp             : 376)   Collecting K-mer information, this takes a while.
  7:54:25.095   157G / 269G  INFO   K-mer Counting           (kmer_data.cpp             : 382)   Processing /var/opt/app/res/exome/sub_50M/sub_anno_0043B1_I102_LG04.fastq
  8:24:21.632   157G / 269G  INFO   K-mer Counting           (kmer_data.cpp             : 389)   Collection done, postprocessing.
  8:31:56.085   157G / 269G  INFO   K-mer Counting           (kmer_data.cpp             : 403)   There are 5206516672 kmers in total. Among them 4908659316 (94.2791%) are singletons.
  8:31:56.085   157G / 269G  INFO    General                 (main.cpp                  : 173)   Subclustering Hamming graph
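
As a quick consistency check using only numbers the log itself prints: the index line reports 2414323486 bytes at 3.70969 bits per k-mer, and 2414323486 × 8 / 3.70969 ≈ 5.21e9, which agrees with the 5206516672 k-mers counted a few lines later.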

I'm worried it will now get stuck at the Subclustering Hamming graph step again. So my questions for the author are: for a dataset of ~20Gb, how long does this step usually take? Is it expected to be a very time-consuming step in your algorithm? Can I make it faster by adding more CPU cores?

Your help is appreciated!

asl commented 5 years ago

Well... it seems it's quite a large dataset. What is the expected genome size? Is this a single species? Or a metagenome?

versaille commented 5 years ago

It is human exome data with around 100M reads. I downsampled to around 50M reads and it still got stuck there, though.
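
For completeness, one common way to do this kind of downsampling, assuming seqtk is installed (the filenames mirror the paths in the logs above and the seed is arbitrary):

seqtk sample -s100 anno_0043B1_I102_LG04.fastq 50000000 > sub_anno_0043B1_I102_LG04.fastq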

donkang75 commented 5 years ago

Any update on this thread? I am in a similar situation here. Using 64 cores with 961GB memory, it has been running for 3 days and seems stuck in the 'subclustering Hamming graph' stage:

 29:12:14.835   637G / 705G  INFO   K-mer Counting           (kmer_data.cpp             : 389)   Collection done, postprocessing.
 29:13:59.798   637G / 705G  INFO   K-mer Counting           (kmer_data.cpp             : 403)   There are 21074441118 kmers in total. Among them 15445310206 (73.2893%) are singletons.
 29:13:59.798   637G / 705G  INFO    General                 (main.cpp                  : 173)   Subclustering Hamming graph

It's a mammalian genome of about 2.5~3G. Any help would be appreciated. Thanks.

asl commented 5 years ago

> Any update on this thread? I am in a similar situation here. Using 64 cores with 961GB memory, it has been running for 3 days and seems stuck in the 'subclustering Hamming graph' stage:

It's perfectly ok given the input data size.

donkang75 commented 5 years ago

Thanks. I was tempted to kill the process. Any estimate of the completion time? Again, your response is much appreciated!

asl commented 5 years ago

Well... given the size of the data... I would give it at least a week or two.