versaille opened this issue 5 years ago
Well... it seems it's quite a large dataset. What is the expected genome size? Is this a single species? Or a metagenome?
It is human exome data with around 100M reads. I downsampled to around 50M reads and it still got stuck there, though.
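For reference, downsampling a FASTQ is commonly done with seqtk; a minimal sketch, assuming single-end reads in `reads.fastq` (the file name and target count are illustrative):

```bash
# Subsample ~50M reads from a FASTQ. The -s seed makes the sampling reproducible,
# and using the same seed on both mate files keeps paired reads in sync.
seqtk sample -s100 reads.fastq 50000000 > reads.50M.fastq
```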
Any update on this thread? I am in a similar situation here. Using 64 cores with 961GB of memory, it has been running for 3 days and seems stuck in the 'subclustering hamming graph' stage:
29:12:14.835   637G / 705G   INFO   K-mer Counting   (kmer_data.cpp : 389)   Collection done, postprocessing.
29:13:59.798   637G / 705G   INFO   K-mer Counting   (kmer_data.cpp : 403)   There are 21074441118 kmers in total. Among them 15445310206 (73.2893%) are singletons.
29:13:59.798   637G / 705G   INFO   General          (main.cpp : 173)   Subclustering Hamming graph
It's a mammalian genome of about 2.5~3 Gb. Any help would be appreciated. Thanks.
It's perfectly ok given the input data size.
Thanks. I was tempted to kill the process. Any estimate for when it will complete? Again, I really appreciate your response!
Well... given the size of the data... I would give it at least a week or two.
machine: 512 GB disk, 750 GB RAM, 16 cores
datasets: 23 Gb and 15 Gb
command that I ran for the 23 Gb dataset:
spades.py -o $OUT_DIR/$name --only-error-correction -s $i -m 500
and here is the log:
The program was stuck at the "Subclustering Hamming graph" step for a week, so I checked the memory and CPU usage. The memory usage is around what's reported in the log file (~250G), and the CPU usage is 15xx%, meaning all 16 cores are in use. According to ps, the job's status is 'R', meaning it is still running.
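A minimal way to check whether such a job is still progressing rather than hung, using generic shell commands and assuming the standard spades.log in the output directory passed to -o:

```bash
# Process state, CPU, memory and elapsed time for anything SPAdes-related.
# The error-correction stage runs as a child process, so match loosely;
# the [s] trick stops grep from matching itself.
ps -eo pid,stat,%cpu,%mem,etime,args | grep -i [s]pades

# SPAdes writes its log into the output directory; if new timestamped lines
# keep appearing, the stage is still making progress.
tail -f $OUT_DIR/$name/spades.log
```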
I then killed the job and re-ran the command below on a smaller (15 Gb) dataset:
spades.py -o $OUT_DIR/$name --only-error-correction -s $i -m 730
And this time the log looks like this:
I'm worried it is now getting stuck at this Subclustering Hamming graph step again. So my questions for the author are: for a dataset of ~20 Gb, how long does this step usually take? Is it expected to be a very time-consuming step in your algorithm? Can I make it faster by adding more CPU cores?
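For what it's worth, spades.py exposes a thread count via -t/--threads (it defaults to 16), so on a machine with more cores the run can at least be told to use them; whether this particular stage scales well with more threads is a question for the developers. A sketch reusing the command above:

```bash
# Same error-correction-only run, with the thread count made explicit
# (-t/--threads; 32 here is just an example for a 32-core machine).
spades.py -o $OUT_DIR/$name --only-error-correction -s $i -m 730 -t 32
```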
Your help is appreciated!