BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

Performance of the latest version #234

Closed bioinformatist closed 5 years ago

bioinformatist commented 5 years ago

Dear developers, after installing it with conda, I prepared my data with the example commands listed in the GitHub README, using the executables cut_up_fasta.py and concoct_coverage_table.py. I'm working on an HPC cluster and I submitted a job via qsub with a PBS script like this:

#!/bin/bash -l
#PBS -m abe
#PBS -M suny226@mail2.sysu.edu.cn
#PBS -q fat_q
#PBS -N concoct
#PBS -l walltime=5000:00:00,nodes=1:ppn=96,mem=2900gb
cd $PBS_O_WORKDIR
source activate ysun-env
concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_output/ -t 96 -o
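
For reference, the README-style preparation steps are roughly as follows (the BAM paths are placeholders; the output names match the files used in the concoct call above):

# cut contigs into 10 kb fragments and record the fragment coordinates in a BED file
cut_up_fasta.py original_contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa
# build the per-fragment coverage table from sorted, indexed BAM files
concoct_coverage_table.py contigs_10K.bed mapping/Sample*.sorted.bam > coverage_table.tsv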

It seems to work well for the first several hours, and I get some intermediate/partial results (screenshot of the output files attached).

Then things got strange after two days. I checked the job status with qstat (screenshot attached):

It has used about 2091 hours of CPU time and ~1.5 days of real time. I also checked the log file (screenshot attached). As you can see from the last line of the log, it has stayed in that state for more than a day of wall-clock time. I'm sorry that I'm not very familiar with CONCOCT's internals, so could you help me with this? I wonder whether there is a mistake in the parameter values I assigned, or something else (I'm still new to metagenomics analysis). I cannot start a new job until this calculation is done, and I'm not sure whether CONCOCT really needs this long for my data. The sizes of my input files are listed below (screenshot attached). Can you help me estimate the running time needed? Or do you have any other suggestions? Thank you!

alneberg commented 5 years ago

Thank you for your detailed report! While this new version of CONCOCT is much faster than the old one, it will still take a considerable amount of time on large datasets. I don't spot any errors. It is supposed to stay in the "Will call vbgmm with parameters..." state for a very long time compared to the other states, so that part should be fine. Could you somehow check the percentage of CPUs actually in use, e.g. with top or htop? I'm not very familiar with PBS, but a potential problem would be if the parallel execution fails and only 1 CPU is used. Then again, your CPU-hour count suggests it is using more than 1 CPU?
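
One way to check this (a sketch only, assuming you can log in to the compute node; the job id and node name are placeholders):

# find which node the job is running on
qstat -f <jobid> | grep exec_host
# then log in to that node and check how busy the concoct process actually is
ssh <node>
top -b -n 1 -p "$(pgrep -f 'concoct --composition_file' | head -n 1)"
# ~9600 %CPU would mean all 96 threads are busy; ~100 %CPU would mean it has fallen back to a single thread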

alneberg commented 5 years ago

@bioinformatist, are there any updates to this? I hope you're not seeing the issue reported here: #232. That issue should be fixed in the latest build on conda though.

alneberg commented 5 years ago

Closed due to no response.

jolespin commented 4 years ago

I had a few questions regarding performance:

My scaffolds.fasta from metaspades.py is 383 MB with 821,204 contigs and the following stats under default QUAST settings: N50 = 1385, L50 = 24779, total length = 176,321,802, GC % = 52.04, # N's per 100 kbp = 20.85.

(1) How long is CONCOCT expected to take with 1 thread? This has been running for a few days.
(2) Is this going to go all the way up to 500 iterations?
(3) Are these the recommended settings for such an assembly?

The command I ended up running is the following:

# indexing .bam alignment files...
for FILE in ${out}/work_files/*.bam; do
    echo $FILE
    samtools index -@ $threads -b $FILE
done

# cutting up contigs into 10kb fragments for CONCOCT...
cut_up_fasta.py ${out}/work_files/assembly.fa -c 10000 --merge_last -b ${out}/work_files/assembly_10K.bed -o 0 > ${out}/work_files/assembly_10K.fa

comm "estimating contig fragment coverage..."   
CMD="concoct_coverage_table.py ${out}/work_files/assembly_10K.bed ${out}/work_files/*.bam > ${out}/work_files/concoct_depth.txt"
$(eval $CMD)

# Starting binning with CONCOCT...
mkdir ${out}/work_files/concoct_out

concoct -l $len -t $threads \
    --coverage_file ${out}/work_files/concoct_depth.txt \
    --composition_file ${out}/work_files/assembly_10K.fa \
    -b ${out}/work_files/concoct_out

# merging 10kb fragments back into contigs
merge_cutup_clustering.py ${out}/work_files/concoct_out/clustering_gt${len}.csv > ${out}/work_files/concoct_out/clustering_gt${len}_merged.csv

The log output so far:

Up and running. Check /local/ifs3_scratch/METAGENOMICS/jespinoz/Plastisphere/metagenomics_output/14-NT-02-bblueishsquare_S8/intermediate/metawrap_output/initial_binning/work_files/concoct_out/log.txt for progress
/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/concoct/input.py:82: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  cov = p.read_table(cov_file, header=0, index_col=0)
Setting 1 OMP threads
Generate input data
0,-22478879.925979,110293.849346
1,-22372193.322834,106686.603145
2,-22308208.805565,63984.517269
3,-22268555.722560,39653.083005
4,-22240955.284350,27600.438210
5,-22219202.592447,21752.691903
6,-22201362.189253,17840.403194
7,-22185066.678776,16295.510477
8,-22172272.391143,12794.287633
9,-22162387.590844,9884.800299
10,-22153638.569697,8749.021147
11,-22145328.208397,8310.361300
12,-22136875.012161,8453.196236
13,-22130160.226035,6714.786126
14,-22124832.318769,5327.907266
15,-22118605.750693,6226.568077
16,-22113714.392076,4891.358616
17,-22111116.150348,2598.241728
18,-22109266.615834,1849.534514
19,-22107863.425041,1403.190794
20,-22106756.169494,1107.255547
21,-22106101.883880,654.285614
22,-22104214.806839,1887.077041
23,-22102074.925793,2139.881046
24,-22097765.904718,4309.021074
25,-22089623.991813,8141.912905
26,-22078115.811542,11508.180272
27,-22070646.066715,7469.744827
28,-22066931.310706,3714.756010
29,-22062770.479958,4160.830747
30,-22057271.075967,5499.403991
31,-22053342.104817,3928.971150
32,-22045602.306856,7739.797961
33,-22040200.511516,5401.795340
34,-22030991.852705,9208.658811
35,-22020419.661502,10572.191203
36,-22006318.021295,14101.640207
37,-21994156.113599,12161.907696
38,-21981359.496225,12796.617375
39,-21968312.520165,13046.976060
40,-21956435.834304,11876.685860
41,-21944553.751284,11882.083021
42,-21929697.158868,14856.592415
43,-21916803.763705,12893.395164
44,-21903851.013182,12952.750522
45,-21890446.806437,13404.206745
46,-21874740.627558,15706.178879
47,-21865674.969252,9065.658306
48,-21856575.788741,9099.180511
49,-21849709.637546,6866.151195
50,-21843665.232820,6044.404726
51,-21838480.125516,5185.107304
52,-21833709.657442,4770.468074
53,-21829377.443746,4332.213696
54,-21820237.485109,9139.958637
55,-21814441.270689,5796.214421
56,-21809484.971056,4956.299632
57,-21800814.333545,8670.637512
58,-21798446.875325,2367.458219
59,-21793927.827017,4519.048308
60,-21788122.657859,5805.169158
61,-21784011.245365,4111.412494
62,-21779197.345067,4813.900297
63,-21774082.359272,5114.985796
64,-21771751.937424,2330.421847
65,-21768184.061049,3567.876375
66,-21765070.096427,3113.964622
67,-21763968.365879,1101.730548
68,-21761862.025429,2106.340451
69,-21759410.013561,2452.011868
70,-21756235.562102,3174.451459
71,-21751368.208441,4867.353661
72,-21749678.692535,1689.515907
73,-21747133.931666,2544.760868
74,-21745561.225005,1572.706661
75,-21743970.373048,1590.851958
76,-21741531.189675,2439.183373
77,-21740318.528715,1212.660960
78,-21737117.091371,3201.437343
79,-21735683.925048,1433.166323
80,-21734533.121224,1150.803824
81,-21733541.456714,991.664510
82,-21732023.608209,1517.848505
83,-21731831.257730,192.350479
84,-21731676.411725,154.846005
alneberg commented 4 years ago

Hi @jolespin. I cannot spot any error in your script.

I'm afraid running with 1 thread is not a good idea, but I assume you are doing this due to the #232 error? In that case, I'm happy to tell you that I believe that error is now finally resolved in the current conda installation.
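
In case it helps, a sketch of how the conda package could be updated and checked (the environment name is taken from the paths in your log; the channel setup may differ on your system):

# pull the latest bioconda build into the existing environment
conda install -n metagenomics_env -c bioconda -c conda-forge concoct
# confirm which version is now installed
conda list -n metagenomics_env concoct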

While running times are very hard to estimate, I recently had a report of about 5 days for 1M contigs using 20 threads, so you're looking at a very long run time... most likely around 10 to 20 times slower than that due to the lack of parallelisation.
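
(As a rough back-of-envelope figure only: if ~1M contigs take ~5 days on 20 threads, and a single-threaded run is 10 to 20 times slower, an assembly of roughly 800k contigs would land somewhere on the order of one to three months.)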

It's probably worth testing the new installation with more threads on a small data set, and if that works fine, cancelling this job?
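
A minimal way to verify that the new installation actually runs in parallel, reusing the inputs from the script above (the test output directory is just a placeholder; for a quicker check one could also point it at a smaller assembly):

# short smoke test: same inputs, more threads, separate output directory
mkdir -p ${out}/work_files/concoct_test_out
concoct -l $len -t 20 \
    --coverage_file ${out}/work_files/concoct_depth.txt \
    --composition_file ${out}/work_files/assembly_10K.fa \
    -b ${out}/work_files/concoct_test_out
# the new log.txt should report "Setting 20 OMP threads" instead of "Setting 1 OMP threads",
# and top should show the concoct process running well above 100% CPU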