Closed by franciscozorrilla 4 years ago
Hi @franciscozorrilla,
The comma separated numbers show 1. the current iteration, 2. estimates of the model likelihood and 3. the difference in the likelihood from the previous iteration. All of these numbers can be safely ignored in the normal case.
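For example, to watch convergence you can pull out that third column (the change in likelihood) from the log lines; a minimal sketch on an invented toy excerpt of such output:

```shell
# Invented toy excerpt of the iteration output: iteration, log-likelihood, delta
printf '0,-103139817.03,14900613.05\n1,-97987512.99,5152304.03\n' > toy_iters.csv

# Third comma-separated field = change in likelihood; it should shrink
# towards zero as the run converges.
cut -d"," -f3 toy_iters.csv
```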
Aha, if the default setting is still parallelized, that might work well.
The recommended usage is to set -t 20 if you have 20 CPUs available. The problem you describe is reported here: #232 and should be fixed in the latest bioconda version of CONCOCT (released only two weeks ago).
I think you should abort the run with the endless error messages since the printing will slow down the run severely.
Please keep me informed on your progress. Perhaps try out the brand new gitter channel: https://gitter.im/BinPro/CONCOCT
Hi @alneberg,
Thanks for the tip about the gitter channel, I will use it for posting small questions like these in the future. And sorry for posting about an existing issue! I have followed your instructions from issue #232 to fix the issue, and am now testing using the -t 20 parameter instead of the default settings. Hopefully this speeds up the process! Will keep you posted.
Update
CONCOCT job started running just over 2 days ago on a 128GB RAM node with 20 CPUs.
CPU usage over time:
CPU usage at arbitrary time point (~54hr into job):
Note that the accumulated CPU time is almost 1,000 hours
Abridged slurm log:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 concoct
1
rule concoct:
input: concoct-input/concoct_inputtable.tsv
output: concoct-output/clustering_gt1000.csv
jobid: 0
Up and running. Check /binning/pipeTest4/concoct-output/log.txt for progress
Setting 20 OMP threads
Generate input data
0,-103139817.027521,14900613.051910
1,-97987512.997802,5152304.029719
2,-95886695.918660,2100817.079141
...
376,-91134187.186326,31.962741
377,-91134226.940277,39.753951
Fortunately, fixing the OpenBLAS warning as shown in #232 and using -t 20
seems to improve speed dramatically. However, the process is still quite time-consuming.
Do you think I should try running the job on a 512 or 1024 GB RAM node instead of 128? Or would this just be a waste/over-allocation of resources?
That does look much better! I would just keep running this one for as long as you are allowed to and hope that it finishes before that. I don't think increasing the RAM makes much difference. Thanks for reporting back!
The job just finished!
925,-91133399.572133,0.000383
926,-91133399.572284,0.000150
927,-91133399.572229,0.000055
CONCOCT Finished, the log shows how it went.
1012380 31637 20
Finished job 0.
1 of 1 steps (100%) done
Complete log: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/pipeTest4/.snakemake/log/2019-01-12T072929.556421.snakemake.log
Should I care about what the three numbers after the CONCOCT Finished... line mean, i.e. 1012380 31637 20?
Here is the log file as well:
2019-01-12 07:31:21,080:INFO:root:Results created at /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/pipeTest4/concoct-output
2019-01-12 07:51:03,919:INFO:root:Successfully loaded composition data.
2019-01-12 07:52:40,140:INFO:root:Successfully loaded coverage data.
2019-01-12 07:53:00,184:INFO:root:Performed PCA, resulted in 118 dimensions
2019-01-12 07:59:53,037:INFO:root:Wrote original filtered data file.
2019-01-12 08:03:08,539:INFO:root:Wrote PCA transformed file.
2019-01-12 08:03:09,091:INFO:root:Wrote PCA components file.
2019-01-12 08:03:09,091:INFO:root:PCA transformed data.
2019-01-12 08:03:09,091:INFO:root:Will call vbgmm with parameters: concoct-output/, 400, 1000, 20
2019-01-17 12:35:45,714:INFO:root:Wrote assign file.
2019-01-17 12:35:45,737:INFO:root:CONCOCT Finished
It took 5 days to finish the job, which is not ideal. I did, however, neglect to do a quality-filtering step on my .fastq files at the start of my workflow, so hopefully after I implement this step and run CONCOCT again it will run faster. Any comments, concerns, or suggestions?
Great!
I am actually not able to locate the line printing those last numbers. @chrisquince, could you help me out, please?
@franciscozorrilla, it would be nice to see how many contigs were above the length threshold. Would you mind running wc -l on the clustering output file clustering_gt1000.csv?
No problemo
wc -l clustering_gt1000.csv
1012381 clustering_gt1000.csv
Is this a reasonable number?
Hi @franciscozorrilla, that all looks fine. Those three numbers are unimportant. The important line is the last one:
927,-91133399.572229,0.000055
The small third value indicates convergence. The other important thing is to compare the number of clusters you obtained:
cut -d"," -f2 clustering_gt1000.csv | sort | uniq -c | wc
against the number you started with (default 400). Ideally you want 50% or fewer, i.e. fewer than 200 clusters.
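For anyone reading along, here is that pipeline run on an invented toy clustering file (using wc -l instead of plain wc, so that only the line count is printed; note that the real file also has a header line, which this count would include):

```shell
# Invented toy clustering file mimicking clustering_gt1000.csv (contig,cluster)
printf 'c1,5\nc2,5\nc3,12\nc4,7\nc5,12\n' > toy_clustering.csv

# Sort the cluster column so uniq -c can collapse repeated ids, then count
# the distinct cluster ids that remain.
cut -d"," -f2 toy_clustering.csv | sort | uniq -c | wc -l
```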
Thanks, Chris
Thank you Chris!
Ah, there we have the first of the three printed numbers: 1012381 - 1 = 1012380 contigs longer than 1 kb (the subtracted line is the header). That's quite a lot. @chrisquince, would you say 5 days with 20 CPUs and 1M contigs is a reasonable runtime?
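Since wc -l counts the header line too, the contig count is the line count minus one; a quick sketch with invented file contents:

```shell
# Toy file with a header row, mimicking clustering_gt1000.csv
printf 'contig_id,cluster_id\nc1,5\nc2,7\nc3,5\n' > toy_gt1000.csv

# wc -l counts the header too, so subtract 1 to get the number of contigs
n_contigs=$(( $(wc -l < toy_gt1000.csv) - 1 ))
echo "$n_contigs"
```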
Also, I don't think quality filtering the fastq files would make CONCOCT go faster. It's still recommended, though, since you would get a higher-quality assembly, likely longer contigs, and in the end better bins.
Yes, I would actually say that is rather fast. Agree about quality filtering.
Ok, good to know that this is an expected run time length!
@chrisquince
cut -d"," -f2 clustering_gt1000.csv | sort | uniq -c | wc
394 788 4625
I did leave the default value of clusters at 400. Does this mean the results are unreliable?
Well, it means that ideally you should increase the starting cluster number and rerun with the -c 800 flag set. That will take twice as long, though, unless you can get more cores. There is another strategy, which will give results that are almost as good: refining the clusters. That is explained here, although we still need some coherent documentation:
https://github.com/chrisquince/StrainMetaSim
see command:
concoct_refine clustering_gt1000_R.csv original_data_gt1000.csv clustering_gt1000_scg_sort.csv
Thanks, Chris
I see, thank you for the explanation! I will look into getting more cores for a future run, for now I will try the concoct_refine
command.
OK that is not well documented, sorry, you need to prepare input with the following commands:
sed '1d' clustering_gt1000.csv > clustering_gt1000_R.csv
cut -f1,3- < clustering_gt1000_scg.tsv | tr "\t" "," > clustering_gt1000_scg.csv
$CONCOCT/scripts/Sort.pl < clustering_gt1000_scg.csv > clustering_gt1000_scg_sort.csv
Assuming you have the scg table?
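To make the first two preparation steps concrete, here is a toy run. The file contents below are invented for illustration (the real clustering_gt1000_scg.tsv layout may differ), so run this in a scratch directory, as it would overwrite the real files; the Sort.pl step is skipped since that script ships with CONCOCT:

```shell
# Invented toy inputs; real files come from CONCOCT and the SCG evaluation.
# Run in a scratch directory: these printf calls would overwrite real files.
printf 'contig_id,cluster_id\nc1,5\nc2,7\n' > clustering_gt1000.csv
printf 'c1\tCOG1\t2\t1\nc2\tCOG2\t1\t0\n' > clustering_gt1000_scg.tsv

# 1. Drop the header line from the clustering file
sed '1d' clustering_gt1000.csv > clustering_gt1000_R.csv

# 2. Drop column 2 of the scg table and convert tabs to commas
cut -f1,3- < clustering_gt1000_scg.tsv | tr "\t" "," > clustering_gt1000_scg.csv

cat clustering_gt1000_R.csv clustering_gt1000_scg.csv
```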
Which file is the scg table and where does it come from? From what I understand this is associated with the COG based evaluation right?
At @alneberg 's suggestion I moved away from the COG based evaluation and am now using CheckM. Is it possible to use the output from CheckM to refine clusters?
OK, I will have a go at adapting the program to take CheckM input; it may take a couple of days though.
That would be amazing @chrisquince!
Could someone give me some example checkM output from a run with concoct files too?
Sure, I opened issue #240 to show the evaluation of CONCOCT-test-data-0.3.2 using CheckM. I will now upload the actual output folder with all the files generated. Thanks!
Thank you @franciscozorrilla!
@alneberg No prob! let me know if you would like any other files/info.
Also, I ran CheckM on the above-mentioned CONCOCT run (not the test data), and yeah, it looks like many bins have > 100% contamination, verifying that the default cluster parameter value (-c 400) for CONCOCT was indeed too low. Here are the CheckM results if anyone is interested (statistics are at the very end of the file):
Hello again!
I am wondering what exactly the information in the slurm log file is telling me:
In particular, I am wondering what the comma-separated numbers at the bottom mean? I can't seem to find any information about this.
Also, would CONCOCT run faster with an increased number of OMP threads? I tried this by using the -t 20 option, but this seemed to generate a 245 GB (and growing) slurm file containing the following warning over and over again:
OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.
I am not sure if CONCOCT was running properly despite the warning, but I had to quit the job due to the large amount of disk space it was using. Perhaps this warning can be silenced somehow?
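In case it helps others who hit this before a proper fix: one possible workaround (an assumption on my part, not a verified fix) is to pin OpenBLAS to a single thread via its environment variable so it stops clashing with CONCOCT's own OpenMP loop:

```shell
# Sketch of a possible workaround (untested assumption): keep OpenBLAS
# single-threaded and let CONCOCT's own OpenMP parallelism use the cores.
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=20
echo "$OPENBLAS_NUM_THREADS $OMP_NUM_THREADS"
# ...then launch CONCOCT as before, e.g. with -t 20
```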
In any case, CONCOCT seems to be parallelizing with only 1 OMP thread (default setting) as shown below:
Also, here is the CONCOCT log:
Regards, Francisco