BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

Question about CONCOCT slurm log information and parallelization #238

Closed: franciscozorrilla closed this issue 4 years ago

franciscozorrilla commented 5 years ago

Hello again!

I am wondering what exactly the information in the slurm log file is telling me:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   concoct
    1

rule concoct:
    input: concoct-input/concoct_inputtable.tsv
    output: concoct-output/clustering_gt1000.csv
    jobid: 0

Up and running. Check /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/pipeTest4/concoct-output/log.txt for progress
Setting 1 OMP threads
Generate input data
0,-103139817.027521,14900613.051910
1,-97987512.997802,5152304.029719
2,-95886695.918660,2100817.079141
3,-94823154.517955,1063541.400706
4,-94152313.277927,670841.240027
5,-93730494.103278,421819.174649
6,-93438200.698655,292293.404622
7,-93210526.395924,227674.302731
8,-93038887.381231,171639.014693
9,-92902578.178970,136309.202261
10,-92796577.324625,106000.854345
11,-92673930.627172,122646.697453
12,-92546673.603888,127257.023284
13,-92466522.085908,80151.517980
14,-92411114.468767,55407.617141
15,-92342712.132324,68402.336443
16,-92281572.631503,61139.500821
17,-92227835.787832,53736.843671
18,-92164554.098474,63281.689358
19,-92105295.113753,59258.984720
20,-92039941.789678,65353.324075
21,-91979566.483008,60375.306670
22,-91907217.940677,72348.542331

In particular, I am wondering what the comma-separated list of numbers at the bottom means? I can't seem to find any information about this.

Also, would CONCOCT run faster by increasing the number of OMP threads? I tried this by using the option:

-t 20

But this seemed to generate a 245 GB (and growing) slurm log file containing the following warning over and over again:

OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option.

I am not sure whether CONCOCT was running properly despite the warning, but I had to kill the job because of the disk space it was consuming. Perhaps this warning can be silenced somehow?

In any case, CONCOCT seems to be parallelizing even with only 1 OMP thread (the default setting), as shown below:

[screenshot: CPU usage showing multiple cores active]

Also, here is the CONCOCT log:

2018-12-31 01:09:09,979:INFO:root:Results created at /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/pipeTest4/concoct-output
2018-12-31 01:29:06,637:INFO:root:Successfully loaded composition data.
2018-12-31 01:30:36,028:INFO:root:Successfully loaded coverage data.
2018-12-31 01:30:55,234:INFO:root:Performed PCA, resulted in 118 dimensions
2018-12-31 01:37:51,430:INFO:root:Wrote original filtered data file.
2018-12-31 01:41:05,758:INFO:root:Wrote PCA transformed file.
2018-12-31 01:41:05,775:INFO:root:Wrote PCA components file.
2018-12-31 01:41:05,775:INFO:root:PCA transformed data.
2018-12-31 01:41:05,775:INFO:root:Will call vbgmm with parameters: concoct-output/, 400, 1000, 1

Regards, Francisco

alneberg commented 5 years ago

Hi @franciscozorrilla,

The comma-separated numbers show (1) the current iteration, (2) an estimate of the model likelihood, and (3) the difference in likelihood from the previous iteration. All of these numbers can be safely ignored in the normal case.
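If you do want to keep an eye on convergence, a quick sketch along these lines should work (slurm.out here is just a placeholder for your actual slurm log name):

grep -E '^[0-9]+,' slurm.out | tail -n 3 | awk -F, '{print "iteration", $1, "delta:", $3}'

Once the third column approaches zero, the run is essentially converged.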

Aha, if it is still parallelized using the default setting, that might work well.

The recommended usage is to set -t 20 if you have 20 CPUs available. The problem you describe is reported in #232 and should be fixed in the latest bioconda version of CONCOCT (released only 2 weeks ago).
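For reference, a direct CONCOCT call with threading set would look something like this (the input paths here are placeholders for your own coverage table and assembly):

concoct --coverage_file concoct-input/concoct_inputtable.tsv --composition_file contigs.fa -b concoct-output/ -t 20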

I think you should abort the run with the endless error messages since the printing will slow down the run severely.

Please keep me informed on your progress. Perhaps try out the brand new gitter channel: https://gitter.im/BinPro/CONCOCT

franciscozorrilla commented 5 years ago

Hi @alneberg,

Thanks for the tip about the gitter channel, I will use it for small questions like these in the future. And sorry for posting about an existing issue! I have followed your instructions from #232, and am now testing with the -t 20 parameter instead of the default setting. Hopefully this speeds up the process! Will keep you posted.

franciscozorrilla commented 5 years ago

Update

CONCOCT job started running just over 2 days ago on a 128GB RAM node with 20 CPUs.

CPU usage over time: [two screenshots of CPU usage over time]

CPU usage at an arbitrary time point (~54 hr into the job): [screenshot] Note that the accumulated CPU time is almost 1000 hrs.

Abridged slurm log:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   concoct
    1

rule concoct:
    input: concoct-input/concoct_inputtable.tsv
    output: concoct-output/clustering_gt1000.csv
    jobid: 0

Up and running. Check /binning/pipeTest4/concoct-output/log.txt for progress
Setting 20 OMP threads
Generate input data
0,-103139817.027521,14900613.051910
1,-97987512.997802,5152304.029719
2,-95886695.918660,2100817.079141
              ...
376,-91134187.186326,31.962741
377,-91134226.940277,39.753951

Fortunately, fixing the OpenBLAS warning as shown in #232 and using -t 20 seems to improve speed dramatically. However, the process is still quite time consuming.

Do you think I should try running the job on a 512 or 1024 GB RAM node instead of 128 GB? Or would this just be a waste/over-allocation of resources?

alneberg commented 5 years ago

That does look much better! I would just keep running this one for as long as you are allowed to and hope that it finishes before that. I don't think increasing the RAM makes much difference. Thanks for reporting back!

franciscozorrilla commented 5 years ago

The job just finished!

925,-91133399.572133,0.000383
926,-91133399.572284,0.000150
927,-91133399.572229,0.000055
CONCOCT Finished, the log shows how it went.
1012380 31637 20
Finished job 0.
1 of 1 steps (100%) done
Complete log: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/pipeTest4/.snakemake/log/2019-01-12T072929.556421.snakemake.log

Should I care what the 3 numbers after the "CONCOCT Finished..." line mean, i.e. 1012380 31637 20?

Here is the log file as well:

2019-01-12 07:31:21,080:INFO:root:Results created at /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/pipeTest4/concoct-output
2019-01-12 07:51:03,919:INFO:root:Successfully loaded composition data.
2019-01-12 07:52:40,140:INFO:root:Successfully loaded coverage data.
2019-01-12 07:53:00,184:INFO:root:Performed PCA, resulted in 118 dimensions
2019-01-12 07:59:53,037:INFO:root:Wrote original filtered data file.
2019-01-12 08:03:08,539:INFO:root:Wrote PCA transformed file.
2019-01-12 08:03:09,091:INFO:root:Wrote PCA components file.
2019-01-12 08:03:09,091:INFO:root:PCA transformed data.
2019-01-12 08:03:09,091:INFO:root:Will call vbgmm with parameters: concoct-output/, 400, 1000, 20
2019-01-17 12:35:45,714:INFO:root:Wrote assign file.
2019-01-17 12:35:45,737:INFO:root:CONCOCT Finished

It took 5 days to finish the job, which is not ideal. I did, however, neglect to do a quality filtering step on my .fastq files at the start of my workflow, so hopefully after I implement this step and run CONCOCT again it will run faster. Any comments, concerns, or suggestions?

alneberg commented 5 years ago

Great!

I am actually not able to locate the line printing those last numbers. @chrisquince, could you help me out, please?

@franciscozorrilla, it would be nice to see how many contigs were above the length threshold. Would you mind running wc -l on the clustering output file clustering_gt1000.csv?

franciscozorrilla commented 5 years ago

No problemo

wc -l clustering_gt1000.csv
1012381 clustering_gt1000.csv

Is this a reasonable number?

chrisquince commented 5 years ago

Hi @franciscozorrilla, that all looks fine. Those three numbers are unimportant. The important line is:

927,-91133399.572229,0.000055

The small third value indicates convergence. The other important thing is to compare the number of clusters you obtained:

cut -d"," -f2 clustering_gt1000.csv | sort | uniq -c | wc

against the number you started with (default 400). Ideally you want at most 50% of that, i.e. fewer than 200.
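Equivalently, since the clustering file has a header line, the following prints the cluster count as a single number:

sed '1d' clustering_gt1000.csv | cut -d"," -f2 | sort -u | wc -l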

Thanks, Chris

alneberg commented 5 years ago

Thank you Chris!

Ah, there we have the first of the three printed numbers: 1012381 - 1 = 1012380 contigs longer than 1 kb (the subtraction accounts for the header line). That's quite a lot. @chrisquince, would you say that is a reasonable runtime: 5 days with 20 CPUs and 1M contigs?
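(For the record, you can get that contig count directly with echo $(( $(wc -l < clustering_gt1000.csv) - 1 )), where the minus one drops the header line.)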

Also, I don't think quality filtering the fastq files would make CONCOCT go faster. It's still recommended though, since you would get a higher quality assembly, likely longer contigs, and in the end better bins.

chrisquince commented 5 years ago

Yes, I would actually say that is rather fast. Agree about quality filtering.

franciscozorrilla commented 5 years ago

Ok, good to know that this is an expected run time length!

@chrisquince

cut -d"," -f2 clustering_gt1000.csv | sort | uniq -c | wc
    394     788    4625

I did leave the number of clusters at the default of 400. Does this mean the results are unreliable?

chrisquince commented 5 years ago

Well, it means that ideally you should increase the starting cluster number and rerun with the

-c 800

flag set. That will take twice as long though, unless you can get more cores. There is another strategy which gives results that are almost as good: refining the clusters. It is explained here, but we still need some coherent documentation:

https://github.com/chrisquince/StrainMetaSim

see command:

concoct_refine clustering_gt1000_R.csv original_data_gt1000.csv clustering_gt1000_scg_sort.csv

Thanks, Chris

franciscozorrilla commented 5 years ago

I see, thank you for the explanation! I will look into getting more cores for a future run, for now I will try the concoct_refine command.

chrisquince commented 5 years ago

OK, that is not well documented, sorry. You need to prepare the input with the following commands:

# strip the header line from the clustering output
sed '1d' clustering_gt1000.csv > clustering_gt1000_R.csv
# convert the SCG table from tab- to comma-separated, dropping the second column
cut -f1,3- < clustering_gt1000_scg.tsv | tr "\t" "," > clustering_gt1000_scg.csv
# sort the SCG table with the helper script shipped with CONCOCT
$CONCOCT/scripts/Sort.pl < clustering_gt1000_scg.csv > clustering_gt1000_scg_sort.csv

Assuming you have the scg table?

franciscozorrilla commented 5 years ago

Which file is the scg table, and where does it come from? From what I understand this is associated with the COG-based evaluation, right?

At @alneberg's suggestion I moved away from the COG-based evaluation and am now using CheckM. Is it possible to use the output from CheckM to refine clusters?

chrisquince commented 5 years ago

OK, I will have a go at modifying the refine program to take CheckM input; it may take a couple of days though.

alneberg commented 5 years ago

That would be amazing @chrisquince!

chrisquince commented 5 years ago

Could someone give me some example CheckM output from a run with CONCOCT files too?

franciscozorrilla commented 5 years ago

Sure, I opened issue #240 to show the evaluation of CONCOCT-test-data-0.3.2 using CheckM. I will now upload the actual output folder with all the files generated. Thanks!

alneberg commented 5 years ago

Thank you @franciscozorrilla!

franciscozorrilla commented 5 years ago

@alneberg No prob! Let me know if you would like any other files/info.

Also, I ran CheckM on the above-mentioned CONCOCT run (not the test data), and it looks like many bins have > 100% contamination, verifying that the default cluster parameter value (-c 400) for CONCOCT was indeed too low. Here are the CheckM results if anyone is interested (summary statistics are at the very end of the file):

CheckMout.txt