BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

Converting jgi_summarize_bam_contig_depths output files into the concoct_coverage table #286

Closed franciscozorrilla closed 4 years ago

franciscozorrilla commented 4 years ago

Hello @alneberg ,

I am trying to convert some jgi_summarize_bam_contig_depths (from metabat2) output files into a concoct_coverage table. If I understand correctly this should be possible, but I am running into problems. Here is what I have done:

  1. Chose ERR599120 as the focal sample
  2. Obtained individual depth files for 3 samples (ERR599120, ERR599121, ERR599122) by running jgi_summarize_bam_contig_depths on the corresponding sorted bam files:
    $ head ERR599120.depth
    contigName  contigLen   totalAvgDepth   ERR599120.sort  ERR599120.sort-var
    k119_371504-flag=1-multi=2.0000-len=322 322 4.19767 4.19767 1.63321
    k119_451110-flag=1-multi=2.0000-len=321 321 2.56725 2.56725 0.705745
    k119_145948-flag=1-multi=1.0000-len=317 317 1.55689 1.55689 0.248251
  3. Extract column 4 from each individual depth file, generate a rownames file containing the contig IDs, and paste these all together to generate concoct_coverage.table:
    
    for depth in *.depth;do less $depth|cut -f4 > $depth.col;done
    less ERR599120.depth|cut -f1 > rownames.depth
    paste rownames.depth *.depth.col > concoct_coverage.table

$ head concoct_coverage.table
contigName  ERR599120.sort  ERR599121.sort  ERR599122.sort
k119_371504-flag=1-multi=2.0000-len=322 4.19767 1.23256 1.47093
k119_451110-flag=1-multi=2.0000-len=321 2.56725 0   0
k119_145948-flag=1-multi=1.0000-len=317 1.55689 1.61677 2.56886
k119_530712-flag=1-multi=1.0000-len=391 2.07054 1.27801 0.925311
k119_53072-flag=1-multi=1.0000-len=313  2.45399 3.47239 4.76687
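
For reference, here is an equivalent (and slightly more defensive) way to assemble this table with pandas, joining the per-sample depth files on contigName instead of relying on identical row order. This is just a sketch of what the cut/paste commands above do, using the same file naming:

```python
# Sketch: build a coverage table from per-sample jgi_summarize_bam_contig_depths
# output, joining on contigName rather than assuming identical row order.
import glob
import pandas as pd

tables = []
for path in sorted(glob.glob("*.depth")):
    df = pd.read_csv(path, sep="\t")
    # after setting contigName as index, column 2 is the per-sample mean depth
    # (contigLen, totalAvgDepth, <sample>.sort, <sample>.sort-var)
    tables.append(df.set_index("contigName").iloc[:, [2]])

coverage = pd.concat(tables, axis=1, join="outer").fillna(0)
coverage.index.name = "contigName"
coverage.to_csv("concoct_coverage.table", sep="\t")
```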

4. Run CONCOCT:

$ concoct --composition_file contigs_10K.fa --coverage_file concoct_coverage.table -b test/
WARNING:root:CONCOCT is running in single threaded mode. Please, consider adjusting the --threads parameter.
Up and running. Check /scratch/zorrilla/test/test/log.txt for progress
Traceback (most recent call last):
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/bin/concoct", line 90, in <module>
    results = main(args)
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/bin/concoct", line 40, in main
    args.seed
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/lib/python3.6/site-packages/concoct/transform.py", line 5, in perform_pca
    pca_object = PCA(n_components=nc, random_state=seed).fit(d)
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 344, in fit
    self._fit(X)
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 391, in _fit
    copy=self.copy)
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/lib/python3.6/site-packages/sklearn/utils/validation.py", line 586, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 140)) while a minimum of 1 is required.


I am aware that this error message was experienced by users whose contig IDs were made up exclusively of integers, and I even tried removing all the numbers from the contig IDs, but I still get the same error.

Any idea what is going wrong? I feel like it's probably a formatting issue, but I have compared this table with previous coverage tables that ran successfully and I cannot find any noticeable differences ...

Best,
Francisco
franciscozorrilla commented 4 years ago

I took a closer look, and I suspect that my method for generating the concoct_coverage.table would only work if the jgi_summarize_bam_contig_depths depth files were generated from sorted bam files mapped against the cut-up contigs. This is the only difference I can identify between my successful CONCOCT runs and the attempt described above. Do you think this is the root cause of the error mentioned in the original post?
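
One way to verify this would be to compare the contig IDs in the cut-up composition FASTA against those in the coverage table: if the two sets do not overlap, CONCOCT would be left with no contigs after joining the two inputs, which could explain the shape=(0, 140) error above. A minimal Python sketch, using the file names from the original post:

```python
# Sketch: count contig IDs shared between the cut-up composition FASTA
# and the coverage table built from the depth files.
import pandas as pd

fasta_ids = set()
with open("contigs_10K.fa") as fasta:
    for line in fasta:
        if line.startswith(">"):
            fasta_ids.add(line[1:].split()[0])

cov_ids = set(pd.read_csv("concoct_coverage.table", sep="\t", index_col=0).index)

print(f"{len(fasta_ids)} composition IDs, {len(cov_ids)} coverage IDs, "
      f"{len(fasta_ids & cov_ids)} shared")
```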

Before incorporating maxbin2, metabat2, and metaWRAP (bin refinement + reassembly modules) into my pipeline, I had been using kallisto to cross-map samples, as you suggested a while back (#224). But now that I need to generate sorted bam files for maxbin2 and metabat2, the optimal solution seems to be to generate the CONCOCT input tables from those same sorted bam files, to avoid running each mapping step twice. I had previously resorted to this double-mapping option for medium-sized datasets, although it seems like a slightly lazy and sub-optimal solution on my part. I would rather have all binners use input files generated from the same mapping operations.

I could theoretically avoid double-mapping by storing the sorted bam files generated against the original contigs and using the concoct_coverage_table.py script with a bed file based on the cut-up contigs to generate the desired CONCOCT input table. However, this option seems feasible only for small datasets: I am currently working with 246 paired-end WMGS samples from Tara Oceans, and I would need ~1 petabyte to store 246^2 = 60516 bam files.
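
For transparency, the ~1 petabyte figure is just a back-of-the-envelope estimate, assuming an average sorted bam size rather than a measured one:

```python
# Back-of-the-envelope storage estimate; the per-BAM size is an assumption.
n_samples = 246
bam_gb = 17                      # assumed average size of one sorted BAM, in GB
n_bams = n_samples ** 2          # 60516 all-vs-all mappings
print(f"{n_bams} BAMs ~ {n_bams * bam_gb / 1e6:.2f} PB")  # ~1.03 PB
```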

As I was writing this post I came up with a potential solution: generate an intermediate/individual coverage table for each of the 246^2 mapping operations at the end of each individual job, throw away the bam files, and finally join all the individual coverage tables corresponding to the same focal sample.

Small test for the same 3 samples as in the original post:

concoct_coverage_table.py contigs_10K.bed ERR599120.sort > coverage_table_ERR599120.tsv
concoct_coverage_table.py contigs_10K.bed ERR599121.sort > coverage_table_ERR599121.tsv
concoct_coverage_table.py contigs_10K.bed ERR599122.sort > coverage_table_ERR599122.tsv

paste coverage_table_ERR59912* > raw_master_covtable_coverage_table_ERR599120.tsv

$ less raw_master_covtable_coverage_table_ERR599120.tsv|cut -f1,2,4,6|head
contig  cov_mean_sample_ERR599120   cov_mean_sample_ERR599121   cov_mean_sample_ERR599122
k119_371504-flag=1-multi=2.0000-len=322.concoct_part_0  3.137   2.050   1.370
k119_451110-flag=1-multi=2.0000-len=321.concoct_part_0  1.885   0.000   0.000
k119_145948-flag=1-multi=1.0000-len=317.concoct_part_0  1.274   1.606   5.498
k119_530712-flag=1-multi=1.0000-len=391.concoct_part_0  2.660   3.455   1.026
k119_53072-flag=1-multi=1.0000-len=313.concoct_part_0   3.712   7.319   8.450
k119_0-flag=1-multi=2.0000-len=304.concoct_part_0   1.993   0.385   0.000
k119_610314-flag=1-multi=2.0000-len=303.concoct_part_0  3.231   1.851   3.198
k119_318432-flag=1-multi=2.0000-len=372.concoct_part_0  4.728   0.425   2.449
k119_119412-flag=1-multi=2.0000-len=368.concoct_part_0  6.486   0.533   3.413

less raw_master_covtable_coverage_table_ERR599120.tsv|cut -f1,2,4,6 > master_covtable_coverage_table_ERR599120.tsv
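
Note that the paste + cut approach above assumes every per-sample coverage table lists the cut-up contigs in exactly the same order. A more defensive sketch, using the same file names as in the test, would join the tables on the contig column instead:

```python
# Sketch: merge per-sample concoct_coverage_table.py outputs on the
# contig column instead of relying on identical row order.
import glob
from functools import reduce
import pandas as pd

tables = [
    pd.read_csv(path, sep="\t", index_col="contig")
    for path in sorted(glob.glob("coverage_table_ERR59912*.tsv"))
]
master = reduce(lambda a, b: a.join(b, how="outer"), tables).fillna(0)
master.to_csv("master_covtable_coverage_table_ERR599120.tsv", sep="\t")
```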

$ concoct --composition_file contigs_10K.fa --coverage_file master_covtable_coverage_table_ERR599120.tsv -b test/
WARNING:root:CONCOCT is running in single threaded mode. Please, consider adjusting the --threads parameter.
Up and running. Check /scratch/zorrilla/test/test/log.txt for progress
69547 2174 1
Setting 1 OMP threads
Generate input data
0,-3816556.871198,62412.992350
1,-3779115.587829,37441.283369
...

This seems to be working! I will close this issue for now but please let me know if you have any additional insights, comments, criticisms, etc.

Best, Francisco