I took a closer look, and I suspect that my method for generating the concoct_coverage.table only works if the jgi_summarize_bam_contig_depths depth files are generated from sorted BAM files that were mapped against the cut-up contigs. This is the only difference I can identify between my successful CONCOCT runs and the attempt mentioned above. Do you think this is the root cause of the error in the original post?
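For context, a conversion of this kind could in principle be as simple as the sketch below. It assumes the default jgi_summarize_bam_contig_depths column layout (contigName, contigLen, totalAvgDepth, then a mean-depth column and a depth-variance column per BAM), so with three BAMs the mean-depth columns are 4, 6, and 8; depth.txt is a placeholder name:

# sketch only: keep contigName plus the per-sample mean-depth columns
cut -f1,4,6,8 depth.txt > concoct_coverage.table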
Before incorporating MaxBin2, MetaBAT2, and metaWRAP (bin refinement + reassembly modules) into my pipeline, I had been using kallisto to cross-map samples, as you suggested a while back (#224). Now that I need sorted BAM files for MaxBin2 and MetaBAT2 anyway, the optimal solution seems to be to generate the CONCOCT input tables from those same sorted BAM files, to avoid running each mapping step twice. I previously resorted to this double-mapping option for medium-sized datasets, although it is a slightly lazy and sub-optimal solution on my part; I would rather have all binners use input files generated from the same mapping operations.
I could theoretically avoid double-mapping by storing the sorted BAM files mapped against the original contigs and using the concoct_coverage_table.py script with a bedfile based on the cut-up contigs to generate the desired CONCOCT input table. However, this option seems feasible only for small datasets: I am currently working with 246 paired-end WMGS samples from Tara Oceans, and with all-vs-all cross-mapping I would need ~1 petabyte to store 246^2 = 60516 BAM files.
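(For reference, the cut-up contigs and the matching bedfile used below come from CONCOCT's standard preprocessing step; original_contigs.fa is a placeholder for the original assembly:)

cut_up_fasta.py original_contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa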
As I was writing this post, I came up with a potential solution: generate an intermediate coverage table for each of the 246^2 mapping operations at the end of each individual job, throw away the BAM files, and finally join all the individual coverage tables corresponding to the same focal sample.
A small test with the same 3 samples as in the original post:
concoct_coverage_table.py contigs_10K.bed ERR599120.sort > coverage_table_ERR599120.tsv
concoct_coverage_table.py contigs_10K.bed ERR599121.sort > coverage_table_ERR599121.tsv
concoct_coverage_table.py contigs_10K.bed ERR599122.sort > coverage_table_ERR599122.tsv
paste coverage_table_ERR59912* > raw_master_covtable_coverage_table_ERR599120.tsv
$ less raw_master_covtable_coverage_table_ERR599120.tsv | cut -f1,2,4,6 | head
contig cov_mean_sample_ERR599120 cov_mean_sample_ERR599121 cov_mean_sample_ERR599122
k119_371504-flag=1-multi=2.0000-len=322.concoct_part_0 3.137 2.050 1.370
k119_451110-flag=1-multi=2.0000-len=321.concoct_part_0 1.885 0.000 0.000
k119_145948-flag=1-multi=1.0000-len=317.concoct_part_0 1.274 1.606 5.498
k119_530712-flag=1-multi=1.0000-len=391.concoct_part_0 2.660 3.455 1.026
k119_53072-flag=1-multi=1.0000-len=313.concoct_part_0 3.712 7.319 8.450
k119_0-flag=1-multi=2.0000-len=304.concoct_part_0 1.993 0.385 0.000
k119_610314-flag=1-multi=2.0000-len=303.concoct_part_0 3.231 1.851 3.198
k119_318432-flag=1-multi=2.0000-len=372.concoct_part_0 4.728 0.425 2.449
k119_119412-flag=1-multi=2.0000-len=368.concoct_part_0 6.486 0.533 3.413
Dropping the duplicate contig-name columns (3 and 5) that paste introduces, keeping column 1 plus the three cov_mean columns:

less raw_master_covtable_coverage_table_ERR599120.tsv | cut -f1,2,4,6 > master_covtable_coverage_table_ERR599120.tsv
$ concoct --composition_file contigs_10K.fa --coverage_file master_covtable_coverage_table_ERR599120.tsv -b test/
WARNING:root:CONCOCT is running in single threaded mode. Please, consider adjusting the --threads parameter.
Up and running. Check /scratch/zorrilla/test/test/log.txt for progress
69547 2174 1
Setting 1 OMP threads
Generate input data
0,-3816556.871198,62412.992350
1,-3779115.587829,37441.283369
...
This seems to be working! For the full dataset, the per-sample tables and the final join should generalize along the lines sketched below. I will close this issue for now, but please let me know if you have any additional insights, comments, or criticisms.
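A rough sketch of that generalization (illustrative only; the sample list and file names are placeholders):

# generate one two-column (contig, cov_mean) table per sorted BAM;
# each real job can delete its BAM once the table is written
for s in ERR599120 ERR599121 ERR599122; do
    concoct_coverage_table.py contigs_10K.bed ${s}.sort > coverage_table_${s}.tsv
done
# after paste, the contig name repeats in every odd-numbered column,
# so keep column 1 plus the even-numbered coverage columns
paste coverage_table_*.tsv \
    | awk 'BEGIN{FS=OFS="\t"} {printf "%s",$1; for(i=2;i<=NF;i+=2) printf "%s%s",OFS,$i; print ""}' \
    > master_coverage_table.tsv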
Best, Francisco
Hello @alneberg,
I am trying to convert some jgi_summarize_bam_contig_depths output files (from MetaBAT2) into a CONCOCT coverage table. If I understand correctly this should be possible, but I am running into problems. Here is what I have done:
$ head concoct_coverage.table
contigName ERR599120.sort ERR599121.sort ERR599122.sort
k119_371504-flag=1-multi=2.0000-len=322 4.19767 1.23256 1.47093
k119_451110-flag=1-multi=2.0000-len=321 2.56725 0 0
k119_145948-flag=1-multi=1.0000-len=317 1.55689 1.61677 2.56886
k119_530712-flag=1-multi=1.0000-len=391 2.07054 1.27801 0.925311
k119_53072-flag=1-multi=1.0000-len=313 2.45399 3.47239 4.76687
$ concoct --composition_file contigs_10K.fa --coverage_file concoct_coverage.table -b test/
WARNING:root:CONCOCT is running in single threaded mode. Please, consider adjusting the --threads parameter.
Up and running. Check /scratch/zorrilla/test/test/log.txt for progress
Traceback (most recent call last):
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/bin/concoct", line 90, in <module>
    results = main(args)
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/bin/concoct", line 40, in main
    args.seed
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/lib/python3.6/site-packages/concoct/transform.py", line 5, in perform_pca
    pca_object = PCA(n_components=nc, random_state=seed).fit(d)
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 344, in fit
    self._fit(X)
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 391, in _fit
    copy=self.copy)
  File "/g/scb2/patil/zorrilla/conda/envs/metabagpipesFinal/lib/python3.6/site-packages/sklearn/utils/validation.py", line 586, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 140)) while a minimum of 1 is required.