BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

ERROR with concoct_coverage_table.py #question #error #291

Open mutantjoo0 opened 4 years ago

mutantjoo0 commented 4 years ago

Hello CONCOCT team (@andand @alneberg @binnisb @inodb )

I am getting an error while running concoct_coverage_table.py. I am using CONCOCT in a miniconda environment, as follows:

# packages in environment at /mnt/home/leejooy5/miniconda3:
#
# Name                    Version                   Build  Channel
concoct                   1.1.0            py37h88e4a8a_0    bioconda

My contig FASTA files were generated with MEGAHIT, and the mapped, sorted, and indexed BAM files were generated with BBMap.

First, following the workflow described in Basic Usage, I cut up the contigs by running cut_up_fasta.py -c 10000 -m -b ./concoct_bins/SAMPLE_10k.bed > ./concoct_bins/SAMPLE_10k.fa ./input_megahit_contigs/SAMPLE.final.contigs.fa. Then I got an error in the second step, running concoct_coverage_table.py. I tested multiple samples and the same error occurred every time. I assume that CONCOCT does not recognize the columns in the BED file produced by the first step, cut_up_fasta.py. An example of the error follows:

$ concoct_coverage_table.py ./Std_10k.bed ../bbmap_sorted_indexed_BAM_BAI/Std_megahit_BBmapped_sorted.bam > Std_cov_tab.tsv

[W::hts_idx_load2] The index file is older than the data file: ../bbmap_sorted_indexed_BAM_BAI/DEB_megahit_BBmapped_sorted.bam.bai
Errors in BED line 'k141_528    0      303      k141_528.concoct_part_0'
Errors in BED line 'k141_1406   0      385      k141_1406.concoct_part_0'
Errors in BED line 'k141_3681   0      330      k141_3681.concoct_part_0'
.
.
Traceback (most recent call last):
  File "/mnt/home/leejooy5/miniconda3/bin/concoct_coverage_table.py", line 91, in <module>
    generate_input_table(args.bedfile, args.bamfiles, samplenames=samplenames)
  File "/mnt/home/leejooy5/miniconda3/bin/concoct_coverage_table.py", line 61, in generate_input_table
    df = pd.read_table(fh, header=None)
  File "/mnt/home/leejooy5/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/mnt/home/leejooy5/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/mnt/home/leejooy5/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/mnt/home/leejooy5/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/mnt/home/leejooy5/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 532, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

I wonder at which step I can fix this issue, for example by running cut_up_fasta.py with a larger or smaller -c setting. Please enlighten me. Thank you in advance for your time and support.

added on June 30, 2020


I tried a few different approaches to fix this issue: 1) using a different parameter for step 1 (cut_up_fasta.py with -c changed from 10000 to 20000); 2) creating a new concoct environment and installing concoct there; 3) installing the optional dependencies (bedtools, picard, samtools, bowtie2, GNU parallel, pysam) in the concoct environment.

The following are the corresponding results from each trial. Please note that I only copied the commands and the Traceback parts that show errors.

1) errors from running in the new concoct environment

$ concoct_coverage_table.py ./DEB_10k.bed ../bbmap_sorted_indexed_BAM_BAI/DEB_megahit_BBmapped_sorted.bam > DEB_cov_tab.tsv

Traceback (most recent call last):
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/bin/concoct_coverage_table.py", line 77, in <module>
    generate_input_table(args.bedfile, args.bamfiles, samplenames=samplenames)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/bin/concoct_coverage_table.py", line 48, in generate_input_table
    df = pd.read_table(fh, header=None)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 532, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

2) errors from a different parameter set in step 1

$ concoct_coverage_table.py ./DEB_20k.bed ../bbmap_sorted_indexed_BAM_BAI/DEB_megahit_BBmapped_sorted.bam > DEB_20k_cov_tab.tsv

Traceback (most recent call last):
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/bin/concoct_coverage_table.py", line 77, in <module>
    generate_input_table(args.bedfile, args.bamfiles, samplenames=samplenames)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/bin/concoct_coverage_table.py", line 48, in generate_input_table
    df = pd.read_table(fh, header=None)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 532, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

3) errors after installing optional dependencies

$ concoct_coverage_table.py DEB_10k.bed ./bbmap_sorted_indexed_BAM_BAI/DEB_megahit_BBmapped_sorted.bam > DEB_cov.txt

samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/bin/concoct_coverage_table.py", line 77, in <module>
    generate_input_table(args.bedfile, args.bamfiles, samplenames=samplenames)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct_env/bin/concoct_coverage_table.py", line 28, in generate_input_table
    sys.stderr.write(out)
TypeError: write() argument must be str, not bytes

As you can see, running concoct_coverage_table.py after installing the optional dependencies reduced the number of errors. However, there are still errors at line 77 and line 28. Please help me understand and fix this problem.
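The TypeError in trial 3 looks like a Python 3 detail rather than a separate problem: output captured from a child process arrives as bytes, while sys.stderr.write only accepts str. The underlying failure is more likely the samtools shared-library error printed just before it. Below is a minimal sketch of the pattern, assuming a hypothetical helper named run_and_report (not part of CONCOCT) and using the Python interpreter as a stand-in for the failing samtools call.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd, forward its stderr as text, and return (returncode, stdout_text).

    Hypothetical helper for illustration only; this is not CONCOCT's code.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()  # both values are bytes under Python 3
    if err:
        # Decode before writing: sys.stderr.write(err) would raise
        # "TypeError: write() argument must be str, not bytes".
        sys.stderr.write(err.decode("utf-8", errors="replace"))
    return proc.returncode, out.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in for a failing external tool: a child that writes to stderr.
    code, text = run_and_report(
        [sys.executable, "-c", "import sys; sys.stderr.write('boom\\n'); sys.exit(1)"]
    )
    print("exit code:", code)
```

With the decode in place, the real error message from the external tool reaches the terminal instead of being masked by the TypeError.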

Stay healthy, Joo-Young

franciscozorrilla commented 4 years ago

Hi Joo-Young,

I believe that the problems you are experiencing are due to your misuse of the cut_up_fasta.py script:

$ cut_up_fasta.py -h
usage: cut_up_fasta.py [-h] [-c CHUNK_SIZE] [-o OVERLAP_SIZE] [-m]
                       [-b BEDFILE]
                       contigs [contigs ...]

Cut up fasta file in non-overlapping or overlapping parts of equal length.

Optionally creates a BED-file where the cutup contigs are specified in terms
of the original contigs. This can be used as input to concoct_coverage_table.py.

positional arguments:
  contigs               Fasta files with contigs

optional arguments:
  -h, --help            show this help message and exit
  -c CHUNK_SIZE, --chunk_size CHUNK_SIZE
                        Chunk size
  -o OVERLAP_SIZE, --overlap_size OVERLAP_SIZE
                        Overlap size
  -m, --merge_last      Concatenate final part to last contig
  -b BEDFILE, --bedfile BEDFILE
                        BEDfile to be created with exact regions of the
                        original contigs corresponding to the newly created
                        contigs

In your code you wrote:

cut_up_fasta.py -c 10000 -m -b ./concoct_bins/SAMPLE_10k.bed > ./concoct_bins/SAMPLE_10k.fa ./input_megahit_contigs/SAMPLE.final.contigs.fa

However, your input contigs ./input_megahit_contigs/SAMPLE.final.contigs.fa are on the right-hand side of the output redirect (>). In bash terms, this means that the output generated by the left-hand side will be stored in the file specified on the right-hand side. The correct way to write this would be:

cut_up_fasta.py -c 10000 -o 0 -m ./input_megahit_contigs/SAMPLE.final.contigs.fa -b ./concoct_bins/SAMPLE_10k.bed  >  ./concoct_bins/SAMPLE_10k.fa 

Note that the above command should generate both the BED file (SAMPLE_10k.bed) and the cut up assembly (SAMPLE_10k.fa). Then generate the coverage table using the BED file + the sorted BAM file:
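To make the relationship between the two outputs concrete, here is a rough sketch of how each BED row relates to a cut-up contig. It assumes zero overlap and the -m (merge last) behaviour described in the help text above. The name bed_rows_for_contig is hypothetical, and the real cut_up_fasta.py may differ in edge cases.

```python
def bed_rows_for_contig(contig_id, length, chunk_size=10000):
    """Return BED-style rows (contig, start, end, part_name) for one contig.

    Illustrative only: assumes no overlap and that the final remainder is
    merged into the last chunk, as with the -m flag.
    """
    # A contig shorter than two chunks stays in one piece.
    n_parts = max(1, length // chunk_size)
    rows = []
    for i in range(n_parts):
        start = i * chunk_size
        # The last part absorbs whatever is left of the contig.
        end = length if i == n_parts - 1 else start + chunk_size
        part_name = "{}.concoct_part_{}".format(contig_id, i)
        rows.append((contig_id, start, end, part_name))
    return rows


if __name__ == "__main__":
    for row in bed_rows_for_contig("k141_528", 303):
        print("\t".join(str(field) for field in row))
    for row in bed_rows_for_contig("example_contig", 25000):
        print("\t".join(str(field) for field in row))
```

The first call reproduces the shape of the BED lines quoted in the error output (a short contig kept whole as concoct_part_0); example_contig is a made-up name.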

concoct_coverage_table.py SAMPLE_10k.bed sorted.bam > coverage_table.tsv

Now we can run CONCOCT:

concoct --coverage_file coverage_table.tsv \
    --composition_file SAMPLE_10k.fa \
    -b sample_ID 

Note that this approach assumes that your sorted BAM file was generated by mapping your short reads against the ORIGINAL assembly, not the cut-up assembly. Please also have a look at the CONCOCT help text to see additional useful parameters such as -c and -t:

$ concoct -h
usage: concoct [-h] [--coverage_file COVERAGE_FILE]
               [--composition_file COMPOSITION_FILE] [-c CLUSTERS]
               [-k KMER_LENGTH] [-t THREADS] [-l LENGTH_THRESHOLD]
               [-r READ_LENGTH] [--total_percentage_pca TOTAL_PERCENTAGE_PCA]
               [-b BASENAME] [-s SEED] [-i ITERATIONS]
               [--no_cov_normalization] [--no_total_coverage]
               [--no_original_data] [-o] [-d] [-v]

optional arguments:
  -h, --help            show this help message and exit
  --coverage_file COVERAGE_FILE
                        specify the coverage file, containing a table where
                        each row correspond to a contig, and each column
                        correspond to a sample. The values are the average
                        coverage for this contig in that sample. All values
                        are separated with tabs.
  --composition_file COMPOSITION_FILE
                        specify the composition file, containing sequences in
                        fasta format. It is named the composition file since
                        it is used to calculate the kmer composition (the
                        genomic signature) of each contig.
  -c CLUSTERS, --clusters CLUSTERS
                        specify maximal number of clusters for VGMM, default
                        400.
  -k KMER_LENGTH, --kmer_length KMER_LENGTH
                        specify kmer length, default 4.
  -t THREADS, --threads THREADS
                        Number of threads to use
  -l LENGTH_THRESHOLD, --length_threshold LENGTH_THRESHOLD
                        specify the sequence length threshold, contigs shorter
                        than this value will not be included. Defaults to
                        1000.
  -r READ_LENGTH, --read_length READ_LENGTH
                        specify read length for coverage, default 100
  --total_percentage_pca TOTAL_PERCENTAGE_PCA
                        The percentage of variance explained by the principal
                        components for the combined data.
  -b BASENAME, --basename BASENAME
                        Specify the basename for files or directory where
                        outputwill be placed. Path to existing directory or
                        basenamewith a trailing '/' will be interpreted as a
                        directory.If not provided, current directory will be
                        used.
  -s SEED, --seed SEED  Specify an integer to use as seed for clustering. 0
                        gives a random seed, 1 is the default seed and any
                        other positive integer can be used. Other values give
                        ArgumentTypeError.
  -i ITERATIONS, --iterations ITERATIONS
                        Specify maximum number of iterations for the VBGMM.
                        Default value is 500
  --no_cov_normalization
                        By default the coverage is normalized with regards to
                        samples, then normalized with regards of contigs and
                        finally log transformed. By setting this flag you skip
                        the normalization and only do log transorm of the
                        coverage.
  --no_total_coverage   By default, the total coverage is added as a new
                        column in the coverage data matrix, independently of
                        coverage normalization but previous to log
                        transformation. Use this tag to escape this behaviour.
  --no_original_data    By default the original data is saved to disk. For big
                        datasets, especially when a large k is used for
                        compositional data, this file can become very large.
                        Use this tag if you don't want to save the original
                        data.
  -o, --converge_out    Write convergence info to files.
  -d, --debug           Debug parameters.
  -v, --version         show program's version number and exit

After running CONCOCT you will probably be interested in looking at the scripts merge_cutup_clustering.py and extract_fasta_bins.py to extract your draft bins.

Hope this helps! Francisco

mutantjoo0 commented 4 years ago

Hi Francisco,

Thank you for your help. I started over from step 1 and noticed changes between my previous outputs and the outputs from the suggested command. However, the error still occurs, as follows.

(concoct) -bash-4.2$ concoct_coverage_table.py GR25_c10k.bed ./bbmap_sorted_indexed_BAM_BAI/GR25_megahit_BBmapped_sorted.bam > GR25_c10k_covtab.tsv
[W::hts_idx_load2] The index file is older than the data file: ./bbmap_sorted_indexed_BAM_BAI/GR25_megahit_BBmapped_sorted.bam.bai
Errors in BED line 'k107_12810  0       301     k107_12810.concoct_part_0'
Errors in BED line 'k107_5125   0       231     k107_5125.concoct_part_0'
Errors in BED line 'k107_10248  0       404     k107_10248.concoct_part_0'
Errors in BED line 'k107_23058  0       328     k107_23058.concoct_part_0'
Errors in BED line 'k107_38430  0       346     k107_38430.concoct_part_0'
.
.
.
Errors in BED line 'k107_23054  0       784     k107_23054.concoct_part_0'
Errors in BED line 'k107_23055  0       495     k107_23055.concoct_part_0'
Errors in BED line 'k107_23056  0       391     k107_23056.concoct_part_0'
Errors in BED line 'k107_23057  0       753     k107_23057.concoct_part_0'
Traceback (most recent call last):
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/bin/concoct_coverage_table.py", line 91, in <module>
    generate_input_table(args.bedfile, args.bamfiles, samplenames=samplenames)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/bin/concoct_coverage_table.py", line 61, in generate_input_table
    df = pd.read_table(fh, header=None)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/pandas/io/parsers.py", line 1891, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 532, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

The line numbers in the Traceback are partly different from those in the previous errors, but the descriptions still look the same. Additionally, I noticed a warning, [W::hts_idx_load2] The index file is older than the data file: ./bbmap_sorted_indexed_BAM_BAI/GR25_megahit_BBmapped_sorted.bam.bai, that appeared before the countless Errors began. I wonder if this warning means that concoct does not accept BBMap-generated BAM files as input, or that my BAM files have another issue. For example, my input BAM file looks like this:

(concoct) -bash-4.2$ samtools view bbmap_sorted_indexed_BAM_BAI/GR25_megahit_BBmapped_sorted.bam | head -n 3
K00392:163:H2H55BBXY:8:2106:22029:7169  99      k107_12810 flag=1 multi=4.0000 len=301  1       45      150=    =       129     278    ATTCCATATTTTGAACACTTACTATCACATTTTTATAATGCTCTATATTTTTCTCAGCTTCTGCTATGGTTTTCTTTTGTTTGTCTGTTAGGGTCGTAGACAACAAGATTTGCTCTACCTTCTTCTCTTCGCCTCGCTTGTACGTATCAA   AAAFFJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJFJJJJJJJJJJJJJFJJJJAJFFJJJJFJJJJJJJFJJJJJJJJJJFJFJJJFJJJJJJJJJJJJJAJJJFJJJJJJJ  NM:i:0  AM:i:45
K00392:163:H2H55BBXY:8:2106:22658:7205  99      k107_12810 flag=1 multi=4.0000 len=301  1       45      150=    =       129     278    ATTCCATATTTTGAACACTTACTATCACATTTTTATAATGCTCTATATTTTTCTCAGCTTCTGCTATGGTTTTCTTTTGTTTGTCTGTTAGGGTCGTAGACAACAAGATTTGCTCTACCTTCTTCTCTTCGCCTCGCTTGTACGTATCAA   AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ  NM:i:0  AM:i:45
K00392:163:H2H55BBXY:8:2205:21582:45924 145     k107_12810 flag=1 multi=4.0000 len=301  6       45      150=    k107_37816 flag=0 multi=5.9522 len=1152 771     0       ATATTTTGAACACTTACTATCACATTTTTATAATGCTCTATATTTTTCTCAGCTTCTGCTATGGTTTTCTTTTGTTTGTCTGTTAGGGTCGTAGACAACAAGATTTGCTCTACCTTCTTCTCTTCGCCTCGCTTGTACGTATCAAGTGTA  JJJJJJJF<<JJJJJJJJJJFAJJJJJJJJFAAJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJFJJJJFAFAA  NM:i:0  AM:i:45
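
In the samtools view output above, the reference name column reads "k107_12810 flag=1 multi=4.0000 len=301", whereas the BED lines use only "k107_12810". Whether that mismatch is what triggers the "Errors in BED line" messages is a guess that this thread does not confirm, but it is cheap to check. A minimal sketch, assuming the @SQ header lines have been saved as text (for example from samtools view -H); both function names are hypothetical.

```python
def reference_names(sam_header_lines):
    """Collect the SN: values from @SQ header lines (tab-separated fields)."""
    names = []
    for line in sam_header_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def bed_contigs_missing_from_bam(bed_lines, sam_header_lines):
    """Return BED contig names (column 1) that are not BAM reference names."""
    known = set(reference_names(sam_header_lines))
    missing = []
    for line in bed_lines:
        if line.strip():
            contig = line.split("\t")[0]
            if contig not in known:
                missing.append(contig)
    return missing


if __name__ == "__main__":
    # Values copied from the output in this post; the header line itself is
    # reconstructed for illustration.
    header = ["@SQ\tSN:k107_12810 flag=1 multi=4.0000 len=301\tLN:301\n"]
    bed = ["k107_12810\t0\t301\tk107_12810.concoct_part_0\n"]
    print(bed_contigs_missing_from_bam(bed, header))
```

If the list comes back non-empty for most contigs, the names in the BED file and the BAM header disagree, and one of the two would need to be regenerated with matching names.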

Also, I am wondering if you have used coverage files generated with other tools such as BBMap (pileup.sh) and MetaBAT2 (jgi_summarize_bam_contig_depths), since I already have two versions of coverage depth files generated with BBMap and MetaBAT2, as follows:

(concoct) -bash-4.2$ wc -lmw cov_depth_input-bbmap_sorted_indexed_BAM-202006*/GR25*
  54591  436725 3651726 cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_megahit_bbmap_sorted_depth.txt
  54591  764271 4907010 cov_depth_input-bbmap_sorted_indexed_BAM-20200623_BBMAP_pileup/GR25_cov.txt
 109182 1200996 8558736 total

(concoct) -bash-4.2$ head -n 3 cov_depth_input-bbmap_sorted_indexed_BAM-202006*/GR25*
==> cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_megahit_bbmap_sorted_depth.txt <==
contigName      contigLen       totalAvgDepth   GR25_megahit_BBmapped_sorted.bam        GR25_megahit_BBmapped_sorted.bam-var
k107_12810 flag=1 multi=4.0000 len=301  301     5.7351  5.7351  1.98269
k107_5125 flag=0 multi=0.8468 len=231   231     10.9012 10.9012 3.99012

==> cov_depth_input-bbmap_sorted_indexed_BAM-20200623_BBMAP_pileup/GR25_cov.txt <==
#ID     Avg_fold        Length  Ref_GC  Covered_percent Covered_bases   Plus_reads      Minus_reads     Read_GC Median_fold     Std_Dev
k107_12810 flag=1 multi=4.0000 len=301  5.0797  301     0.0000  100.0000        301     7       4       0.3603  4       1.70
k107_5125 flag=0 multi=0.8468 len=231   25.8571 231     0.0000  100.0000        231     24      22      0.4298  27      12.89

I tried running concoct --coverage_file --composition_file with the two different coverage files, and concoct complained again, as shown below.

#GR25, 1st try:
(concoct) -bash-4.2$ concoct -t 10 -l 1500 --coverge_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_megahit_bbmap_sorted_depth.txt --composition_fil GR25_c10k.fa -b ./GR25/GR25_j
usage: concoct [-h] [--coverage_file COVERAGE_FILE]
               [--composition_file COMPOSITION_FILE] [-c CLUSTERS]
               [-k KMER_LENGTH] [-t THREADS] [-l LENGTH_THRESHOLD]
               [-r READ_LENGTH] [--total_percentage_pca TOTAL_PERCENTAGE_PCA]
               [-b BASENAME] [-s SEED] [-i ITERATIONS]
               [--no_cov_normalization] [--no_total_coverage]
               [--no_original_data] [-o] [-d] [-v]
concoct: error: unrecognized arguments: --coverge_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_megahit_bbmap_sorted_depth.txt

#convert txt to tsv cp *txt *tsv

#GR25, 2nd try:
(concoct) -bash-4.2$ concoct -t 10 -l 1500 --coverge_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_cov.tsv --composition_file GR25_c10k.fa -b ./GR25/GR25_j
usage: concoct [-h] [--coverage_file COVERAGE_FILE]
               [--composition_file COMPOSITION_FILE] [-c CLUSTERS]
               [-k KMER_LENGTH] [-t THREADS] [-l LENGTH_THRESHOLD]
               [-r READ_LENGTH] [--total_percentage_pca TOTAL_PERCENTAGE_PCA]
               [-b BASENAME] [-s SEED] [-i ITERATIONS]
               [--no_cov_normalization] [--no_total_coverage]
               [--no_original_data] [-o] [-d] [-v]
concoct: error: unrecognized arguments: --coverge_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_cov.tsv

#GR25, 3rd try: 
(concoct) -bash-4.2$ concoct -t 10 -l 1500 --coverge_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200623_BBMAP_pileup/GR25_cov.txt --composition_file GR25_c10k.fa -b ./GR25/GR25_b
usage: concoct [-h] [--coverage_file COVERAGE_FILE]
               [--composition_file COMPOSITION_FILE] [-c CLUSTERS]
               [-k KMER_LENGTH] [-t THREADS] [-l LENGTH_THRESHOLD]
               [-r READ_LENGTH] [--total_percentage_pca TOTAL_PERCENTAGE_PCA]
               [-b BASENAME] [-s SEED] [-i ITERATIONS]
               [--no_cov_normalization] [--no_total_coverage]
               [--no_original_data] [-o] [-d] [-v]
concoct: error: unrecognized arguments: --coverge_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200623_BBMAP_pileup/GR25_cov.txt

#GR25, 4th try:
(concoct) -bash-4.2$ concoct -t 10 -l 1500 --coverge_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200623_BBMAP_pileup/GR25_cov.tsv --composition_file GR25_c10k.fa -b ./GR25/GR25_b
usage: concoct [-h] [--coverage_file COVERAGE_FILE]
               [--composition_file COMPOSITION_FILE] [-c CLUSTERS]
               [-k KMER_LENGTH] [-t THREADS] [-l LENGTH_THRESHOLD]
               [-r READ_LENGTH] [--total_percentage_pca TOTAL_PERCENTAGE_PCA]
               [-b BASENAME] [-s SEED] [-i ITERATIONS]
               [--no_cov_normalization] [--no_total_coverage]
               [--no_original_data] [-o] [-d] [-v]
concoct: error: unrecognized arguments: --coverge_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200623_BBMAP_pileup/GR25_cov.tsv

I assume that a concoct-generated coverage file must have a specific header format and a different number of columns. Could you help me modify the headers and columns in my coverage files? If you can provide an example of a concoct coverage file, it would be a great clue for me to move forward. Thank you again for your time and help!

Stay healthy, Joo-Young

franciscozorrilla commented 4 years ago

Hi Joo-Young,

Sorry to hear that you are having trouble with CONCOCT and your conda environments. I can also suggest that you try out KBase. This platform lets you interact with an HPCC more easily through a graphical user interface, giving you access to CONCOCT and many other bioinformatics tools without having to install or set up anything or write any code yourself. However, you do need to create a free account and upload your files to their servers. Although this can be prohibitive, I think it's still a good place to start playing around with these tools without getting bogged down in installation and troubleshooting details. There are other alternative platforms too, e.g. Galaxy.

I see spelling mistakes in the parameters in all four of your attempts to run CONCOCT. Particularly, you consistently misspelled the parameter --coverage_file as --coverge_file, thus the error you see in every attempt:

concoct: error: unrecognized arguments: --coverge_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_megahit_bbmap_sorted_depth.txt
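
A typo like this is easy to miss in a long command line. As a small illustration, the sketch below compares a typed flag against a few of the long options listed in the concoct help text and suggests the closest one. The helper suggest_flag and the 0.8 similarity cutoff are assumptions made for this example; concoct itself does not offer such suggestions.

```python
import difflib

# A subset of the long options shown in the `concoct -h` output above.
KNOWN_FLAGS = [
    "--coverage_file",
    "--composition_file",
    "--clusters",
    "--kmer_length",
    "--threads",
    "--length_threshold",
    "--read_length",
    "--basename",
]


def suggest_flag(typed, known=KNOWN_FLAGS):
    """Return the known flag most similar to `typed`, or None if none is close."""
    matches = difflib.get_close_matches(typed, known, n=1, cutoff=0.8)
    return matches[0] if matches else None


if __name__ == "__main__":
    for flag in ["--coverge_file", "--composition_fil", "--threads"]:
        print(flag, "->", suggest_flag(flag))
```

Running a check like this over the flags of a saved command before submitting a long job can catch misspellings early.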

I have never used BBMap for CONCOCT; I generally use bwa-mem for any mapping operations. If you want to use MetaBAT2's jgi_summarize_bam_contig_depths, or take a look at what a working coverage table looks like, I can refer you to post #286, particularly the second comment:

I took a closer look and I suspect that my method for generating the concoct_coverage.table would only work if the jgi_summarize_bam_contig_depths output depth files are generated on sorted bam files that are mapped against the cut up contigs.

As you can see in the post, if I have 3 samples then a coverage table may look like this:

$ less master_covtable_coverage_table_ERR599120.tsv|head
contig  cov_mean_sample_ERR599120   cov_mean_sample_ERR599121   cov_mean_sample_ERR599122
k119_371504-flag=1-multi=2.0000-len=322.concoct_part_0  3.137   2.050   1.370
k119_451110-flag=1-multi=2.0000-len=321.concoct_part_0  1.885   0.000   0.000

If you are only mapping against the focal sample then the coverage table would only have 2 columns: one with contig IDs (note that the .concoct_part_X extension in the IDs lets us know that these are the "cut-up" contigs that we want) and one with the corresponding coverage.
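
As a rough sketch of that layout, the function below reshapes jgi_summarize_bam_contig_depths-style lines into a table with one ID column and one mean-coverage column per sample, dropping contigLen, totalAvgDepth, and the "-var" columns. The function name and the cov_mean_sample_ prefix (taken from the example header above) are illustrative; whether CONCOCT requires that exact header is an assumption. As the quoted comment warns, this only makes sense when the depth file was computed against the cut-up contigs, so the ID in the demo is a made-up cut-up name.

```python
import csv


def depth_to_coverage_rows(depth_lines):
    """Convert depth-file lines into rows of [contig, per-sample mean depth...].

    Expects a header of the form:
    contigName, contigLen, totalAvgDepth, <sample>, <sample>-var, ...
    """
    reader = csv.reader(depth_lines, delimiter="\t")
    header = next(reader)
    # Keep per-sample mean columns only (index 3 onwards, skipping "-var").
    keep = [i for i, name in enumerate(header)
            if i >= 3 and not name.endswith("-var")]
    rows = [["contig"] + ["cov_mean_sample_" + header[i] for i in keep]]
    for fields in reader:
        if fields:
            rows.append([fields[0]] + [fields[i] for i in keep])
    return rows


if __name__ == "__main__":
    depth = [
        "contigName\tcontigLen\ttotalAvgDepth\tS1.bam\tS1.bam-var\n",
        "k107_12810.concoct_part_0\t301\t5.7351\t5.7351\t1.98269\n",
    ]
    for row in depth_to_coverage_rows(depth):
        print("\t".join(row))
```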

Best of luck! Francisco

mutantjoo0 commented 3 years ago

Hi Francisco,

Thank you for your kind support. I re-ran the commands without the typos. As you can see below, using a coverage file from either jgi_summarize_bam_contig_depths or pileup.sh, I got the same error.

using the coverage file from jgi_summarize_bam_contig_depths (MetaBAT2):

(concoct) -bash-4.2$ concoct -l 1500 --coverage_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_cov.tsv --composition_file GR25_c10k.fa -b ./concoct_output/GR25/
WARNING:root:CONCOCT is running in single threaded mode. Please, consider adjusting the --threads parameter.
Up and running. Check /mnt/ufs18/rs-002/Reguera_Kashefi_Lab/JYL/phylophlan_MAGs/concoct_bins/concoct_output/GR25/log.txt for progress
/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/concoct/input.py:115: RuntimeWarning: divide by zero encountered in log
  cov.loc[:,cov_range[0]:cov_range[1]])
Traceback (most recent call last):
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/bin/concoct", line 90, in <module>
    results = main(args)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/bin/concoct", line 40, in main
    args.seed
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/concoct/transform.py", line 5, in perform_pca
    pca_object = PCA(n_components=nc, random_state=seed).fit(d)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 351, in fit
    self._fit(X)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 398, in _fit
    ensure_2d=True, copy=self.copy)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/base.py", line 420, in _validate_data
    X = check_array(X, **check_params)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/utils/validation.py", line 654, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 141)) while a minimum of 1 is required.

using coverage file from pileup.sh (BBMap):

(concoct) -bash-4.2$ concoct -t 10 -l 1500 --coverage_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200623_BBMAP_pileup/GR25_cov.txt --composition_file GR25_c10k.fa -b ./concoct_output/GR25b/
Up and running. Check /mnt/ufs18/rs-002/Reguera_Kashefi_Lab/JYL/phylophlan_MAGs/concoct_bins/concoct_output/GR25b/log.txt for progress
/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/concoct/input.py:115: RuntimeWarning: divide by zero encountered in log
  cov.loc[:,cov_range[0]:cov_range[1]])
Traceback (most recent call last):
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/bin/concoct", line 90, in <module>
    results = main(args)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/bin/concoct", line 40, in main
    args.seed
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/concoct/transform.py", line 5, in perform_pca
    pca_object = PCA(n_components=nc, random_state=seed).fit(d)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 351, in fit
    self._fit(X)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 398, in _fit
    ensure_2d=True, copy=self.copy)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/base.py", line 420, in _validate_data
    X = check_array(X, **check_params)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/utils/validation.py", line 654, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 147)) while a minimum of 1 is required.

Then I tried modifying the coverage file from MetaBAT2 as described in your post (#286, https://github.com/BinPro/CONCOCT/issues/286) and ran concoct.

coverage files from metabat2:

(concoct) -bash-4.2$ head -n 5 ../cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_megahit_bbmap_sorted_depth.txt
contigName      contigLen       totalAvgDepth   GR25_megahit_BBmapped_sorted.bam        GR25_megahit_BBmapped_sorted.bam-var
k107_12810 flag=1 multi=4.0000 len=301  301     5.7351  5.7351  1.98269
k107_5125 flag=0 multi=0.8468 len=231   231     10.9012 10.9012 3.99012
k107_10248 flag=1 multi=1.0000 len=404  404     4.05512 4.05512 3.47521
k107_23058 flag=1 multi=2.0000 len=328  328     1.79213 1.79213 0.470674

generating new coverage file:

(concoct) -bash-4.2$ for depth in *depth.txt;do less $depth|cut -f4 > $depth.col;done
(concoct) -bash-4.2$ less GR25_megahit_bbmap_sorted_depth.txt|cut -f1 > GR25_rownames
(concoct) -bash-4.2$ paste GR25_rownames GR25_megahit_bbmap_sorted_depth.txt.col > GR25_concoct_cov.table

new coverage file:

(concoct) -bash-4.2$ head -n 5 ../cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_concoct_cov.table
contigName      GR25_megahit_BBmapped_sorted.bam
k107_12810 flag=1 multi=4.0000 len=301  5.7351
k107_5125 flag=0 multi=0.8468 len=231   10.9012
k107_10248 flag=1 multi=1.0000 len=404  4.05512
k107_23058 flag=1 multi=2.0000 len=328  1.79213

run concoct with new coverage file:

(concoct) -bash-4.2$ concoct -l 1500 -t 30 --coverage_file ../cov_depth_input-bbmap_sorted_indexed_BAM-20200617_jgi_summarize_bam_contigs_depth/GR25_concoct_cov.table --composition_file GR25_c10k.fa -b ./concoct_output/GR25_concoct
Up and running. Check /mnt/ufs18/rs-002/Reguera_Kashefi_Lab/JYL/phylophlan_MAGs/concoct_bins/concoct_output/GR25_concoct_log.txt for progress
Traceback (most recent call last):
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/bin/concoct", line 90, in <module>
    results = main(args)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/bin/concoct", line 40, in main
    args.seed
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/concoct/transform.py", line 5, in perform_pca
    pca_object = PCA(n_components=nc, random_state=seed).fit(d)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 351, in fit
    self._fit(X)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 398, in _fit
    ensure_2d=True, copy=self.copy)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/base.py", line 420, in _validate_data
    X = check_array(X, **check_params)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/mnt/home/leejooy5/miniconda3/envs/concoct/lib/python3.6/site-packages/sklearn/utils/validation.py", line 654, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 138)) while a minimum of 1 is required.

After all this, I still get the same error. I noticed you used one composition file and one coverage file for 3 samples. I wonder if that is required to use concoct, and how I can do that with multiple samples that have different numbers of contigs. Thank you for your help!
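
The shape=(0, 138) in the error suggests that no contigs were left once the two inputs were combined. One possible reason, offered as a guess rather than a confirmed diagnosis, is that the IDs in the coverage table ("k107_12810 flag=1 multi=4.0000 len=301") never match the IDs in the cut-up FASTA, which end in ".concoct_part_N". The -l 1500 threshold may also be discarding short contigs. A minimal sketch for checking the ID overlap, assuming that FASTA IDs are the first whitespace-delimited token of each header; all function names are hypothetical.

```python
def fasta_ids(fasta_lines):
    """Return the first whitespace-delimited token of every FASTA header."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def coverage_ids(table_lines):
    """Return first-column IDs of a tab-separated table, skipping the header."""
    ids = set()
    for line in list(table_lines)[1:]:
        if line.strip():
            ids.add(line.rstrip("\n").split("\t")[0])
    return ids


def shared_ids(fasta_lines, table_lines):
    """IDs present in both the composition FASTA and the coverage table."""
    return fasta_ids(fasta_lines) & coverage_ids(table_lines)


if __name__ == "__main__":
    fasta = [">k107_12810.concoct_part_0\n", "ATTCCATATTTTGAACAC\n"]
    table = [
        "contigName\tGR25_megahit_BBmapped_sorted.bam\n",
        "k107_12810 flag=1 multi=4.0000 len=301\t5.7351\n",
    ]
    print(len(shared_ids(fasta, table)), "shared IDs")
```

An overlap of zero on the real files would point to an ID mismatch rather than to a problem with the coverage values themselves.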

Cheers, Joo-Young Thanks