BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

Questions: concoct_inputtable.tsv has all zeros, megahit options, checking .bam files #214

Closed franciscozorrilla closed 5 years ago

franciscozorrilla commented 5 years ago

Hello,

I have tried replacing velvet with megahit in my pipeline, and everything seems to work fine. However, when I generate the coverage table for the concoct input, I end up with a table of all zeros (aside from the length column). Also, there are an order of magnitude more contigs in concoct_inputtable.tsv when I use megahit than when I used velvet. Is this to be expected?

Is my empty coverage table perhaps the result of the bowtie alignment going wrong? Is there some quick way to double-check this? Or could it have something to do with the fact that gen_input_table.py expects velvet-format input?

Here are my relevant snakemake rules:

configfile: "config.yaml"

IDs = "Sample118_s1e5_R1.fa Sample120_s1e5_R1.fa Sample127_s1e5_R1.fa Sample134_s1e5_R1.fa Sample177_s1e5_R1.fa Sample215_s1e5_R1.fa Sample230_s1e5_R1.fa Sample234_s1e5_R1.fa Sample244_s1e5_R1.fa Sample261_s1e5_R1.fa Sample263_s1e5_R1.fa Sample290_s1e5_R1.fa Sample302_s1e5_R1.fa Sample321_s1e5_R1.fa Sample330_s1e5_R1.fa Sample343_s1e5_R1.fa".split()

rule all:
    input:"evaluation-output/ClusterPlot.pdf","evaluation-output/clustering_gt1000_scg.tab"
        #"evaluation-output/ClusterPlot.pdf","evaluation-output/clustering_gt1000_scg.pdf","concoct-output/clustering_gt1000_l.csv","all_R1.fa"
    shell:"snakemake --dag | dot -Tpng > pipemap.png"

rule mergeReads:
    output:"All_{read}.fa"
    shell:"cat {config[paths][raw_reads]}/Sample*_{wildcards.read}.fa  > {output}"

rule megahit:
    input:R1="All_R1.fa",R2="All_R2.fa"
    output:"megahitAssembly"
    shell:"megahit --min-count {config[megahit_params][mincount]} --k-list {config[megahit_params][klistdefault]} -1 {input.R1} -2 {input.R2} -o {output}"

rule cutcontigs:
    input:"megahitAssembly"
    output:"megahitAssembly/megahit_c10K.fa"
    shell:"set +u;source activate concoct_env;set -u;python {config[cutcontigs_params][script_dir]} -c {config[cutcontigs_params][chunk_size]} -o {config[cutcontigs_params][o]} -m {input}/final.contigs.fa > {output}"

rule bowtie:
    input:"megahitAssembly/megahit_c10K.fa"
    output:"map/{id}"
    shell:
        """
        export MRKDUP={config[bowtie_params][MRKDUP_jardir]};
        set +u;source activate concoct_env;set -u;
        bowtie2-build {input} {input};
        mkdir -p {output}
        echo $(basename /{output})
        cd {output};
        bash {config[bowtie_params][MRKDUP_shelldir]} -ct 1 -p '-f' {config[paths][raw_reads]}/$(basename /{output}) $(echo {config[paths][raw_reads]}/$(basename /{output}) | sed s/R1/R2/) pair {config[paths][concoct_run]}/{input} asm bowtie2;
        cd ../..;
        """

rule covtable:
    input: expand("map/{id}", id=IDs)
    output:"concoct-input/concoct_inputtable.tsv"
    shell:
        """
        set +u;source activate concoct_env;set -u;
        cd {config[paths][concoct_run]}/{config[bowtie_params][outputdir]}
        python {config[paths][CONCOCT]}/scripts/gen_input_table.py --isbedfiles \
            --samplenames <(for s in Sample*; do echo $s | cut -d'_' -f1; done) \
            ../megahitAssembly/megahit_c10K.fa */bowtie2/asm_pair-smds.coverage > concoct_inputtable.tsv;
        mkdir -p {config[paths][concoct_run]}/{config[covtable_params][outdir]};
        mv concoct_inputtable.tsv {config[paths][concoct_run]}/{config[covtable_params][outdir]}/;
        """

config.yaml


paths:
    CONCOCT: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/temp/repos/CONCOCT
    CONCOCT_TEST: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/temp/CONCOCT-test-data-0.3.2
    raw_reads: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/temp/CONCOCT-test-data-0.3.2/reads
    concoct_run: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/snakemake-concot

megahit_params:
    mincount: 2
    klistcomplex: 27,37,47,57,67,77,87
    klistdefault: 21,41,61,81,99

cutcontigs_params:
    chunk_size: 10000
    script_dir: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/temp/repos/CONCOCT/scripts/cut_up_fasta.py
    o: 0

bowtie_params:
    MRKDUP_jardir: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/.conda/envs/concoct_env/share/picard-2.18.4-0/MarkDuplicates.jar
    MRKDUP_shelldir: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/temp/repos/CONCOCT/scripts/map-bowtie2-markduplicates.sh
    outputdir: map

covtable_params:
    outdir: concoct-input

Thanks in advance!

alneberg commented 5 years ago

Hi @franciscozorrilla, my first instinct is that something might be going wrong in the mapping. Have you verified that the mapping actually worked, perhaps by inspecting some of the .bam files? I am afraid I have trouble reading the command you're using for mapping in your snakefile.
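For example, with samtools (the path below is just a guess based on your map/ rule output; adjust it to your actual layout):

```shell
# Example .bam path, assuming the map-bowtie2-markduplicates.sh output layout
BAM=map/Sample118_s1e5_R1.fa/bowtie2/asm_pair-smds.bam

samtools flagstat "$BAM"           # overall stats; a mapping rate near 0% would explain the zeros
samtools view "$BAM" | head -n 5   # eyeball a few alignment records
samtools index "$BAM"              # needed once before idxstats
samtools idxstats "$BAM" | sort -k3,3nr | head   # mapped read counts per contig, highest first
```

If flagstat reports almost nothing mapped, the problem is upstream of gen_input_table.py.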

franciscozorrilla commented 5 years ago

Hey @alneberg, thanks for your response.

Turns out I had a simple typo in my megahit rule. And yeah, my bad, the shell commands are pretty illegible without the config file; I've updated the code in my original post to include it.

My coverage table is no longer empty, but I still end up with an assembly containing far more (short) contigs when I use megahit than when I used velvet. Any idea why that is? Would you suggest increasing the default --min-contig-len from 200 to 1000?

Also, I would still like to check my .bam files; is there a particular tool or sanity check you would recommend for this purpose?

alneberg commented 5 years ago

Ah, great that the input file problem was resolved. I'm afraid I haven't tried Velvet in a really long time, so I am not really in a position to decide that for you. My understanding was that megahit can handle much larger datasets with a smaller memory footprint. Also, did you check whether the number of bases in long contigs changed between the two? Since CONCOCT ignores contigs shorter than 1000 bases by default, I would focus on the contigs longer than the cutoff you intend to use.
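A quick way to compare the two assemblies at the cutoff, for example with awk (the file name is taken from your snakefile; run the same command on the velvet assembly to compare):

```shell
# Count contigs and total bases at or above a length cutoff in a FASTA file
awk -v cutoff=1000 '
    /^>/ { if (len >= cutoff) { n++; bases += len }; len = 0; next }
    { len += length($0) }
    END { if (len >= cutoff) { n++; bases += len }
          printf "contigs>=%d: %d, bases: %d\n", cutoff, n, bases }
' megahitAssembly/megahit_c10K.fa
```

If the total bases above 1000 bp are similar between the two assemblers, the extra short contigs are mostly noise that CONCOCT will ignore anyway.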