NBISweden / nbis-meta

A snakemake workflow for metagenomic projects
https://nbis-metagenomic-workflow.readthedocs.io/
MIT License
14 stars 9 forks source link

Parameters for featureCounts #35

Open Dries-B opened 10 months ago

Dries-B commented 10 months ago

Currently the rule featurecount looks as follows:

    threads: 4
    params:
        setting=lambda wildcards: "-Q 10 -B -p" if wildcards.seq_type == "pe" else ""
    shell:
        """
        mkdir -p {params.tmpdir}
        featureCounts -a {input.gff} -o {output[0]} -t CDS -g gene_id -M  \
            {params.setting} -T {threads} --tmpDir {params.tmpdir} {input.bam} > {log} 2>&1
        """

I have some questions about this:

johnne commented 10 months ago

Hi Dries, I haven't looked at this rule specifically in a long time. I think that when I first wrote it I had been working with metagenomic samples where a lot of reads mapped to more than one location and I didn't want to disregard these reads as it was likely that they were 'real' due to highly similar genomes in the samples.

You're right about the -B flag, it may have been added to increase the stringency somewhat.

Regarding -O: yes but also I think for a metagenomic setting we may in fact want to count reads mapping to entire contigs instead of the features defined in the prodigal GFF file. Then give features on contigs the abundances of reads mapped to entire contigs. This should be handled differently for metatranscriptomics though.

I agree that the settings for this rule could be made more configurable by the user. Easiest may be to simply add a featureCounts: section to the config file where all flags (except perhaps -p can be configured):

featureCounts:
  settings: "--primary -O"
rule featurecount:
    input:
        gff=results+"/annotation/{assembly}/final_contigs.features.gff",
        bam=results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}"+POSTPROCESS+".bam"
    output:
        results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}.fc.tsv",
        results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}.fc.tsv.summary"
    log:
        results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}.fc.log"
    threads: 4
    params:
        tmpdir=config["paths"]["temp"],
        p=lambda wildcards: "-p" if wildcards.seq_type == "pe" else "",
        setting=config["featureCounts"]["settings"],
    resources:
        runtime=lambda wildcards, attempt: attempt**2*30
    conda:
        "../envs/quantify.yml"
    shell:
        """
        mkdir -p {params.tmpdir}
        featureCounts -a {input.gff} -o {output[0]} -t CDS -g gene_id {params.p} \
            {params.setting} -T {threads} --tmpDir {params.tmpdir} {input.bam} > {log} 2>&1
        """
Dries-B commented 9 months ago

Hi John,

Thanks for your reply! I should add that my questions are not necessarily proposals for improvements but rather 'assessments' of your opinion regarding changes in my repository, as well as sharing my understanding of the featureCounts documentation.

[Regarding -M] I think that when I first wrote it I had been working with metagenomic samples where a lot of reads mapped to more than one location and I didn't want to disregard these reads as it was likely that they were 'real' due to highly similar genomes in the samples.

Okay, I understand your reasoning there, although I think -M could risk duplicating counts compared to --primary.

I agree that the settings for this rule could be made more configurable by the user. Easiest may be to simply add a featureCounts: section to the config file where all flags (except perhaps -p can be configured)

Indeed this might improve the easy of use. This is not really necessary for my own use however, as I can modify the source code while looking at it.

Dries-B commented 9 months ago

Regarding -O: yes but also I think for a metagenomic setting we may in fact want to count reads mapping to entire contigs instead of the features defined in the prodigal GFF file. Then give features on contigs the abundances of reads mapped to entire contigs. This should be handled differently for metatranscriptomics though.

Okay, but for such a contig analyses, don't other simpler tools exist? (I myself am specifically interested in the Prodigal output.)

johnne commented 9 months ago

Okay, but for such a contig analyses, don't other simpler tools exist? (I myself am specifically interested in the Prodigal output.)

Sure, something like pileup.sh for instance.