Parameters for featureCounts

Dries-B commented 10 months ago

Currently the rule featurecount looks as follows:

    threads: 4
    params:
        setting=lambda wildcards: "-Q 10 -B -p" if wildcards.seq_type == "pe" else ""

    shell:
        """
        mkdir -p {params.tmpdir}
        featureCounts -a {input.gff} -o {output[0]} -t CDS -g gene_id -M  \
            {params.setting} -T {threads} --tmpDir {params.tmpdir} {input.bam} > {log} 2>&1
        """

I have some questions about this:

-M counts multi-mapping reads, but wouldn't it make more sense to use --primary instead, which only counts primary alignments?
-B only counts read pairs that have both ends aligned, but won't this risk losing the reads at the ends of contigs?
Wouldn't it make more sense to allow fragments to overlap multiple features by setting -O?
(I assume using -f for using feature-level instead of meta-feature level does not make a difference when analyzing Prodigal output.)
(I agree with the use of -p for counting fragments instead of reads, because otherwise single-read fragments would count less than paired-read fragments.)
(Less importantly: perhaps threads: 1 could be used, because this rule took very little time for me when running.)

johnne commented 10 months ago

Hi Dries, I haven't looked at this rule specifically in a long time. I think that when I first wrote it I had been working with metagenomic samples where a lot of reads mapped to more than one location and I didn't want to disregard these reads as it was likely that they were 'real' due to highly similar genomes in the samples.

You're right about the -B flag, it may have been added to increase the stringency somewhat.

Regarding -O: yes but also I think for a metagenomic setting we may in fact want to count reads mapping to entire contigs instead of the features defined in the prodigal GFF file. Then give features on contigs the abundances of reads mapped to entire contigs. This should be handled differently for metatranscriptomics though.

I agree that the settings for this rule could be made more configurable by the user. Easiest may be to simply add a featureCounts: section to the config file where all flags (except perhaps -p can be configured):

featureCounts:
  settings: "--primary -O"

rule featurecount:
    input:
        gff=results+"/annotation/{assembly}/final_contigs.features.gff",
        bam=results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}"+POSTPROCESS+".bam"
    output:
        results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}.fc.tsv",
        results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}.fc.tsv.summary"
    log:
        results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}.fc.log"
    threads: 4
    params:
        tmpdir=config["paths"]["temp"],
        p=lambda wildcards: "-p" if wildcards.seq_type == "pe" else "",
        setting=config["featureCounts"]["settings"],
    resources:
        runtime=lambda wildcards, attempt: attempt**2*30
    conda:
        "../envs/quantify.yml"
    shell:
        """
        mkdir -p {params.tmpdir}
        featureCounts -a {input.gff} -o {output[0]} -t CDS -g gene_id {params.p} \
            {params.setting} -T {threads} --tmpDir {params.tmpdir} {input.bam} > {log} 2>&1
        """

Dries-B commented 9 months ago

Hi John,

Thanks for your reply! I should add that my questions are not necessarily proposals for improvements but rather 'assessments' of your opinion regarding changes in my repository, as well as sharing my understanding of the featureCounts documentation.

[Regarding -M] I think that when I first wrote it I had been working with metagenomic samples where a lot of reads mapped to more than one location and I didn't want to disregard these reads as it was likely that they were 'real' due to highly similar genomes in the samples.

Okay, I understand your reasoning there, although I think -M could risk duplicating counts compared to --primary.

I agree that the settings for this rule could be made more configurable by the user. Easiest may be to simply add a featureCounts: section to the config file where all flags (except perhaps -p can be configured)

Indeed this might improve the easy of use. This is not really necessary for my own use however, as I can modify the source code while looking at it.

Dries-B commented 9 months ago

Regarding -O: yes but also I think for a metagenomic setting we may in fact want to count reads mapping to entire contigs instead of the features defined in the prodigal GFF file. Then give features on contigs the abundances of reads mapped to entire contigs. This should be handled differently for metatranscriptomics though.

Okay, but for such a contig analyses, don't other simpler tools exist? (I myself am specifically interested in the Prodigal output.)

johnne commented 9 months ago

Okay, but for such a contig analyses, don't other simpler tools exist? (I myself am specifically interested in the Prodigal output.)

Sure, something like pileup.sh for instance.

NBISweden / nbis-meta

Parameters for featureCounts #35