dahak-metagenomics / dahak

benchmarking and containerization of tools for analysis of complex non-clinical metagenomes.
https://dahak-metagenomics.github.io/dahak
BSD 3-Clause "New" or "Revised" License
21 stars 4 forks source link

Read filtering questions #93

Closed kternus closed 6 years ago

kternus commented 6 years ago

Hey @charlesreid1 or @brooksph,

As I was updating the read filtering workflow figure, I realized that I wasn't sure how things were happening. Could you help me out by answering these questions when you have a chance?

Does the user select any threshold for quality filtering, or are they limited to the either a binary conservative (2) or aggressive (30) choice?

Is the read filtering quality threshold selection happening at the quality_trimming step with Trimmomatic or the interleave_reads step with khmer? Or is an independent quality threshold selection happening at both steps?

Would the khmer step only happen if there were paired-end reads, or is khmer needed for additional read trimming for all samples regardless of single/paired-end status?

Thanks for your help!

-Krista

kternus commented 6 years ago

read_filtering

charlesreid1 commented 6 years ago

Does the user select any threshold for quality filtering, or are they limited to the either a binary conservative (2) or aggressive (30) choice?

Short answer: the user can specify any value they want.

Long answer: In the quality_trimming rule we call trimmomatic with a parameter called qual that represents the numerical quality value. The quality value ends up as part of the file name (something like _trim2 or _trim30), so the user can specify a different quality value by requesting a file like _trim5 or _trim10.

Here is the relevant snakemake rule/trimmomatic call from the read filtering set of rules:

rule quality_trimming:
    ....
    shell:
        'trimmomatic PE '
        '{params.quality_input_wc} '
        '{params.quality_output_wc} '
        'ILLUMINACLIP:/{adapter_file}:2:40:15 '
        'LEADING:{params.qual} '
        'TRAILING:{params.qual} '
        'SLIDINGWINDOW:4:{params.qual} '
        'MINLEN:25 '

The quality_input and quality_output bit is the name of the file, with _trimX added on (that way, each unique trim value maps to a unique file).

Is the read filtering quality threshold selection happening at the quality_trimming step with Trimmomatic or the interleave_reads step with khmer? Or is an independent quality threshold selection happening at both steps?

It happens with the quality_trimming step. The interleave_reads rule takes, as inputs, the names of the .fq.gz files that you want to interleave. These inputs to interleave_reads are already quality-trimmed, so they already have the quality parameter in their filename. If the trimming has not been performed, it triggers the quality_trimming rule to actually do the quality trimming first.

Would the khmer step only happen if there were paired-end reads, or is khmer needed for additional read trimming for all samples regardless of single/paired-end status?

I do not know the answer to this one, so will leave it to @brooksph or @ctb to answer.

kternus commented 6 years ago

Great, thanks for those explanations @charlesreid1!

In the read filtering documentation, we'll probably want to include some guidance about how users should select an appropriate qual parameter value (e.g., "2" is a conservative value recommended for assembly, "30" is an aggressive value recommended for taxonomic classification, "0" is an extreme value that could be used if all sequence data other than adapters needs to be retained for downstream analysis, "X" values are invalid). I'm sure @brooksph can help put together some guidance based on the work that he's done. He may also know of other resources/links describing how to interpret FastQC reports that would be good to include in the documentation. E.g., Is there a critical metric that users should look for in the FastQC report after the post_trimming_quality_assessment is completed, and if that metric isn't hit, should the user go back and change their original qual parameter until that metric is reached? Should the use of the read filtering workflow be a single-use event with a qual value determined by the type of downstream analysis the user wants to do, or is the selection of a qual value meant to be an iterative process towards achieving a specific QC goal?

It sounds like khmer is only being used for interleaving paired-end reads, and khmer would not do any additional trimming beyond what's specified with trimmomatic. So I think it would be safe to alter the workflow figure to look like this (others can correct me if this is wrong!): read_filtering

brooksph commented 6 years ago

Hi @kternus the figure looks great. +1 for Charles' comments. Khmer can be used for k-mer trimming but we are only using it for interleaving reads in our workflows. The interleaving step is optional and highly dependent upon downstream analysis. Some tools require the reads to be interleaved prior to input while others can do it on the fly. If the reads happen to overlap, the user might use a tool like Pandaseq to merge the reads prior to downstream analyses. We don't have any overlapping read datasets. Since interleaving can be slow, and it's optional, I think we should change the yes to 'yes/no' or 'optional'.

kternus commented 6 years ago

Ok, thanks @brooksph! If PANDAseq will be included in the dahak 1.0 release, then it seems like we should include it in the figure too. That way dahak users will know all of the possible read filtering options, even if it's not relevant to the Shakya et al. dataset.

This may not be a great layout for the figure, but here's Read_Filtering_v2.png!

read_filtering_v2

brooksph commented 6 years ago

Unfortunately, PANDAseq will not be part of Dahak 1.0.

kternus commented 6 years ago

Alright, how about we revert back to the first version of the figure with a minor edit, and then put a note in the documentation about how PANDAseq may be added and tested within dahak in the future?

_mondavi_illos_read_filtering

kternus commented 6 years ago

We should also make a note in the documentation about which dahak workflows/tools need the optional khmer step.

Thanks!

kternus commented 6 years ago

Here's a version of the updated read filtering figure without the colored shapes:

read_filtering

This version also includes MultiQC, since it sounds like that will be added to the read filtering workflow in the future.