dbeisser / Natrix2

Open-source bioinformatics pipeline for the preprocessing of raw amplicon sequencing / metabarcoding data.
MIT License

read counts throughout the pipeline #19

Closed. omarkr8 closed this issue 5 days ago

omarkr8 commented 1 month ago

Hi,

I tend to get massive read loss by the end: something like 1-5% of reads retained when my input is >100k reads. I understand this may all be quality filtering etc., but it would be nice to know the exact counts at each stage where read loss can occur.

I have two questions: 1) Similar to the read-tracking table DADA2 produces, a .csv showing how many reads each sample retained at key stages would help pinpoint where the biggest drops occur, and ideally there would be a way to toggle whether reads are discarded at those points. Could you tell me which stages these are and where to look for the read counts?

2) For the Pychopper step, I noticed it looks for reads with both primers and labels the others as unusable. Does this mean that only reads with both primers are carried forward? Is there a way to toggle this off? I imagine even a read with only one primer could be useful once clustered, if it is long enough.

tl;dr: it would be useful to know where read loss occurs, and to be able to make discarding reads optional at some of those points.

thanks

dusti1n commented 1 month ago

Hello @omarkr8,

Thank you for the detailed explanation. Read counting is not currently built into the pipeline, but you can use the tool seqkit to count reads manually at important points in the workflow. This means checking the intermediate files at critical stages, which will help you detect where the read loss occurs. Unfortunately, this cannot yet be enabled in the configuration, but I will make a note of it. Adjusting the quality filtering settings might also help you retain more reads.
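For example, `seqkit stats -T` on each intermediate FASTQ gives tabular counts, or a small script can collect them into the kind of per-stage .csv you describe. Below is a minimal sketch; the stage names and file paths are placeholders for the intermediate files of your own run, not actual Natrix2 output paths.

```python
import csv
import gzip
from pathlib import Path

def count_fastq_reads(path):
    """Count records in a FASTQ / FASTQ.GZ file (4 lines per record)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4

# Placeholder mapping of stage -> intermediate file; point these at the
# files produced by your own run.
stages = {
    "raw": "input/sampleA_R1.fastq.gz",
    "cutadapt": "results/assembly/sampleA_unitA/sampleA_unitA_R1_cut.fastq",
    "pychopper": "results/sampleA_pychopper_passed.fastq",
}

with open("read_counts.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["stage", "reads"])
    for stage, fastq in stages.items():
        writer.writerow([stage, count_fastq_reads(Path(fastq))])
```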

For the Pychopper step: by default, Pychopper only retains reads in which both primers (forward and reverse) are found. There is currently no option in Pychopper to bypass this strict requirement, and as far as I know there is no configuration setting in the workflow to change this behavior.

Our workflow currently uses the following parameters:

However, the discarded reads are saved and can still be analyzed; they are not completely lost. The removed reads are stored at:

Should you have any further questions or require additional assistance, please do not hesitate to contact us.

Best regards, Dustin

omarkr8 commented 1 month ago

Thanks, this is very useful to note!

You mentioned important points; could you list a few? I would check the starting FASTQ, FastQC, Pychopper, and the alignment/clustering step, but I'm unsure where exactly to find these outputs. Also, at which points is it a quality filter and at which a length filter? (E.g. looking at the Cutadapt script, it doesn't look like anything is discarded there, although that is where I would normally discard reads without a primer hit and reads that fall outside the length filter.)

I understand that the pipeline can be run step by step by running the script commands individually, but what is the correct/expected order from start to finish?

For example, could I run the pipeline sequentially up to and including Pychopper, then move reads from the unclassified output into the passed reads so that they are included in the downstream processes?

dusti1n commented 1 month ago

Thank you for your answer.

On the question of whether you can run the workflow up to and including Pychopper and then move the unclassified reads into the "passed reads": yes, but this requires manual quality control and additional processing of the unclassified reads. Keep in mind that the unclassified reads may not meet the same quality criteria as the classified reads, which is why they were unclassified in the first place. Careful manual quality control is therefore important to identify which of these reads are acceptable to reuse in the subsequent analyses.
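As a rough illustration only, a manual rescue step could look like the sketch below. The file names and cutoffs are placeholders, not values used by Natrix2, and a simple length plus mean-quality filter is just one possible criterion; adapt it to your own data.

```python
import gzip
from statistics import mean

# Hypothetical file names; substitute the actual Pychopper outputs of your run.
UNCLASSIFIED = "pychopper/sampleA_unclassified.fastq"
PASSED = "pychopper/sampleA_passed.fastq"
MIN_LEN = 1000     # example length cutoff, adjust to your amplicon
MIN_MEAN_Q = 10    # example mean Phred quality cutoff

def fastq_records(path):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ(.gz) file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            seq = fh.readline().rstrip("\n")
            plus = fh.readline()
            qual = fh.readline().rstrip("\n")
            yield header, seq, plus, qual

rescued = 0
with open(PASSED, "a") as out:
    for header, seq, plus, qual in fastq_records(UNCLASSIFIED):
        mean_q = mean(ord(c) - 33 for c in qual)  # Phred+33 encoding
        if len(seq) >= MIN_LEN and mean_q >= MIN_MEAN_Q:
            out.write(f"{header}{seq}\n{plus}{qual}\n")
            rescued += 1
print(f"Appended {rescued} rescued reads to {PASSED}")
```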

The exact order depends on your configuration and the specific requirements of your data, but it always starts with a thorough analysis of the raw data, including quality checks and adapter trimming. Take a look at the pipeline diagram. You can also create a DAG plot to see how your workflow is structured (including the input and output folders): first create the dataframe, then execute the Snakemake DAG command.
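For reference, generating the DAG plot could look like the sketch below. It assumes that snakemake and Graphviz's dot are on your PATH; the config file name is a placeholder for your own Natrix2 config.

```python
import subprocess

# snakemake --dag writes the workflow graph in DOT format to stdout;
# Graphviz's `dot` renders it. Adjust --configfile to your own config.
dag = subprocess.run(
    ["snakemake", "--dag", "--configfile", "your_config.yaml"],
    capture_output=True, text=True, check=True,
)
subprocess.run(
    ["dot", "-Tsvg", "-o", "dag.svg"],
    input=dag.stdout, text=True, check=True,
)
print("Wrote dag.svg")
```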

You can find the output locations of the built-in tools by looking at the output of the respective rule. For Cutadapt (the cutadapt rule is located in the file rule/read_assembly.smk), for example, the output is: your_results_folder/assembly/{{sample}}_{{unit}}/{{sample}}_{{unit}}_{read}_cut.fastq. This file contains the reads processed by Cutadapt. The same approach works for finding the output files of the other tools.
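For example, to see how many reads survive the Cutadapt stage for every sample, you could count the records in those files. The sketch below assumes the results folder name from the path above and uncompressed FASTQ output.

```python
from pathlib import Path

# "your_results_folder" is whatever output directory is set in your config.
results = Path("your_results_folder/assembly")

for fastq in sorted(results.glob("*/*_cut.fastq")):
    with open(fastq) as fh:
        n_reads = sum(1 for _ in fh) // 4  # 4 lines per FASTQ record
    print(f"{fastq.name}\t{n_reads}")
```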

I hope this will help you. Have a nice day.

Best regards, Dustin