FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

Trim galore config for pacbio data #185

Closed YutongLei2020 closed 5 months ago

YutongLei2020 commented 5 months ago

Hi,

I'm currently working on trimming PacBio fastq data and have encountered some challenges with finding the optimal configuration settings for Trim Galore. Initially, my configuration was as follows:

    trim_galore --trim-n \
        --quality 5 \
        --phred33 \
        --length 20 \
        --output_dir "/srv/disk00/leiy28/venus_longread_test/trim_test2/$(basename ${i})" \
        --gzip ${file}

However, this setup did not result in any of the reads being trimmed. After adjusting the quality setting to --quality 20, about one-third of the reads were trimmed. Despite this change, I'm uncertain whether this adjustment optimally suits the characteristics of PacBio data.

Could you provide any recommendations or guidance on configuring Trim Galore specifically for PacBio data? Additionally, how can I determine the most effective configuration settings for processing this type of data?

Thank you!
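One rough way to explore the question above is to estimate, before running anything, how many reads a given `--quality` cutoff would touch. The sketch below is only an illustration, not Trim Galore's or Cutadapt's actual code: it re-implements the BWA-style running-sum 3' trimming that the Cutadapt documentation describes, and it assumes Phred+33 qualities and plain four-line FASTQ records. The names `quality_trim_index`, `fraction_trimmed` and `read_qualities` are hypothetical helpers made up for this example.

```python
import gzip
import sys
from itertools import islice


def quality_trim_index(qualities, cutoff, base=33):
    """Return the index where a read would be cut at its 3' end.

    Sketch of BWA-style trimming: walk backwards from the 3' end,
    accumulate (cutoff - quality), and cut where that sum peaks.
    A return value equal to len(qualities) means "nothing trimmed".
    """
    running_sum = 0
    best_sum = 0
    stop = len(qualities)
    for i in reversed(range(len(qualities))):
        running_sum += cutoff - (ord(qualities[i]) - base)
        if running_sum < 0:
            break  # good bases now outweigh the bad ones seen so far
        if running_sum > best_sum:
            best_sum = running_sum
            stop = i
    return stop


def fraction_trimmed(quality_strings, cutoff, base=33):
    """Fraction of reads that would lose at least one 3' base."""
    total = 0
    trimmed = 0
    for qual in quality_strings:
        total += 1
        if quality_trim_index(qual, cutoff, base) < len(qual):
            trimmed += 1
    return trimmed / total if total else 0.0


def read_qualities(path, limit=10000):
    """Yield quality strings of the first `limit` reads.

    Assumes four lines per record (no wrapped sequence lines).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The quality string is every 4th line, starting at line 4.
        for line in islice(handle, 3, 4 * limit, 4):
            yield line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    quals = list(read_qualities(sys.argv[1]))
    for cutoff in (5, 10, 20, 30):
        print(cutoff, round(fraction_trimmed(quals, cutoff), 3))
```

Run on a sample of reads, this prints one line per cutoff, which makes it easier to see how sensitive the data is to the threshold. It only says how many reads would be shortened, not whether shortening them helps the downstream analysis.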

FelixKrueger commented 5 months ago

Hi @YutongLei2020

Trim Galore is really designed with Illumina data in mind, and I am afraid I am not familiar enough with PacBio data to give you any recommendations. Some quick googling around this topic didn't seem to return too many hits apart from "it really depends on the downstream applications....". I realise it's a bit cheesy to ask our new AI friends, but I was curious what they thought about it. Here it goes:

PacBio (Pacific Biosciences) sequencing data is known for its long reads, which can be beneficial for certain types of analyses such as de novo genome assembly or structural variant detection. However, PacBio data is also known for having a higher error rate compared to other sequencing technologies, such as Illumina. These errors are often random and are distributed across the length of the reads. Therefore, traditional trimming tools like Trim Galore, which are often used to remove low-quality ends of reads, are not typically used with PacBio data. Instead, specialized tools for PacBio data, such as the suite of tools provided by PacBio called SMRT Link, are often used. These tools include algorithms for error correction that are specifically designed for the error profile of PacBio data.

To be honest, this sounds pretty good to me....
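Whether low-quality bases really are spread along the reads rather than concentrated at the ends can be checked on the data itself. The following is a minimal sketch under the same assumptions as any quick check (Phred+33 quality strings, already extracted from the FASTQ file); `mean_quality_by_relative_position` is a hypothetical helper name, not part of any existing tool.

```python
def mean_quality_by_relative_position(quality_strings, bins=10, base=33):
    """Average Phred score in `bins` equal slices along each read.

    Reads of different lengths are mapped onto the same relative
    scale (bin 0 = 5' end, last bin = 3' end). Bins that receive
    no bases are reported as None.
    """
    sums = [0] * bins
    counts = [0] * bins
    for qual in quality_strings:
        n = len(qual)
        for i, ch in enumerate(qual):
            b = i * bins // n  # relative position, always < bins
            sums[b] += ord(ch) - base
            counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


if __name__ == "__main__":
    # Toy input: one read whose last two bases are low quality.
    print(mean_quality_by_relative_position(["IIII##"], bins=3))
```

A roughly flat profile would suggest that end trimming has little to remove, while a clear drop in the last bins would suggest the opposite.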

YutongLei2020 commented 5 months ago

Thank you for the response! I guess I should explore trimming tools specifically designed for long reads.