datacarpentry / wrangling-genomics

Data Wrangling and Processing for Genomics
https://datacarpentry.org/wrangling-genomics/

Consider updating the trimming tool to something a bit friendlier #232

Open JCSzamosi opened 2 years ago

JCSzamosi commented 2 years ago

When I taught this recently, we found that Trimmomatic added a lot of cognitive overhead for learners, because its syntax is so different from the other tools in the workshop and because it's so easy to make a typo and have to re-run everything. Some learners and instructors suggested switching to a tool with less hostile syntax. Suggestions included cutadapt and fastp.

sstevens2 commented 2 years ago

We also had a hard time with Trimmomatic in the lesson. Its very long set of arguments makes it easy to make a mistake while typing. I moved the command into a script, but we still had to troubleshoot a lot of errors caused by typos. I've not used the other tools suggested by @JCSzamosi, but if it could be swapped out, that would be ideal. I expect this change is something the CAC might need to discuss, but I would be happy to help with a PR on it.

harbi811 commented 1 year ago

I have used TrimGalore and found it a lot simpler to teach than Trimmomatic. TrimGalore is a wrapper around cutadapt and FastQC that can detect common standard adapters, such as the Nextera adapters used in the example. I am willing to contribute to rewriting this lesson using TrimGalore, which would make the lesson easier to understand.

For example, there would be no need to copy a file of adapter sequences into the current working directory, and the arguments are very descriptive. The command below is an equivalent of the lesson's example using TrimGalore. It also runs a FastQC step after trimming to check the quality of the trimmed reads, so a person can see whether the trimming was effective.

trim_galore --paired --phred33 --cores 4 --quality 20 --length 25 --nextera --fastqc SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz
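
Since typos come up repeatedly in this thread, one option is to keep the command in a small script where the sample name is typed only once. What follows is a minimal sketch, not part of the lesson: the `sample` and `dry_run` variables and the `DRY_RUN` environment switch are hypothetical names chosen for illustration, while the flags are copied from the command above. By default it only prints the assembled command.

```shell
#!/bin/sh
# Hypothetical wrapper: the sample prefix is typed once, file names are derived.
sample="SRR2589044"        # sample prefix from the example above
dry_run="${DRY_RUN:-1}"    # assumption: print the command unless DRY_RUN=0

# Assemble the same arguments as the command above.
set -- trim_galore --paired --phred33 --cores 4 \
    --quality 20 --length 25 --nextera --fastqc \
    "${sample}_1.fastq.gz" "${sample}_2.fastq.gz"

trim_cmd="$*"              # space-joined copy, used for display only
echo "$trim_cmd"

# Run for real only when DRY_RUN=0 is set in the environment.
if [ "$dry_run" = "0" ]; then
    "$@"
fi
```

Printing the command first gives learners a chance to spot a wrong file name before anything runs.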

JCSzamosi commented 1 year ago

I'm not a maintainer on this lesson, but I sure would welcome a fork like that even if it never gets merged back into the Carpentries curriculum. If you do create such a fork, please let us know here!

Also thanks for the tip about TrimGalore! I will check it out for my own use!

LandiMi2 commented 4 months ago

Yes, I agree. trim_galore is a lot simpler to teach. One thing to note is that some additional flags should be included, such as one that trims at least 10 bases from the 5' end, because there are some sequence biases at the 5' end of the current data (I am teaching this genomics course this week).
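
The 5' trimming suggested above could be sketched as follows. This assumes the Trim Galore options `--clip_R1` and `--clip_R2`, which remove a fixed number of bases from the 5' end of read 1 and read 2 respectively; the other flags are copied from the command earlier in the thread, and the `clip` variable is a hypothetical name. The snippet only assembles and prints the command, so the option names can be checked against `trim_galore --help` before running it.

```shell
#!/bin/sh
clip=10    # bases to remove from the 5' end, the minimum suggested above

# Same command as earlier in the thread, plus 5' clipping on both reads.
set -- trim_galore --paired --phred33 --cores 4 \
    --quality 20 --length 25 --nextera --fastqc \
    --clip_R1 "$clip" --clip_R2 "$clip" \
    SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz

clip_cmd="$*"    # space-joined copy, used for display only
echo "$clip_cmd"
```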