How do I run code faster? | Slow code execution on Raven for `fastqc ./*.fastq.gz` on 123 fastq.gz files

RobertsLab / resources

https://robertslab.github.io/resources/


How do I run code faster? | Slow code execution on Raven for `fastqc ./*.fastq.gz` on 123 fastq.gz files #1980

Closed sarahtanja closed 1 month ago

sarahtanja commented 1 month ago

I'm working on Raven within a conda environment to run FastQC, MultiQC, and fastp for QA/QC of RNA-seq data for 63 paired-end samples. This means I've got 123 fastq.gz files (each is ~1.5GB). It took 7+ hours to run `fastqc ./*.fastq.gz` yesterday to generate a FastQC HTML report for each fastq.gz file. Link to full script here

Is this normal for the number & size of files?

I'll need to continue working with this dataset, but hopefully faster...

Any tips on how I can run commands with this many files faster? (Will Mox/Klone UW HPC speed this up?)

kubu4 commented 1 month ago

You'll definitely want to specify a thread count for FastQC!!!

Example:

fastqc ./*.fastq.gz \
--threads ${threads} \
--quiet

FastQC will use 250MB of memory (RAM) per thread. Raven has 256GB of RAM, and has 48 available threads.
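To put those numbers together, here is a rough back-of-the-envelope sketch. It assumes the figures quoted in this thread (48 threads, 250MB per thread, 123 files) and relies on the fact that FastQC's `--threads` option sets how many files are processed at once. The variable names are only illustrative.

```shell
# Figures quoted in this thread; adjust for your own machine and dataset
threads=48          # threads available on Raven
mb_per_thread=250   # memory FastQC allocates per thread
n_files=123         # number of fastq.gz files

# Total memory FastQC would request when using every thread
total_mb=$(( threads * mb_per_thread ))

# Each thread handles one file at a time, so the files are worked through
# in roughly this many rounds (integer ceiling of n_files / threads)
batches=$(( (n_files + threads - 1) / threads ))

echo "Memory estimate: ${total_mb} MB across ${threads} threads"
echo "Approximate rounds of files: ${batches}"
```

So even with all 48 threads, the memory request is about 12GB, far below Raven's 256GB, and the 123 files get processed in about three rounds instead of one at a time.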

You'll probably also want to use the --quiet option if you're running this interactively (e.g. in RStudio). Without --quiet, FastQC prints TONS and TONS and TONS of lines to the screen, which will effectively lock up your RStudio instance.

And, for any subsequent software you use for any project, you'll want to see if they offer the option to specify a thread count. Most do offer this option.

Fastp offers the option, but it will only run a maximum of 16 threads.
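One way to handle that cap in a script is to derive a separate thread count for fastp from the general one. This is a minimal sketch: the 16-thread limit is the one stated above, `threads=48` is Raven's count from this thread, and `fastp_threads` is just an illustrative variable name. The result would be passed to fastp's `--thread` option.

```shell
threads=48        # general thread count used for other tools
fastp_max=16      # fastp's upper limit, as noted above

# Use the smaller of the two values for fastp
fastp_threads=$(( threads > fastp_max ? fastp_max : threads ))

echo "fastp will be run with --thread ${fastp_threads}"
```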

MultiQC does not offer the option, but is ridiculously fast, since it's really only scanning for output files (it's not actually doing any sort of "real" processing).

kubu4 commented 1 month ago

Side note:

You can also specify an output directory for FastQC with the --outdir ${trimmed_fastqs_dir} option. Then, you won't have to do the extra step of moving output files around, like you have in your script.
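A minimal sketch of that, assuming a hypothetical output directory name (`./fastqc_reports`) and Raven's 48 threads. FastQC does not create the output directory itself, so it is made first. The FastQC call is guarded so the snippet does nothing if FastQC or the input files are missing.

```shell
# Hypothetical output location; replace with your own path
trimmed_fastqs_dir="./fastqc_reports"
threads=48

# FastQC expects the output directory to exist already
mkdir -p "${trimmed_fastqs_dir}"

# Run only when FastQC is on the PATH and there are fastq.gz files to process
if command -v fastqc >/dev/null 2>&1 && ls ./*.fastq.gz >/dev/null 2>&1; then
  fastqc ./*.fastq.gz \
    --outdir "${trimmed_fastqs_dir}" \
    --threads "${threads}" \
    --quiet
fi
```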

kubu4 commented 1 month ago

Aaaand, another side note. Since you're going through this whole process, it might be helpful to glance at a notebook post I have, doing this exact thing!

https://robertslab.github.io/sams-notebook/posts/2024/2024-10-05-FastQC-Trimming-and-QC---A.pulchra-RNA-seq-from-Azenta-Project-30-1047560508-Using-fastp/

Notice during the fastp step that I trim 20bp from the 5' ends of each of the reads. You'll definitely want to do that!
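For reference, a 20bp 5' trim on paired-end reads could look something like the sketch below. This is not the script from the notebook post. The `_R1`/`_R2` file naming, the `./trimmed` output directory, and the `mate_of` helper are all assumptions made for illustration. The fastp options used (`--in1`, `--in2`, `--out1`, `--out2`, `--trim_front1`, `--trim_front2`, `--thread`, `--html`, `--json`) are standard fastp flags.

```shell
# Derive the R2 filename from an R1 filename
# (assumes a hypothetical *_R1.fastq.gz / *_R2.fastq.gz naming scheme)
mate_of() {
  printf '%s\n' "${1%_R1.fastq.gz}_R2.fastq.gz"
}

out_dir="./trimmed"   # hypothetical output directory
mkdir -p "${out_dir}"

for r1 in ./*_R1.fastq.gz; do
  [ -e "${r1}" ] || continue                  # glob matched nothing
  command -v fastp >/dev/null 2>&1 || break   # fastp not on the PATH
  r2=$(mate_of "${r1}")
  sample=$(basename "${r1}" _R1.fastq.gz)
  # Trim 20bp from the 5' end of both reads; write per-sample reports
  fastp \
    --in1 "${r1}" --in2 "${r2}" \
    --out1 "${out_dir}/${sample}_R1.fastq.gz" \
    --out2 "${out_dir}/${sample}_R2.fastq.gz" \
    --trim_front1 20 --trim_front2 20 \
    --thread 16 \
    --html "${out_dir}/${sample}.fastp.html" \
    --json "${out_dir}/${sample}.fastp.json"
done
```

Naming the HTML and JSON reports per sample keeps each run from overwriting the default `fastp.html` and `fastp.json` files.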

kubu4 commented 1 month ago

Also, if you don't want to deal with the conda environment stuff, we have all of these programs already installed on Raven...

Check /home/shared for most programs.

Conda (Mamba) installs are here: /home/sam/programs/mambaforge/bin/
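If you go that route, one option is to keep the install location in a bash variable and call programs by full path. A small sketch: the directory is the one given above, but whether a `fastqc` executable actually lives in it is an assumption, so the snippet checks before running anything.

```shell
# Directory quoted above
programs_dir="/home/sam/programs/mambaforge/bin"

# Assumed executable name and location; verify on Raven before relying on it
fastqc_bin="${programs_dir}/fastqc"

if [ -x "${fastqc_bin}" ]; then
  "${fastqc_bin}" --version
else
  echo "fastqc not found at ${fastqc_bin}; check /home/shared instead"
fi
```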

sarahtanja commented 1 month ago

Thank you for the link to your notebook post! I switched to using bash variables in R code chunks instead of running in a conda environment and was able to run the code faster and get the expected outputs!