sarahtanja closed 1 month ago
You'll definitely want to specify a thread count for FastQC!!!
Example:
fastqc ./*.fastq.gz \
--threads ${threads} \
--quiet
FastQC will use 250MB of memory (RAM) per thread. Raven has 256GB of RAM, and has 48 available threads.
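To put numbers on that, here's a quick sketch of the memory math, using only the figures above (250MB per thread, 48 threads, 256GB of RAM). The variable names are just for illustration, and I'm assuming 1GB = 1000MB to keep the arithmetic simple:

```shell
threads=48            # threads available on Raven (from above)
mb_per_thread=250     # FastQC memory per thread (from above)
raven_ram_mb=256000   # 256GB expressed in MB (assumes 1GB = 1000MB)

# Total memory FastQC would need if every thread were in use.
total_mb=$(( threads * mb_per_thread ))
echo "FastQC with ${threads} threads: ~${total_mb}MB of ${raven_ram_mb}MB"
```

So even at all 48 threads, FastQC would need roughly 12GB, which is a small fraction of Raven's RAM.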
You'll probably also want to use the --quiet option if you're running this interactively (e.g. in RStudio). Without --quiet, FastQC will print TONS and TONS and TONS of lines to the screen, which will effectively lock up your RStudio instance.
And, for any other software you use on any project, check whether it offers an option to specify a thread count. Most do.
Fastp offers the option, but it will only run a maximum of 16 threads.
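If you keep a single threads variable for the whole script, one way to handle the fastp limit is to cap it before passing it along. This is only a sketch: fastp_max_threads and fastp_threads are names I made up, and the 16 comes from the limit mentioned above. The command is echoed rather than run:

```shell
threads=48             # thread count you'd otherwise use on Raven
fastp_max_threads=16   # fastp's ceiling, per the note above

# Use the smaller of the two values.
fastp_threads=$(( threads < fastp_max_threads ? threads : fastp_max_threads ))
echo "fastp --thread ${fastp_threads}"
```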
MultiQC does not offer the option, but is ridiculously fast, since it's really only scanning for output files (it's not actually doing any sort of "real" processing).
Side note:
You can also specify an output directory for FastQC with the --outdir ${trimmed_fastqs_dir} option. Then you won't have to do the extra step of moving output files around, like you have in your script.
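Here's roughly what that could look like. The directory path is a placeholder, not something from your script. FastQC expects the output directory to already exist, so the sketch creates it first. The command is built as a string and echoed so you can check it before running it for real:

```shell
threads=48
trimmed_fastqs_dir="./fastqc_trimmed"   # hypothetical path, pick your own

# FastQC will not create the output directory itself.
mkdir -p "${trimmed_fastqs_dir}"

# Dry run: print the command instead of executing it.
fastqc_cmd="fastqc ./*.fastq.gz --threads ${threads} --outdir ${trimmed_fastqs_dir} --quiet"
echo "${fastqc_cmd}"
```

Once the printed command looks right, run it directly in place of the echo.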
Aaaand, another side note. Since you're going through this whole process, it might be helpful to glance at a notebook post I have, doing this exact thing!
Notice during the fastp step that I trim 20bp from the 5' ends of each of the reads. You'll definitely want to do that!
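For reference, a minimal sketch of a paired-end fastp call with that 20bp trim. I'm assuming fastp's --trim_front1 and --trim_front2 options here, and the file names are made up, so check the notebook post for the exact command I used. It's echoed as a dry run:

```shell
# Hypothetical file names for one paired-end sample.
r1="sample_R1.fastq.gz"
r2="sample_R2.fastq.gz"
trim_bp=20          # bases to trim from the 5' end of each read
fastp_threads=16    # fastp's maximum, per the note above

# Dry run: build and print the command instead of executing it.
fastp_cmd="fastp \
--in1 ${r1} \
--in2 ${r2} \
--out1 trimmed_${r1} \
--out2 trimmed_${r2} \
--trim_front1 ${trim_bp} \
--trim_front2 ${trim_bp} \
--thread ${fastp_threads}"
echo "${fastp_cmd}"
```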
Also, if you don't want to deal with the conda environment stuff, we have all of these programs already installed on Raven...
Check /home/shared for most programs.
Conda (Mamba) installs are here: /home/sam/programs/mambaforge/bin/
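One way to use those installs without activating a conda environment is to put that directory on your PATH for the session. This is a sketch, and it assumes the tool you want (fastqc is used as the example) actually lives in that bin directory, which you should confirm first:

```shell
mamba_bin="/home/sam/programs/mambaforge/bin"   # path from the note above

# Prepend for this shell session only.
export PATH="${mamba_bin}:${PATH}"

# Confirm which fastqc, if any, the shell will now pick up.
command -v fastqc || echo "fastqc not found on PATH"
```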
Thank you for the link to your notebook post! I switched to using bash variables in R code chunks instead of running in a conda environment and was able to run the code faster and get expected outputs!
I'm working on Raven within a conda environment to run FastQC, MultiQC, & fastp for QAQC of RNA-seq data for 63 paired-end samples. This means I've got 123 fastq.gz files (each is ~1.5GB). It took ~7+ hrs to run fastqc ./*.fastq.gz yesterday to generate fastqc.html reports for each fastq.gz file. Link to full script here.
Is this normal for the number & size of files?
I'll need to continue working with this dataset, but hopefully faster...
Any tips on how I can run commands with this many files faster? (Will Mox/Klone UW HPC speed this up?)