
RNASeq-DE

Description

RNASeq-DE is a highly scalable workflow that pre-processes Illumina RNA sequencing data for differential expression (raw FASTQ to counts) on the National Computational Infrastructure (NCI) Gadi supercomputer. The workflow was designed to efficiently process and manage large-scale projects (100s of samples sequenced over multiple batches).

In summary, the steps of this workflow include:

  1. Set up
  2. QC of raw FASTQs: FastQC and MultiQC to obtain quality reports on raw FASTQ files
  3. Trim raw FASTQs: BBDuk to trim 3' adapters and poly-A tails. [Optional] - QC trimmed FASTQs.
  4. Mapping: STAR for splice-aware alignment of RNA sequencing reads (FASTQ) to a reference genome
    • Prepare reference: STAR indexing for a given reference genome and corresponding annotation file
    • Perform mapping with trimmed FASTQs to prepared reference
    • Compress unmapped reads with pigz (optional)
    • Outputs: BAM per FASTQ pair (sequencing batch), unmapped reads, alignment stats
  5. Merge lane level to sample level BAMs: SAMtools to merge and index sample BAMs
    • Merge multiplexed BAMs into sample level BAMs. For samples which are not multiplexed, files are renamed/symlinked for consistent file naming.
    • Outputs: <sampleid>.final.bam and <sampleid>_<flowcell>.final.bam (and index files)
  6. Mapping metrics
    • RSeQC infer_experiment.py - infers library strandedness, which is required for counting. Output: per-sample reports, summarized by cohort in ../QC_reports/<cohort>_final_bams_infer_experiment/<cohort>.txt
    • RSeQC read_distribution.py - checks which features reads aligned to for each sample (summarized with multiqc). Expect ~30% of reads to align to CDS exons (provides total reads, total tags, total tags assigned. Groups by: CDS exons, 5' UTR, 3' UTR, Introns, TSS up and down regions).
    • [Optional] RSeQC bam_stat.py - for each BAM, print QC metrics (numbers are read counts) including: Total records, QC failed, PCR dups, Non primary hits, unmapped, mapq, etc (similar metrics are provided by STAR, but can be used on sample level BAMs).
    • [Optional] summarise_STAR_alignment_stats.pl: collates per-sample STAR metrics per flowcell-level BAM (use read_distribution for sample-level BAMs). Uses datasets present in a cohort.config file to find these BAMs. Inputs: per-sample *Log.final.out. Output: ../QC_reports/<cohort>_STAR_metrics.txt
    • [Optional] SAMtools idxstats: summarizes the number of reads per chromosome (useful for inferring sample sex; probably needs a bit more work)
  7. Raw counts: HTSeq
    • Count reads in <sampleid>.final.bam that align to features in your reference and create a count matrix for a cohort
    • Output: <sample>.counts per input BAM and a count matrix as <cohort>.counts
  8. Normalized counts: BAMtools/TPMCalculator
    • Obtain TPM normalized counts for gene and transcripts in your reference from <sampleid>.final.bam and create a TPM count matrix for a cohort
    • Output: per-sample TPM counts and cohort count matrices as TPM_TranscriptLevel.counts and TPM_GeneLevel.counts

User guide

Set up

The scripts in this repository use relative paths, so careful set up is required to ensure they can locate input files. To start:

git clone https://github.com/Sydney-Informatics-Hub/RNASeq-DE
cd RNASeq-DE

Required inputs and directory structure

To run the workflow, provide your raw FASTQ files (organised into one directory per sequencing batch), your reference genome(s) with corresponding annotation, and a .config file describing your cohort.

Your RNASeq-DE directory structure should match the following:

├── Batch_1
│   ├── sample1_1.fastq.gz
│   └── sample1_2.fastq.gz
├── Batch_2
│   └── sample2.fastq.gz
├── README.md
├── References
│   ├── GRCh38
│   │   └── Homo_sapiens.GRCh38.dna.primary_assembly.fa
│   │   └── Homo_sapiens.GRCh38.94.gtf
│   └── GRCm38
│       └── Mus_musculus.GRCm38.dna.toplevel.fa
│       └── Mus_musculus.GRCm38.98.gtf
├── cohort.config
└── Scripts

.config

The .config file is a tab-delimited text file that tells the scripts which samples to process, how to process them, and where to locate relevant input files. An example is provided below:

#FASTQ SAMPLEID DATASET REFERENCE SEQUENCING_CENTRE PLATFORM RUN_TYPE_SINGLE_PAIRED LIBRARY
sample1_1.fastq.gz SAMPLEID1 Batch_1 GRCh38 KCCG ILLUMINA PAIRED 1
sample1_2.fastq.gz SAMPLEID1 Batch_1 GRCh38 KCCG ILLUMINA PAIRED 1
sample2.fastq.gz SAMPLEID2 Batch_2 GRCm38 KCCG ILLUMINA SINGLE 1

A .config file can be created in a spreadsheet program such as Excel and saved as tab-delimited text.

Column descriptions for cohort.config:

#FASTQ: FASTQ filename. This column can be populated with ls *f*q.gz | sort -V in your sequencing batch directory. Paired files are recognised by conventional filenames - 1 or 2 at the end of the filename, before the fastq.gz extension, e.g. <sampleid>_<flowcell>_<tag>_<lane>_R<1|2>.fastq.gz
SAMPLEID: The sample identifier used in your laboratory. Sample IDs are used to name output files. Avoid whitespace.
DATASET: Directory name containing the sample FASTQs. This is typically analogous to the sequencing batch in which the FASTQ file was generated.
REFERENCE: Reference subdirectory name, e.g. GRCh38 or GRCm38 in the above example (case must match). Scripts will use the reference files (.fasta and .gtf) and STAR index files under this directory for alignment and counting.
SEQUENCING_CENTRE: e.g. AGRF. This is used in the read group header of the output BAM file for the aligned FASTQ. Avoid whitespace.
PLATFORM: e.g. ILLUMINA. This is used in the read group header of the output BAM file for the aligned FASTQ.
RUN_TYPE_SINGLE_PAIRED: SINGLE or PAIRED. This indicates whether the data is single-read or paired-end (used by BBDuk and STAR).
LIBRARY: The sequencing library of the FASTQ file. This is used in the read group header of the output BAM file for the aligned FASTQ. Use 1 if unknown. Avoid whitespace.
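If useful, the sketch below (not a repository script) sanity-checks a cohort.config before you run anything; it assumes the 8 tab-separated columns described above:

# Sketch only: report rows that do not have exactly 8 tab-separated columns
awk -F'\t' 'NR > 1 && NF != 8 {printf "line %d: %d columns (expected 8)\n", NR, NF}' cohort.config
# Flag SAMPLEID values containing whitespace (sample IDs name the output files)
awk -F'\t' 'NR > 1 && $2 ~ /[[:space:]]/ {print "line " NR ": whitespace in SAMPLEID"}' cohort.config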

Running the pipeline

Run all scripts from the Scripts directory once you have completed set up.

Generally, steps involve:

  1. Running a <task>_make_input.sh script to prepare for parallel processing.
    • This makes an inputs file, e.g. ./Inputs/<task>.inputs
    • This will often use your <cohort>.config file to know which files or samples you would like to process in a single job
  2. Running a <task>_run_parallel.pbs script.
    • This launches multiple tasks (e.g. ./Scripts/<task>.sh) in parallel
    • Each line of ./Inputs/<task>.inputs is used as input into a single <task>.sh
    • Compute resources should be scaled to the size of your data; recommendations are provided in the user guide. A serial sketch of this pattern is shown below.
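For orientation, the following shows a serial equivalent of this two-step pattern for a hypothetical task named fastqc; the real *_run_parallel.pbs scripts distribute these tasks across CPUs/nodes with nci-parallel and OpenMPI:

# Serial sketch only: each line of the inputs file drives one task
while IFS= read -r line; do
    sh ./Scripts/fastqc.sh "$line"   # how the line is parsed is defined by each task script
done < ./Inputs/fastqc.inputs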

The steps below explain how to process samples in cohort.config. Replace cohort.config with the path to your own .config file.

1. QC of raw FASTQs

This step performs FastQC to obtain quality reports per input FASTQ file. For multiple FASTQ files, reports can be summarized with MultiQC.

To run FastQC for all raw FASTQ files in your cohort.config file, create input file for parallel processing:

sh fastqc_make_input.sh cohort.config

Edit fastqc_run_parallel.pbs, adjusting the PBS directives (e.g. compute resources) to suit your data.

Submit fastqc_run_parallel.pbs to perform FastQC in parallel (1 FASTQ file = 1 fastqc.sh task):

qsub fastqc_run_parallel.pbs

Once fastqc_run_parallel.pbs is complete, you can summarize reports using:

sh multiqc.sh ../dataset_fastQC

2. Trim raw FASTQs

This step trims raw FASTQ files using BBDuk trim.

The task scripts bbduk_trim_paired.sh and bbduk_trim_single.sh trim 3' adapters and poly-A tails by default.
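For illustration only, a paired-end invocation of the kind bbduk_trim_paired.sh wraps might look like the following; the adapter/poly-A reference files and parameter values here are assumptions, not the script's exact settings:

# Assumed parameters: trim 3' adapters and poly-A tails from a read pair
bbduk.sh in1=sample1_1.fastq.gz in2=sample1_2.fastq.gz \
    out1=sample1_1.trimmed.fastq.gz out2=sample1_2.trimmed.fastq.gz \
    ref=adapters.fa,polyA.fa.gz ktrim=r k=13 mink=5 qtrim=r trimq=10 minlength=20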

To run BBDuk trim for FASTQ files in cohort.config, create input file for parallel processing:

sh bbduk_trim_make_input.sh cohort.config

Edit bbduk_trim_run_parallel.pbs, adjusting the PBS directives to suit your data.

Submit bbduk_trim_run_parallel.pbs. This launches parallel tasks (1 FASTQ pair = 1 input for bbduk_trim_paired.sh; 1 single-end FASTQ file = 1 input for bbduk_trim_single.sh):

qsub bbduk_trim_run_parallel.pbs

QC of trimmed FASTQs (optional)

You can check the quality of the data after trimming using:

sh fastqc_trimmed_make_input.sh cohort.config

Edit fastqc_trimmed_run_parallel.pbs as you did fastqc_run_parallel.pbs, then submit:

qsub fastqc_trimmed_run_parallel.pbs

3. Mapping

Preparing your reference for STAR

Each reference under the "REFERENCE" column in cohort.config needs to be prepared with STAR before mapping. For most projects, this will be a single reference genome, prepared once per project.

Edit the variables with the required inputs and parameters in star_index.pbs in the section shown below:

# Change dir, ref, gtf and overhang variables below
dir=../Reference/GRCh38
# ref and gtf files under dir
ref=Homo_sapiens.GRCh38.dna.primary_assembly.fa
gtf=Homo_sapiens.GRCh38.103.gtf
# sjdbOverhang = ReadLength-1 (for creating splice junction database)
overhang=149
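With those variables set, the indexing call the script is expected to run looks roughly like this sketch (the exact flags are defined in star_index.pbs):

# Sketch only: build a STAR index from the reference FASTA and GTF above
STAR --runMode genomeGenerate \
    --genomeDir "$dir" \
    --genomeFastaFiles "$dir/$ref" \
    --sjdbGTFfile "$dir/$gtf" \
    --sjdbOverhang "$overhang" \
    --runThreadN 16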

The default compute parameters are sufficient for human, mouse, or other similarly sized genomes.

Submit the job:

qsub star_index.pbs

Mapping trimmed reads to prepared reference

This step will map trimmed FASTQ files to a prepared reference genome using STAR.

To map all trimmed reads to the references specified in your cohort.config file, prepare inputs for parallel processing by:

sh star_align_trimmed_make_input.sh cohort.config

star_align_run_parallel.pbs runs the task scripts star_align_paired.sh and/or star_align_single.sh, depending on the RUN_TYPE_SINGLE_PAIRED value of each input.

Edit star_align_run_parallel.pbs, adjusting the PBS directives to suit your data.

Submit star_align_run_parallel.pbs. This launches parallel tasks (1 trimmed FASTQ pair = 1 input for star_align_paired_with_unmapped.sh; 1 trimmed single-end FASTQ file = 1 input for star_align_single_with_unmapped.sh):

qsub star_align_run_parallel.pbs
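For orientation, a per-pair alignment of the kind these task scripts wrap might look like the sketch below; the flags and read-group string are assumptions (the config columns SAMPLEID, SEQUENCING_CENTRE, PLATFORM, and LIBRARY feed the read group):

# Sketch only: align one trimmed pair, keeping unmapped reads as FASTQ
STAR --runMode alignReads \
    --genomeDir ../Reference/GRCh38 \
    --readFilesIn sample1_1.trimmed.fastq.gz sample1_2.trimmed.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMattrRGline ID:SAMPLEID1.flowcell SM:SAMPLEID1 CN:KCCG PL:ILLUMINA LB:1 \
    --outReadsUnmapped Fastx \
    --runThreadN 24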

Compress unmapped reads with pigz (optional)

STAR outputs unmapped reads as uncompressed FASTQ files. To save disk space, these can be compressed using pigz.

Edit pigz_run_parallel.pbs, adjusting the PBS directives to suit your data. Note that the task script pigz.sh will delete the original uncompressed file by default.

Submit your job by:

qsub pigz_run_parallel.pbs
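Per file, the task amounts to something like this (pigz, like gzip, removes the input file once compression succeeds; -p sets the thread count):

# Sketch only: compress one of STAR's unmapped-reads files
pigz -p 4 Unmapped.out.mate1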

4. Merging lane level BAMs into sample level BAMs

This step merges lane-level BAMs into sample-level BAMs (merging is skipped for samples that were not multiplexed; their files are renamed/symlinked instead). All final BAMs are organised into the cohort_final_bams directory and indexed with SAMtools.

To obtain final BAMs for sample IDs in cohort.config, create input file for parallel processing:

sh samtools_merge_index_make_input.sh cohort.config

Edit samtools_merge_index_run_parallel.pbs, adjusting the PBS directives to suit your data.

Submit the job by:

qsub samtools_merge_index_run_parallel.pbs
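The per-sample operations amount to something like the sketch below; file names follow the conventions above and -@ sets additional threads:

# Sketch only: merge flowcell-level BAMs into one sample-level BAM, then index it
samtools merge -@ 4 SAMPLEID1.final.bam SAMPLEID1_flowcell1.bam SAMPLEID1_flowcell2.bam
samtools index SAMPLEID1.final.bam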

5. Mapping metrics

RSeQC's infer_experiment.py

This step uses RSeQC's infer_experiment.py to infer the library strandedness (i.e. forward-stranded, reverse-stranded, or unstranded) of the protocol used to prepare the samples for sequencing. This is required for htseq-count.

The infer_experiment_final_bams.sh script processes multiple BAMs in parallel on the login node/command line. Run it with the path to the directory containing *.final.bam files, e.g. ../cohort_final_bams:

sh infer_experiment_final_bams.sh ../cohort_final_bams
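Under the hood, each BAM is checked with a call along these lines; the gene model BED file name here is hypothetical (RSeQC requires a reference gene model in BED format):

# Sketch only: report the fraction of reads consistent with each strand configuration
infer_experiment.py -i SAMPLEID1.final.bam -r GRCh38_genes.bed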

RSeQC's read_distribution.py

This step uses RSeQC's read_distribution.py to check the distribution of aligned reads across features (e.g. exon, intron, UTR, intergenic, etc).

To obtain read_distribution.py reports for sample BAMs in cohort.config, create input file for parallel processing:

sh read_distribution_make_input.sh cohort.config

read_distribution_run_parallel.pbs will run the task script read_distribution.sh with read_distribution.py's default settings.

Edit read_distribution_run_parallel.pbs, adjusting the PBS directives to suit your data, then submit it with qsub.

Once read_distribution_run_parallel.pbs is complete, you can summarize reports using:

sh multiqc.sh ../QC_reports/cohort_read_distribution

RSeQC's bam_stat.py (optional)

This step uses RSeQC's bam_stat.py to check alignment metrics of BAM files, including: Total records, QC failed, PCR dups, Non primary hits, unmapped, mapq, etc.

To obtain bam_stat.py reports for sample BAMs in cohort.config, create input file for parallel processing:

sh bam_stat_make_input.sh cohort.config

bam_stat_run_parallel.pbs will run the task script bam_stat.sh with bam_stat.py's default settings.

Edit bam_stat_run_parallel.pbs, adjusting the PBS directives to suit your data, then submit it with qsub.

Once bam_stat_run_parallel.pbs is complete, you can summarize reports using:

sh multiqc.sh ../QC_reports/cohort_final_bams_bam_stat

summarise_STAR_alignment_stats.pl

The summarise_STAR_alignment_stats.pl script will collate mapping information from STAR output for each aligned file.

Run the following command on the login node, on the command line:

perl summarise_STAR_alignment_stats.pl cohort.config

SAMtools idxstats

This step runs samtools idxstats for all BAMs in a directory and summarizes the reports with MultiQC.

Run this on the login node, providing the path to your directory containing BAM files, e.g.:

sh samtools_idxstats_final_bams.sh ../cohort_final_bams
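Each idxstats output line reports the reference name, sequence length, mapped reads, and unmapped reads. As a rough example of the sex inference mentioned above (assuming Ensembl-style chromosome names):

# Compare mapped read counts on the X and Y chromosomes
samtools idxstats SAMPLEID1.final.bam | awk '$1 == "X" || $1 == "Y"'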

6. Raw counts

This step uses htseq-count to count aligned reads across features in the reference genome.

Counts per sample

To obtain counts from final BAMs for sample IDs in cohort.config:

sh htseq-count_custom_make_input.sh cohort.config

Note: htseq-count_custom_make_input.sh will search for all *.final.bam files in cohort_final_bams, including sample flowcell-level BAMs if available. To use only sample-level BAMs, create inputs with htseq-count_make_input.sh instead.

htseq-count_run_parallel.pbs will run the task script htseq-count.sh with htseq-count's recommended settings.

Edit htseq-count_run_parallel.pbs, adjusting the PBS directives to suit your data, then submit it with qsub.
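For illustration, a single-sample call of the kind htseq-count.sh wraps might look like this; the flags shown are common htseq-count choices, not necessarily the script's exact settings, and -s must match your infer_experiment.py result:

# Sketch only: count reads per gene from a coordinate-sorted BAM
htseq-count -f bam -r pos -s reverse -t exon -i gene_id \
    SAMPLEID1.final.bam ../References/GRCh38/Homo_sapiens.GRCh38.94.gtf > SAMPLEID1.counts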

Cohort count matrix (optional)

This step creates a standard count matrix file (genes = rows, sample IDs = columns) from the htseq-count output. All samples within your .config file that have a <sampleid>.counts file available will be included in the final matrix.

For smaller cohorts (<100 samples), run on the login node:

perl htseq-count_make_matrix_custom.pl cohort.config

For very large cohorts (>100 samples), run it as a job instead: edit htseq-count_make_matrix_custom.pbs, adjusting the PBS directives to suit your data.

Submit the job by:

qsub htseq-count_make_matrix_custom.pbs

7. Normalize counts

This step quantifies transcript abundance for genomic features (gene, transcript, exon, intron) as transcripts per million (TPM) normalized counts.
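For reference, the standard TPM definition (not specific to this repository) is:

TPM_i = 10^6 * (r_i / l_i) / sum_j (r_j / l_j)

where r_i is the number of reads assigned to feature i and l_i is its length; the length units cancel in the ratio, and each sample's TPM values sum to one million.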

TPM counts per sample

To obtain TPM normalized counts across features for samples in cohort.config, create input file by:

sh tpmcalculator_make_input.sh cohort.config

tpmcalculator_run_parallel.pbs will run the task script tpmcalculator.sh. By default, this applies the TPMCalculator developers' recommended settings plus some additional settings.

Edit tpmcalculator_run_parallel.pbs, adjusting the PBS directives to suit your data, then submit it with qsub.
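For illustration, a per-sample call of the kind tpmcalculator.sh wraps might be the following; the flags are assumptions (-p restricts counting to properly paired reads, -a also prints features with zero reads):

# Sketch only: TPM quantification for one sample-level BAM
TPMCalculator -g ../References/GRCh38/Homo_sapiens.GRCh38.94.gtf \
    -b SAMPLEID1.final.bam -p -a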

TPM cohort count matrix (optional)

This step creates two TPM cohort count matrices, one at the gene level and the other at the transcript level.

Edit the tpmcalculator_make_matrix.pbs script, adjusting the PBS directives to suit your data, then submit the job:

qsub tpmcalculator_make_matrix.pbs

Benchmarking

Resource usage recorded on NCI Gadi for each job in a previous run of the workflow:

#JobName CPUs_requested CPUs_used Mem_requested Mem_used CPUtime CPUtime_mins Walltime_req Walltime_used Walltime_mins JobFS_req JobFS_used Efficiency Service_units(CPU_hours) Job_exit_status Date Time
bam_stat.o 112 112 512.0GB 269.54GB 208:45:39 12525.65 10:00:00 2:05:41 125.68 400.0MB 8.26MB 0.89 293.26 0 23/02/2022 11:16:58
bbduk_trim.o 48 48 190.0GB 169.67GB 22:36:20 1356.33 4:00:00 0:40:14 40.23 100.0MB 8.17MB 0.7 64.37 0 22/02/2022 15:29:34
fastqc.o 30 30 150.0GB 106.41GB 8:18:12 498.2 5:00:00 0:26:32 26.53 100.0MB 8.12MB 0.63 33.17 0 22/02/2022 15:16:39
fastqc_trimmed.o 30 30 150.0GB 97.6GB 8:12:31 492.52 2:00:00 0:25:08 25.13 100.0MB 8.12MB 0.65 31.42 0 22/02/2022 15:58:45
htseq-count.o 240 240 950.0GB 582.53GB 2562:06:24 153726.4 30:00:00 19:51:09 1191.15 500.0MB 8.33MB 0.54 9529.2 0 24/02/2022 4:58:30
htseq-count_matrix.o 1 1 32.0GB 16.02GB 0:02:53 2.88 5:00:00 0:03:45 3.75 100.0MB 0B 0.77 3 0 24/02/2022 9:03:09
multiqc_all_datasets_trimmed_fastQC.o 1 1 32.0GB 26.59GB 0:06:07 6.12 5:00:00 0:09:37 9.62 100.0MB 1.45MB 0.64 7.69 0 4/03/2022 15:28:28
pigz.o 28 28 128.0GB 33.98GB 1:25:20 85.33 2:00:00 0:07:15 7.25 100.0MB 8.1MB 0.42 4.23 0 22/02/2022 17:32:47
read_distribution.o 144 144 570.0GB 419.27GB 278:19:37 16699.62 10:00:00 2:28:54 148.9 300.0MB 8.4MB 0.78 714.72 0 23/02/2022 11:42:52
samtools_merge_index.o 48 48 190.0GB 99.15GB 86:30:00 5190 10:00:00 2:44:15 164.25 100.0MB 8.25MB 0.66 262.8 0 22/02/2022 20:18:34
star_align_trimmed_unmapped_out.o 240 240 950.0GB 736.62GB 42:09:54 2529.9 24:00:00 0:21:28 21.47 500.0MB 8.17MB 0.49 171.73 0 22/02/2022 15:59:35
tpmcalculator_transcript.o 768 768 2.97TB 2.63TB 978:52:39 58732.65 10:00:00 5:14:53 314.88 1.56GB 8.38MB 0.24 8061.01 0 23/02/2022 14:31:26
tpmtranscript_matrix.o 1 1 96.0GB 53.19GB 0:37:32 37.53 5:00:00 0:45:33 45.55 100.0MB 0B 0.82 109.32 0 24/02/2022 13:54:22

Workflow summaries

Metadata

Version: 1.0.0
Maturity: stable
Creators: Tracy Chew
Source: NA
License: GNU General Public License v3.0
Workflow manager: None
Container: None
Install method: Manual
GitHub: NA
bio.tools: NA
BioContainers: NA
bioconda: NA

Component tools

The software listed below is used in the RNASeq-DE pipeline. Some of it is installed globally on NCI Gadi (check module avail for the current software). Install the Python 3 packages with module load python3/3.8.5 followed by the relevant pip3 install commands; these will be installed in $HOME. All other software needs to be installed in your project's /scratch directory and made module-loadable.

openmpi/4.0.2 (installed globally)

nci-parallel/1.0.0a (installed globally)

SAMtools/1.10 (installed globally)

python3/3.8.5 (installed globally)

fastQC/0.11.7

multiqc/1.9

BBDuk/37.98

STAR/2.7.3a

rseqc/4.0.0

htseq-count/0.12.4

bamtools/2.5.1, TPMCalculator/0.0.4

Required (minimum) inputs/parameters

Help / FAQ / Troubleshooting

License(s)

GNU General Public License v3.0

Acknowledgements/citations/credits

Authors

Acknowledgements

Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities. We suggest including the following acknowledgement in any publications that follow from this work:

The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia.

Documentation was created following the Australian BioCommons documentation guidelines.

Cite us to support us!

Chew, T., & Sadsad, R. (2022). RNASeq-DE (Version 1.0) [Computer software]. https://doi.org/10.48546/workflowhub.workflow.152.1

References

Anders, S., Pyl, P.T., Huber, W., 2014. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btu638

Andrews, S. 2010. FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]

Aronesty, E. 2011. ea-utils : "Command-line tools for processing biological sequencing data"; https://github.com/ExpressionAnalysis/ea-utils

Bushnell, B. BBMap. https://sourceforge.net/projects/bbmap/

Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., Gingeras, T.R., 2012. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. https://doi.org/10.1093/bioinformatics/bts635

Ewels, P., Magnusson, M., Lundin, S., Käller, M., 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. https://doi.org/10.1093/bioinformatics/btw354

Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., & 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352

Vera Alvarez, R., Pongor, L.S., Mariño-Ramírez, L., Landsman, D., 2018. TPMCalculator: one-step software to quantify mRNA abundance of genomic features. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty896

Wang, L., Wang, S., Li, W., 2012. RSeQC: quality control of RNA-seq experiments. Bioinformatics. https://doi.org/10.1093/bioinformatics/bts356