Shell sript to preprocess single- or paired-end CAGE-sequencing data as generated by the SLIC-CAGE protocol (Cvetesic et al., 2018). The script requires fastq file(s) as input, trims off barcode sequences from the reads before mapping them to the corresponding genome with STAR
and finally generates bigWig files that serve as input for downstream analysis with PRIME
or CAGEfightR
. Optional pre-processing steps include 1) filtering of reads derived from abundant, ribosomal RNAs via rRNAdust
as recommended by the FANTOM consortium, 2) STAR mapping supported by personal variants that overlap the aigned reads for which vcf files would be required and 3) removal of Gs at the read's 5' end that don't match the genome sequence and are due to the 3' overhang produced by the Superscript reverse transcriptase during cDNA first-strand synthesis. Please see Overview of data processing steps for further details on all pre-processing steps.
1. Quality check before filtering using FastQC
. \
2. Trimming and filtering reads using fastp
. (Number of trimmed bases should match barcode length.) \
3. rRNA filtering with rRNAdust
& an rRNA blacklist (optional, e.g., human rDNA U13369.1). \
4. Quality check after filtering using FastQC
. \
5. Mapping reads using STAR
. \
6. Using VCF for variant-aware mapping (optional). \
7. Indexing the generated BAM file using samtools
. \
8. Calculating the alignment complexity using preseq
. \
9. Calculating alignment statistics using samtools
. \
10. Removing unmatched G additions (optional, recommended). \
- Identify and process reads with G additions on the 5' end using samtools
and bedtools
.
MultiQC
.PRIME
.[1.] The following software must be installed and permissions granted to execute them:
rRNAdust
(v1.02) Tool to filter abundant rRNA species.bedGraphToBigWig
(v4.0) Allows converting bedGraph to bigWig files.FastQC
(v0.12.1) Quality control tool for high throughput sequence data. fastp
(v0.23.4) A tool for quality control and filtering of sequencing data. STAR
(v2.7.3a) RNA-seq aligner for mapping reads to the genome. samtools
(v1.20.0) Tools for manipulating next-generation sequencing data in BAM format.preseq
(v2.0) For estimating library complexity.Perl
(v5.38.0) Required for running various scripts. openjdk
(v20.0.0) Java runtime environment.bedtools
(v2.31.0)) Set of tools for genome arithmetic.[2.] Paths for the following, pre-computed files need to be provided:
-g
, see Parameters). For instance, for the hg38 genome assembly, the corresponding human STAR genome index is prepared from the corresponding fasta and gtf files:
STAR \
--runMode genomeGenerate \
--runThreadN 3 \
--genomeDir hg38/STAR \
--genomeFastaFiles fasta/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
--sjdbGTFfile ./gencode.v43.chr_patch_hapl_scaff.annotation.gtf
-h Help message.
-f [STRING] Fastq.gz file(s). [required]
-g [STRING] Path to STAR index of reference genome. [required]
-b [INTEGER] Number of trimmed bases. [default=3]
-t [INTEGER] Number of threads used. [default=6]
-o [STRING] Output directory. [default=.]
-d [STRING] rDNA blacklist fasta file. [default=""]
-a [BOOL] Correct G additions. [default=true]
-v [BOOL] VCF path. [default=""]
./CAGE_STAR_mapping.sh \
-f *_001.fastq.gz \
-g /path/to/genome/STAR/ \
-t 3 \
-d /path/to/rDNA/blacklist/<blacklist>.fa \
-o . \
-a true
./CAGE_STAR_mapping.sh \
-f *R1_001.fastq.gz \
-f *R2_001.fastq.gz \
-g /path/to/genome/STAR/ \
-t 6 \
-d /path/to/rDNA/blacklist/<blacklist>.fa
-a false
To be added.