anderssonlab / PRIMEprep

Preprocessing and mapping of CAGE sequencing data
GNU General Public License v3.0
1 stars 0 forks source link

PRIMEprep

Shell sript to preprocess single- or paired-end CAGE-sequencing data as generated by the SLIC-CAGE protocol (Cvetesic et al., 2018). The script requires fastq file(s) as input, trims off barcode sequences from the reads before mapping them to the corresponding genome with STAR and finally generates bigWig files that serve as input for downstream analysis with PRIME or CAGEfightR. Optional pre-processing steps include 1) filtering of reads derived from abundant, ribosomal RNAs via rRNAdust as recommended by the FANTOM consortium, 2) STAR mapping supported by personal variants that overlap the aigned reads for which vcf files would be required and 3) removal of Gs at the read's 5' end that don't match the genome sequence and are due to the 3' overhang produced by the Superscript reverse transcriptase during cDNA first-strand synthesis. Please see Overview of data processing steps for further details on all pre-processing steps.

Overview of data processing steps

1.    Quality check before filtering using FastQC. \ 2.    Trimming and filtering reads using fastp. (Number of trimmed bases should match barcode length.) \ 3.    rRNA filtering with rRNAdust & an rRNA blacklist (optional, e.g., human rDNA U13369.1). \ 4.    Quality check after filtering using FastQC. \ 5.    Mapping reads using STAR. \ 6.    Using VCF for variant-aware mapping (optional). \ 7.    Indexing the generated BAM file using samtools. \ 8.    Calculating the alignment complexity using preseq. \ 9.    Calculating alignment statistics using samtools. \ 10.   Removing unmatched G additions (optional, recommended). \       - Identify and process reads with G additions on the 5' end using samtools and bedtools.

Output directories & content

Dependencies

[1.] The following software must be installed and permissions granted to execute them:

[2.] Paths for the following, pre-computed files need to be provided:

Parameters

-h              Help message.
-f   [STRING]   Fastq.gz file(s).                         [required]
-g   [STRING]   Path to STAR index of reference genome.   [required]
-b   [INTEGER]  Number of trimmed bases.                  [default=3]
-t   [INTEGER]  Number of threads used.                   [default=6]
-o   [STRING]   Output directory.                         [default=.]
-d   [STRING]   rDNA blacklist fasta file.                [default=""]
-a   [BOOL]     Correct G additions.                      [default=true]
-v   [BOOL]     VCF path.                                 [default=""]

Example for single-end data with G-correction

./CAGE_STAR_mapping.sh \
   -f *_001.fastq.gz \
   -g /path/to/genome/STAR/ \
   -t 3 \
   -d /path/to/rDNA/blacklist/<blacklist>.fa \
   -o . \
   -a true

Example for paired-end data without G-correction

./CAGE_STAR_mapping.sh \
   -f *R1_001.fastq.gz \
   -f *R2_001.fastq.gz \
   -g /path/to/genome/STAR/ \
   -t 6 \
   -d /path/to/rDNA/blacklist/<blacklist>.fa
   -a false

Citation

To be added.