Zhe-jiang / PRAISE

Bioinformatics guide and scripts for PRAISE, a quantitative pseudouridine sequencing method
4 stars 0 forks source link

PRAISE: A quantitative pseudoridine sequencing method

This is the bioinformatics guide and scripts for PRAISE, a quantitative pseudouridine sequencing method. Our paper is published on Nature Chemical Biology. \ For all the codes shown in this document, the parameters are the same as our published data. You can change suitable parameters for yourself.

0 Requirments

1 Overview of the bioinformatics pipeline of PRAISE

This part is a detailed bioinformatics guide for our quantitative pseudoridine sequencing method: PRAISE.

1.1 Prepare your reads

Here, we provide optimized reads pre-processing for three different library: eCLIP, KAPA, and Takara (A modified Takara library method, see experimental details in our Paper).

1.1.1 Takara Library

Takara Library is suitable for quantitative purposes of RNA with regular length (e.g. mRNA sequencing).

1.1.2 KAPA Library

Takara Library is suitable for semi-quantitative purposes of RNA with regular length (e.g. mRNA sequencing). From our data, the KAPA library result is pretty consistant with the Takara library result in whole transcriptome pseudouridine sequencing using PRAISE. However, if you want to get precise quantitative result, we still recomend Takara library since it has UMI for deduplication.

1.1.3 eCLIP Library

eCLIP Library is suitable for quantitative purposes of RNA with short length (e.g. tRNA sequencing).

1.1.4 Other libraries

1.2 Mapping your reads


IMPORTANT MESSAGES:

1.2.1 Mapping with realignment

The realignment will give you a more precise result but require an amount of computational resources. If you want a precise quantitative result, especially on genome-wide sequencing, it is better to do realignment.

hisat2 -p {CORES} \
-x {INDEX_HISAT2} \
-q --repeat --no-spliced-alignment --very-sensitive \
-U {input.processed_R2} \
-S {output.SAM}
# {CORES}: core number used
# {INDEX_HISAT2}: index needed for hisat2
# {input.processed_R2}: pre-processed read 2 file (.fq.gz)
# {output.SAM}: output SAM file (.SAM)
samtools sort -O BAM -o {output.bam} -@ {CORES} -m {MEM} -T {output.bam_temp} {input.sam}
# {CORES}: core number used
# {MEM}: Memory used per core (e.g. 2G)
# {output.bam}: output sorted BAM file (.BAM)
# {output.bam.temp}: temperory file
# {input.sam}: input SAM file (.SAM)
samtools index {input.bam} {output.bai}
# {input.bam}: Input sorted BAM file (.BAM)
# {output.bai}: Output index file (.BAM.BAI)
samtools view -h -@ {CORES} -bF 4 {input.bam_sorted} -o {output.bam_mapped}
# {CORES}: core number used
# {input.bam_sorted}: Input sorted BAM file (.BAM)
# {output.bam_mapped}: Output BAM file without unmapped reads (.BAM)
samtools sort -@ {CORES} -m {} -O BAM -n -o {output.bam_name_sorted} -T {output.bam_name_sorted.temp} {input.bam_mapped}
# {CORES}: core number used
# {MEM}: Memory used per core (e.g. 2G)
# {output.bam_name_sorted}: output name sorted BAM file (.BAM)
# {output.bam_name_sorted.temp}: temporary file
# {input.bam_mapped}: input BAM file without unmapped reads (.BAM)
python {Realign_script} --fast -t {CORES} -ms 4.8 \
-x {Reference} -i {input.bam_name_sorted} -o {output.bam_realigned} -f {output.bam_filtered}
# {Realign_script}: use realignment_forward.py for reads aligned to the forward sequences, use realignment_reverse.py for reads aligned to the reverse sequences
# {CORES}: core number used
# {Reference}: Reference used (.fa, same reference as the hisat2 index created from)
# {input.bam_name_sorted}: Output name sorted BAM file (.BAM)
# {output.bam_realigned}: Result BAM file of the realignment (.BAM)
# {output.bam_filtered}: BAM file contains all filtered reads (.BAM)

1.2.2 Mapping without realignment

1.3 Count your BAM file and call signals


Important messages:

1.3.1 Pipeline used in our paper for counting BAM file

samtools mpileup -d {depth} -BQ0 -f {REF} {input.bam} -o {output.mpileup} --ff UNMAP,QCFAIL -aa
# {depth}: max depth, 15000000 was used
# {REF}: reference, use the reference you used in mapping (.fasta)
# {input.bam}: input BAM file after mapping (.bam, need to be sorted)
# {output.mpileup}: output MPILEUP file (.mpilup)
python {Count_script} -p {CORES} -i {input.mpileup} -o {output.bmat}
# {Count_script}: use parse-mpileup.py
# {CORES}: core number used
# {input.mpileup}: input MPILEUP file (.mpileup)
# {output.bmat}: output BMAT file (.bmat)