NCI-RBL / iCLIP

RNA Biology Pipeline to Characterize protein-RNA Interactions
https://rbl-nci.github.io/iCLIP/
MIT License

determine the impact of seedPerwindowNmax on samples #119

Closed slsevilla closed 2 years ago

slsevilla commented 2 years ago

While processing RBL samples, errors were encountered relating to the seedPerWindowNmax value (set at 10000). To determine the impact of this value on multi-mapping stats, two samples were each run at several seedPerWindowNmax values.

Samples

seedPerWindowNmax values

Example code:

#!/bin/bash
module load STAR

sample_id="Ro7hr2_Clip"
seedPerWindowNmax="10"
# prefix STAR output files with the sample ID
sample_id_prefix="${sample_id}_"
tmp_dir="/lscratch/${SLURM_JOB_ID}"
export tmp_dir
# one output directory per tested seedPerWindowNmax value
out_dir="/data/sevillas2/star/${seedPerWindowNmax}"
mkdir -p "$out_dir"

# align reads; only --seedPerWindowNmax changes between runs
STAR --runMode alignReads --genomeDir /data/CCBR_Pipeliner/iCLIP/index/active/2022_0505/mm10/index \
--sjdbGTFfile /data/CCBR_Pipeliner/iCLIP/index/active/2022_0505/mm10/ref/gencode.vM23.annotation.gtf --readFilesCommand zcat \
--readFilesIn "/data/RBL_NCI/Wolin/mESC_clip_4_v2.0/01_preprocess/01_fastq/${sample_id}.fastq.gz" \
--outFileNamePrefix "$tmp_dir/${sample_id_prefix}" \
--outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --alignEndsType Local --alignIntronMax 50000 --alignSJDBoverhangMin 3 \
--alignSJoverhangMin 5 --alignTranscriptsPerReadNmax 10000 --alignWindowsPerReadNmax 10000 --outFilterMatchNmin 15 --outFilterMatchNminOverLread 0.9 \
--outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --outFilterMultimapNmax 10000 --outFilterMultimapScoreRange 0 --outFilterScoreMin 0 \
--outFilterType Normal --outSAMattributes All --outSAMunmapped None --outSJfilterCountTotalMin 3 1 1 1 --outSJfilterOverhangMin 30 12 12 12 \
--outSJfilterReads All --seedMultimapNmax 10000 --seedNoneLociPerWindow 20 --seedPerReadNmax 10000 --seedPerWindowNmax "$seedPerWindowNmax" --sjdbScore 2 --winAnchorMultimapNmax 10000

# move STAR files and final log file to output
mv "$tmp_dir/${sample_id_prefix}Aligned.sortedByCoord.out.bam" "$out_dir/${sample_id}.bam"
mv "$tmp_dir/${sample_id_prefix}Log.final.out" "$out_dir/${sample_id}.out"

# move mates to unmapped file
touch "$out_dir/${sample_id}.unmapped.out"
for f in "$tmp_dir/${sample_id_prefix}"Unmapped.out.mate*; do cat "$f" >> "$out_dir/${sample_id}.unmapped.out"; done
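
Since the script above hard-codes one sample and one seedPerWindowNmax value, repeating it for each combination could be scripted. The following is only a rough sketch, not what was actually run for this issue: it assumes a hypothetical script named star_sweep.sh that takes the sample ID as its first argument and the seedPerWindowNmax value as its second, and it only prints the sbatch commands so they can be checked before submission. Any resource flags the cluster needs (for example a local scratch allocation for /lscratch) would have to be added.

```bash
#!/bin/bash
# Hypothetical helper: print one sbatch command per sample/value pair.
# Assumes star_sweep.sh (a made-up name) reads the sample ID from $1
# and the seedPerWindowNmax value from $2.
sweep_commands() {
    local script="$1" samples="$2" values="$3"
    local sample value
    # $samples and $values are space-separated lists, split on purpose
    for sample in $samples; do
        for value in $values; do
            printf 'sbatch --job-name=star_%s_%s %s %s %s\n' \
                "$sample" "$value" "$script" "$sample" "$value"
        done
    done
}

# Dry run: inspect the printed commands, then pipe them to bash to submit.
# The values below are illustrative only, not the ones tested in this issue.
# sweep_commands star_sweep.sh "Ro7hr2_Clip" "10 100 1000"
```

Printing the commands instead of submitting them directly keeps the sweep easy to review, which matters when each job can take a long time at high seedPerWindowNmax values.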
slsevilla commented 2 years ago

Overview

Analysis

Unique reads

Spliced TOTAL number of reads

Spliced ANNOTATED number of reads

Multi-mapped reads

Unmapped reads - unmapped due to 'too many reads in windows'

Unmapped reads - unmapped due to 'other'
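
The categories above correspond to fields that STAR writes to Log.final.out, which the example script saves as /data/sevillas2/star/$seedPerWindowNmax/${sample_id}.out. A minimal sketch of how those fields could be collected into one table per sample is shown here. The function names star_metric and summarize_runs are hypothetical, the sketch assumes STAR's usual "label |<tab>value" log layout, and it is not necessarily how the comparison in this issue was produced.

```bash
#!/bin/bash
# Hypothetical helper: print the value of one metric from a STAR Log.final.out file.
# Assumes each metric line looks like "   <label> |<tab><value>".
star_metric() {
    local log_file="$1" label="$2"
    awk -F'|' -v label="$label" '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the label
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim padding around the value
            if (key == label) { print val; exit }
        }' "$log_file"
}

# Hypothetical helper: one tab-separated row per seedPerWindowNmax directory
# found under base_dir that contains <sample_id>.out.
summarize_runs() {
    local base_dir="$1" sample_id="$2" run_dir log_file
    printf 'seedPerWindowNmax\tunique_pct\tmulti_pct\tunmapped_other_pct\n'
    for run_dir in "$base_dir"/*/; do
        log_file="${run_dir}${sample_id}.out"
        [ -f "$log_file" ] || continue
        printf '%s\t%s\t%s\t%s\n' \
            "$(basename "$run_dir")" \
            "$(star_metric "$log_file" "Uniquely mapped reads %")" \
            "$(star_metric "$log_file" "% of reads mapped to multiple loci")" \
            "$(star_metric "$log_file" "% of reads unmapped: other")"
    done
}

# Example call, using the output layout from the example script:
# summarize_runs /data/sevillas2/star Ro7hr2_Clip
```

Other labels from Log.final.out (for example the splice counts) can be passed to star_metric in the same way, provided the label text matches the log exactly.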

Recommendation

The recommended parameter values are as follows:

These values should allow us to align samples quickly without overstretching our resources, while still maintaining a high number of unique reads and annotated splice junctions.