Xinglab / rmats-turbo

Other
228 stars 55 forks source link

Question about readLength parameter during reference creation and rMATS analysis #445

Open Motoufiq opened 2 weeks ago

Motoufiq commented 2 weeks ago

Hi,

I’m currently working on the RNA-Seq splicing analysis and have a question about parameter used to build reference STAR (https://github.com/alexdobin/STAR/tree/master) using GENCODE genome (https://www.gencodegenes.org/human/) which then would be related to input parameter for rMATS analysis. My paired end fastq files are 151 bp read length. As this is short read RNA-Seq splicing analysis with 151 bp, I was wondering if I should input --sjdboverhang 150 parameter was to build this STAR reference. NOTE: --sjdboverhang in the STAR specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database (in my case the PE read length is 151 bp), then my --sjdboverhang 150 (ReadLength-1: 151-1 = 150)

--runMode genomeGenerate
--genomeDir /path/to/genomeDir
--genomeFastaFiles /path/to/genome/fasta
--sjdbGTFfile /path/to/annotations.gtf
--sjdbOverhang ReadLength-1

I want to set the below paramater in rMATS:

--readLength READLENGTH The length of each read. Required parameter, with the value set according to the RNA-seq read length

It seems It’s quite common for Illumina sequencers to produce paired-end reads with a read length of 151 bp instead of the expected 150 bp. Is it OK to build STAR index --sjdboverhang 150, and does rMATS take input of read length of 151 base pair? or to build STAR index --sjdboverhang 149 instead and rMATS take input of read length of 150 bp ?

Additionally, if I don't input --sjdboverhang paramter, while building STAR INDEX, can I run RNA-Seq splicing analysis, and splicing analysis makes sense?

Thank you, Toufiq

EricKutschera commented 2 weeks ago

If you have 151bp reads then ideally you would create the STAR reference with --sjdbOverhang 150 and then run rMATS with --readLength 151. From the STAR manual, the exact value of --sjdbOverhang doesn't seem to matter too much: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value

For rMATS, --readLength needs to be the actual read length or the reads will get filtered out. However if you use --variable-read-length then the exact value of --readLength doesn't matter too much. This post has details about --readLength in rMATS: https://github.com/Xinglab/rmats-turbo/issues/83#issuecomment-766845772

Motoufiq commented 2 weeks ago

@EricKutschera thank you for the response.