Xinglab / rmats-turbo

Other
228 stars 55 forks source link

Detection and quantification of events involving novel splicing junctions #277

Open ArashDepp opened 1 year ago

ArashDepp commented 1 year ago

Enjoying the great tool, but I have another question: One of my sample had following stats of used and unused reads (following the 2-pass alignment strategy with STAR)

read outcome totals across all BAMs USED: 84178584 NOT_PAIRED: 3698345 NOT_NH_1: 34262716 NOT_EXPECTED_CIGAR: 373242 NOT_EXPECTED_READ_LENGTH: 2487676 NOT_EXPECTED_STRAND: 0 EXON_NOT_MATCHED_TO_ANNOTATION: 1047715 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 192307 CLIPPED: 0 total: 126240585

So I have two questions:

  1. rMATs does not use reads with EXON_NOT_MATCHED_TO_ANNOTATION and JUNCTION_NOT_MATCHED_TO_ANNOTATION condition. But since rMATs can detect and quantify splicing events involving new splicing junctions and splice sites, should not it also use those reads as these are the ones informing the new events?? Could you please solve this confusion?

  2. Secondly, progress report from STAR says that 87.5% reads had unique mapping and 9.5% were multi mappers. Here is the snapshot

    image

But rMATs counts 34262716 as not being NOT_NH_1: Which means nearly 34262716 / 126240585 = 27% reads are multi-mappers. Why this number NOT_NH_1 so much off the STAR calculation (~9.5% in the image)? samtools stats BAMFILE shows that there were 101268968 paired reads and 97749854 were properly paired. So I am happy that rMATs can use >80 percent of the properly paired reads. But just that the number 34262716 does not fit into picture with what STAR outputs.

Could you please clarify these issues?

ArashDepp commented 1 year ago

Regarding second question: one guess is that rMATs is counting the number of alignments (as opposed to reads). Is that right or make sense? Because the total number reported by rMATs is more than the number of reads in fasta file.

EricKutschera commented 1 year ago
  1. rMATS will detect two kinds of novel events that correspond to these two output files: https://github.com/Xinglab/rmats-turbo/tree/v4.1.2#output
    • fromGTF.novelJunction.[AS_Event].txt: Alternative splicing (AS) events which were identified only after considering the RNA (as opposed to analyzing the GTF in isolation). This does not include events with an unannotated splice site.
    • fromGTF.novelSpliceSite.[AS_Event].txt: This file contains only those events which include an unannotated splice site. Only relevant if --novelSS is enabled.

The novelJunction events are those where all the exons are annotated in the gtf, but the combination of exons is different. In those cases rMATS can find exons and splice sites in the gtf that explain the alignment, but there is no transcript in the gtf which uses the exact sequence of exons

The novelSpliceSite events use at least 1 splice site which is not in the gtf. The splice site has to be found by looking at an existing exon-exon junction in the gtf and moving the splice site subject to these parameters: --mil (minimum intron length (defaults to 50)) and --mel (maximum exon length (defaults to 500)). Basically rMATS can only use an alignment with a novel splice site if it can define a new exon by making an adjustment to an annotated exon subject to those parameter values

  1. Yes rMATS is counting individual alignments. If one read has 5 total alignments then rMATS will report it 5 times as NOT_NH_1
ArashDepp commented 1 year ago

Hello Eric. Thanks a lot for your answer and input. So if I understood your response correctly and based on following stats:

EXON_NOT_MATCHED_TO_ANNOTATION: 1047715 JUNCTION_NOT_MATCHED_TO_ANNOTATION: 192307

So 1047715 + 192307 reads do not match the definition of exon and intron length and are discarded by rMATs. But rMATs is utilizing the other reads which do not match the exon and junction annotation provided they are within the defined exon/intron length limits.

Also do you think the number 1047715 is too large? (it is ~1% of the total reads/alignments)

EricKutschera commented 1 year ago

Yes, rMATS is utilizing other reads which do not match the exon and junction annotation provided they are within the defined exon/intron length limits if --novelSS was used (--novelSS is not on by default)

Also rMATS only attempts to find a novel splice site for an alignment if that alignment includes a splice junction. The JUNCTION_NOT_MATCHED_TO_ANNOTATION reads are the alignments that had a splice junction, but rMATS could not find a suitable novel splice site and so rMATS decided to discard those reads. The EXON_NOT_MATCHED_TO_ANNOTATION alignments didn't have a splice junction in their alignment, the alignment was not within an annotated exon, and rMATS discarded them without trying to find a novel splice site. The novel splice site parameters (--mil and --mel) only apply to alignments with junctions

The number of alignments not matched to the annotation seems fine at 1%

ArashDepp commented 1 year ago

Okay. Thanks again for taking time to clarify these doubts and appreciate your help 👍