Xinglab / rmats-turbo

Other
228 stars 55 forks source link

To identify differential alternative splicing between two groups #320

Open jingxian555 opened 1 year ago

jingxian555 commented 1 year ago

Thanks for your tools! There are 2 sample groups with 3 BAM files per group,but I am counfused with the results, how to filter for differential alternative splicing between the experimental group and the control group? I'm looking at this result file [AS_Event].MATS.JC.txt, can I directly filter differential alternative splicing based on FDR? Thanks!

The executed script is as follows: python /home/sunpf/my_data/software/miniconda3/envs/new_rmats/rMATS/rmats.py --b1 u8.txt --b2 WT.txt --gtf Arabidopsis_thaliana.TAIR10.57.gtf -t paired --readLength 150 --novelSS --od out --tmp tmp

EricKutschera commented 1 year ago

Yes, you can filter on FDR to find significant differential alternative splicing events between the two groups. This post mentions some potential filters for other columns in the output https://github.com/Xinglab/rmats-turbo/issues/183#issuecomment-1054318031

jingxian555 commented 1 year ago

Thank you for your suggestions. Here is the code of python for advanced filtering, Is my understanding correct? And are these filtering criteria necessary? After filtering based on these criteria, there are only 30 significant differential alternative splicing events remaining. Can I just filter based on FDR < 0.05 to find significant differential alternative splicing events between the two groups? Thanks!

fout = open('significant.SE.MATS.JC.txt', 'w')
with open('SE.MATS.JC.txt','r') as fin:
    line = fin.readline()
    for line in fin:
        line = line.strip()
        arr = line.split('\t')
        IJC_SAMPLE_1 = arr[12]
        SJC_SAMPLE_1 = arr[13]
        IJC_SAMPLE_2 = arr[14]
        SJC_SAMPLE_2 = arr[15]
        FDR = arr[-4]
        IncLevel1 = arr[-3]
        IncLevel2 = arr[-2]

        IJC_SAMPLE_1_l = [int(IJC1) for IJC1 in IJC_SAMPLE_1.split(',')]
        IJC_SAMPLE_1_avg = sum(IJC_SAMPLE_1_l)/len(IJC_SAMPLE_1_l)

        SJC_SAMPLE_1_l = [int(SJC1) for SJC1 in SJC_SAMPLE_1.split(',')]
        SJC_SAMPLE_1_avg = sum(SJC_SAMPLE_1_l)/len(SJC_SAMPLE_1_l)

        IJC_SAMPLE_2_l = [int(IJC2) for IJC2 in IJC_SAMPLE_2.split(',')]
        IJC_SAMPLE_2_avg = sum(IJC_SAMPLE_2_l)/len(IJC_SAMPLE_2_l)

        SJC_SAMPLE_2_l = [int(SJC2) for SJC2 in SJC_SAMPLE_2.split(',')]
        SJC_SAMPLE_2_avg = sum(SJC_SAMPLE_2_l)/len(SJC_SAMPLE_2_l)

        IncLevel1_l = [float(I1) for I1 in IncLevel1.split(',') if I1 != 'NA']
        IncLevel1_avg = sum(IncLevel1_l)/len(IncLevel1_l)
        IncLevel1_range = max(IncLevel1_l) - min(IncLevel1_l)

        IncLevel2_l = [float(I2) for I2 in IncLevel2.split(',') if I2 != 'NA']
        IncLevel2_avg = sum(IncLevel2_l)/len(IncLevel2_l)
        IncLevel2_range = max(IncLevel2_l) - min(IncLevel2_l)

        if (IJC_SAMPLE_1_avg + SJC_SAMPLE_1_avg >= 10) and (IJC_SAMPLE_2_avg + SJC_SAMPLE_2_avg >= 10) and (float(FDR) < 0.05) and (IncLevel1_range > 0.05) and (0.05 < IncLevel1_avg < 0.95) and (IncLevel2_range > 0.05) and (0.05 < IncLevel2_avg < 0.95):
            fout.write(line + '\n')

fout.close()
EricKutschera commented 1 year ago

Yes, you can filter just based on FDR

For the code, I think the checks for the inclusion levels mentioned in that other post were intended to be done with all samples. Instead of requiring the PSI value to differ within a sample group like IncLevel1_range > 0.05, I think the original intention was to check that the PSI value had some variation when looking at all samples

jingxian555 commented 1 year ago

I apologize, but I'm having difficulty understanding your request. Can you just change the code? Thank you so much!

jingxian555 commented 1 year ago

I saw in the article that evaluate splicing defects by comparing the percentage spliced-in (PSI) index of AS events (P < 0.05 and delta PSI > 3%) , Is delta PSI the same as abs(IncLevelDifference)?Thank you~

EricKutschera commented 1 year ago

I'm not sure what article you're referring to, but IncLevelDifference is a difference in PSI values. Taking the absolute value of IncLevelDifference and calling that delta PSI is reasonable

For the code change

fout = open('significant.SE.MATS.JC.txt', 'w')
with open('SE.MATS.JC.txt','r') as fin:
    line = fin.readline()
    for line in fin:
        line = line.strip()
        arr = line.split('\t')
        IJC_SAMPLE_1 = arr[12]
        SJC_SAMPLE_1 = arr[13]
        IJC_SAMPLE_2 = arr[14]
        SJC_SAMPLE_2 = arr[15]
        FDR = arr[-4]
        IncLevel1 = arr[-3]
        IncLevel2 = arr[-2]

        IJC_SAMPLE_1_l = [int(IJC1) for IJC1 in IJC_SAMPLE_1.split(',')]
        IJC_SAMPLE_1_avg = sum(IJC_SAMPLE_1_l)/len(IJC_SAMPLE_1_l)

        SJC_SAMPLE_1_l = [int(SJC1) for SJC1 in SJC_SAMPLE_1.split(',')]
        SJC_SAMPLE_1_avg = sum(SJC_SAMPLE_1_l)/len(SJC_SAMPLE_1_l)

        IJC_SAMPLE_2_l = [int(IJC2) for IJC2 in IJC_SAMPLE_2.split(',')]
        IJC_SAMPLE_2_avg = sum(IJC_SAMPLE_2_l)/len(IJC_SAMPLE_2_l)

        SJC_SAMPLE_2_l = [int(SJC2) for SJC2 in SJC_SAMPLE_2.split(',')]
        SJC_SAMPLE_2_avg = sum(SJC_SAMPLE_2_l)/len(SJC_SAMPLE_2_l)

        IncLevel1_l = [float(I1) for I1 in IncLevel1.split(',') if I1 != 'NA']
        IncLevel2_l = [float(I2) for I2 in IncLevel2.split(',') if I2 != 'NA']

        all_IncLevel = IncLevel1_l + IncLevel2_l
        IncLevel_avg = sum(all_IncLevel)/len(all_IncLevel)
        IncLevel_range = max(all_IncLevel) - min(all_IncLevel)

        if (IJC_SAMPLE_1_avg + SJC_SAMPLE_1_avg >= 10) and (IJC_SAMPLE_2_avg + SJC_SAMPLE_2_avg >= 10) and (float(FDR) < 0.05) and (IncLevel_range > 0.05) and (0.05 < IncLevel_avg < 0.95):
            fout.write(line + '\n')

fout.close()
jingxian555 commented 10 months ago

Thank you for your previous advice; it was very helpful. Now, I have a new question. Is the '--b1' parameter for the BAM file of the experimental group? For example, in this "SE.MATS.JC.txt" file, does IncLevelDifference > 0 indicate a higher probability of SE events occurring in group b1 compared to b2? Thank you!

The executed script is as follows: python /home/sunpf/my_data/software/miniconda3/envs/new_rmats/rMATS/rmats.py --b1 u8.txt --b2 WT.txt --gtf Arabidopsis_thaliana.TAIR10.57.gtf -t paired --readLength 150 --novelSS --od out --tmp tmp

EricKutschera commented 10 months ago

From the README: https://github.com/Xinglab/rmats-turbo/tree/v4.2.0#output

IncLevelDifference: average(IncLevel1) - average(IncLevel2)

IncLevelDifference > 0 indicates that IncLevel1 > IncLevel2. Since IncLevel1 corresponds to --b1 and IncLevel2 to --b2, that would mean a higher inclusion level in --b1 as compared to --b2. For SE events a higher inclusion level means that the exon is included more often

Talking about events occurring in one group or the other can be confusing as mentioned in this post: https://github.com/Xinglab/rmats2sashimiplot/issues/68#issuecomment-910283252

jingxian555 commented 9 months ago

Thank you. Is rMATS used for alternative splicing analysis with uniquely mapped reads in BAM files?

EricKutschera commented 9 months ago

Yes, rMATS only uses uniquely mapped reads. rMATS will write a section to stdout saying how many reads were filtered out for different reasons. In that section NOT_NH_1 means the read was filtered out because it was not uniquely mapped. Here's a related post: https://github.com/Xinglab/rmats-turbo/issues/293

jingxian555 commented 9 months ago

Thank you!

  1. The gene AT5G50100 has only one transcript with 8 exonic regions. Why is there a discrepancy with the coordinate information in RI.MATS.JC.txt? exonic regions: AT5G50100

RI.MATS.JC.txt:

PSI
  1. Below is the image generated by rmats2sashimiplot, where the numbers on the image represent the read count on the junctions. Can you provide more detailed explanations? AS
EricKutschera commented 9 months ago

The rmats command you posted before included --novelSS. With that option rmats can detect events with splice sites that are not in the --gtf: https://github.com/Xinglab/rmats-turbo/issues/277#issuecomment-1486991997

rmats2sashimiplot doesn't show counts for novel junctions: https://github.com/Xinglab/rmats2sashimiplot/blob/v3.0.0/src/MISO/misopy/sashimi_plot/plot_utils/plot_gene.py#L148 This post discusses differences in read counts between rmats and rmats2sashimiplot https://github.com/Xinglab/rmats2sashimiplot/issues/33#issuecomment-656720560

jingxian555 commented 8 months ago

Thanks! I removed the --novelSS from the rmats command and filtered significant RI using these filters:

  1. Average PSI (IncLevel) within 0.05 and 0.95
  2. Average total read count (inclusion count + skipping count) >= 10
  3. max(PSI) – min(PSI) > 0.05
  4. FDR < 0.05
  5. abs(IncLevelDifference) > 0.2

The significant RI events I found all follow a pattern. For example, in the image below, the red boxes indicate significant RI. However, these significant RI regions belong to exonic regions in another transcript. That is, there are overlapping regions of exons and introns in different transcripts. Is this due to differences in exonic regions causing differences in intron retention?

RI

The executed script is as follows: python /home/sunpf/my_data/software/miniconda3/envs/new_rmats/rMATS/rmats.py --b1 u8.txt --b2 WT.txt --gtf Arabidopsis_thaliana.TAIR10.57.gtf -t paired --readLength 150 --od out --tmp tmp

EricKutschera commented 8 months ago

rMATS relies on the --gtf for RI events. Unless --novelSS is used, all of the RI events reported by rMATS should have the intron region overlap an annotated transcript in an exon region. This post has some details: https://github.com/Xinglab/rmats-turbo/issues/17