comprna / SUPPA

SUPPA: Fast quantification of splicing and differential splicing
MIT License
261 stars 62 forks source link

Is Alternative First the most common alternative splicing event? #76

Closed niradsp closed 4 years ago

niradsp commented 4 years ago

What I am seeing in the ioe file (generated from hg19 gtf file) is that Alternative First exons account for 42% of the splicing events. Skipped event exons follow at 23%. I was under the assumption that skipped event would be the most common event found in the annotation file? I was basing this on rmats data. Since rmats does not have alternative first, I wanted to make sure that AF is the most common type with suppa2.

EduEyras commented 4 years ago

Thanks for your message

Most calculations of the proportions of event types do not include alternative first usually because they are based on methods that do not include this type.

Older analyses (prior to RNA-seq) were based on less transcript-rich annotations or incomplete transcripts. People mostly focused on internal events.

Also, many people do not consider them as alternative splicing events sensu stricto, since they are coupled to TSS selection.

RNA-seq with the help of CAGE tags, and previously full-length mRNA sequencing projects (capped to polyA-tail) have brought up all this variability in the annotation.

In many of our studies, we find that expression changes are often driven by an isoform switch with a change in an alternative first exon, sometimes coupled with events downstream.

Perhaps this clarifies a bit your surprise?

best

E.

On Sat, 18 Jan 2020 at 00:56, niradsp notifications@github.com wrote:

What I am seeing in the ioe file (generated from hg19 gtf file) is that Alternative First account for 42% of the splicing events. Skipped event follows at 23%. I was under the assumption that skipped event would be the most common event found in the annotation file?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/76?email_source=notifications&email_token=ADCZKB4ZX5DRSPIIDRVIGQ3Q6G2JBA5CNFSM4KIIUZVKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4IG6ETZQ, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB42REEPM6EOKY7EG4DQ6G2JBANCNFSM4KIIUZVA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

niradsp commented 4 years ago

Thank you so much for the explanation. A5 as the most frequent event type is then to be expected. I have another question. I have annotation file that appears to be in UCSC format. You recommend using the pool-gene option if annotation is UCSC or Refseq. When I use pooling, it appears that some rownames have multiple gene ids(e.g. geneid1234_and_geneid4457). Is that expected? Close to 30% of the differential splicing genes had multiple gene ids. It appears that if I don't use pooling, I don't see multiple ids. Also, how do I go about finding novel splicing events? The gtf file that I have is all annotated. When I compare the PSI data with IOE, all transcript IDs exist. Here is what I was thinking. Run stringtie to assemble a GTF file, then use this GTF file with SUPPA2. Would this pipeline work without any issues?

EduEyras commented 4 years ago

Thank you so much for the explanation. A5 as the most frequent event type is then to be expected.

For SUPPA that depends on the annotation. Also, novel A5s might not be novel but maybe noisy splicing.

I have another question. I have annotation file that appears to be in UCSC format. You recommend using the pool-gene option if annotation is UCSC or Refseq. When I use pooling, it appears that some rownames have multiple gene ids(e.g. geneid1234_and_geneid4457). Is that expected? Close to 30% of the differential splicing genes had multiple gene ids. It appears that if I don't use pooling, I don't see multiple ids.

pool events will redefine genes as bags of transcripts that live on the same strand and share at least one exon splice-site. If two transcripts from the same gene have different gene labels in UCSC, those gene labels get combined.

Also, how do I go about finding novel splicing events? The gtf file that I have is all annotated. When I compare the PSI data with IOE, all transcript IDs exist. Here is what I was thinking. Run stringtie to assemble a GTF file, then use this GTF file with SUPPA2. Would this pipeline work without any issues?

Yes! as long as the GTF is standard. And can use the TPMs from the predicted transcripts.

best

E.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/76?email_source=notifications&email_token=ADCZKB33ZMGZKWOA7IJBP73Q6G5KHA5CNFSM4KIIUZVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJH2OOA#issuecomment-575645496, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB6TS56NB6LKDTUEETLQ6G5KHANCNFSM4KIIUZVA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

niradsp commented 4 years ago

Thank you! Final question, is there a way to visualize splicing events? I was hoping for a Sashimi-like plot. For example, MISO comes with Sashimi Plot, and there is a program called rmats2sashimi for rmats.
I also know that IGV has a Sashimi plugin. Can the coordinates in the diffsplice data be directly used with IGV?

EduEyras commented 4 years ago

We don’t have that because suppa does not use the read data (a bam), but suppa produces gtfs for the events to visualize in the ucsc browser

There is an easy 1-to-1 mapping between suppa ids and rmats ids, at lease for se and a5/a3 as far as I recall, so it should be fairly straightforward to reuse the same script to visualize sashimis

However, as suppa does not use the bam, you may see discrepancies between the bam and what suppa predicted from the tpms

Unless of course you used the bam to calculate the tpms, as you indicated with stringtie

I hope this helps

Cheers

E

On Sat, 18 Jan 2020 at 01:37, niradsp notifications@github.com wrote:

Thank you! Final question, is there a way to visualize splicing events? I was hoping for a Sashimi-like plot. For example, MISO comes with Sashimi Plot, and someone created a program called rmats2sashimi for rmats. I also know that IGV has a Sashimi plugin. Can the coordinates in the diffsplice data be directly used with IGV?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/76?email_source=notifications&email_token=ADCZKB3HPVKPEGS3AJSPJRLQ6G7CJA5CNFSM4KIIUZVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJH342Y#issuecomment-575651435, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB5DUO7MKTMSQERMVX3Q6G7CJANCNFSM4KIIUZVA .

--

niradsp commented 4 years ago

My current pipeline is using RSEM (which uses STAR to align the data first) to get the TPM value rather than Salmon. The BAM file is a transcript.bam file rather than genome BAM. Edit: Actually, there should be two bam files generated, one for transcript and the other for genome. But for some reason the genome bam file is 0 bytes (which is an issue with RSEM): https://github.com/alexdobin/STAR/issues/750

EduEyras commented 4 years ago

Hi, the transcript bam won't allow you to do sashimi plots.

On Sat, 18 Jan 2020 at 02:03, niradsp notifications@github.com wrote:

My current pipeline is using RSEM (which uses STAR to align the data first) to get the TPM value rather than Salmon. The BAM file is a transcript.bam file rather than genome BAM.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/76?email_source=notifications&email_token=ADCZKB5YGQ2AXKPTPN76N5TQ6HCFJA5CNFSM4KIIUZVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJH6PGQ#issuecomment-575661978, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB7QRX4FCU55K7FAMV3Q6HCFJANCNFSM4KIIUZVA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

niradsp commented 4 years ago

Thanks. I am in the process of regenerating the BAM files. For identifying isoform switches with SUPPA2, do I use the DTU option? It appears that DTU seems to have identified more transcripts than the splicing option, which is expected based on the tutorial. In your cancer splicing paper, you mention that you define switch as a pair of transcripts. How do I figure out these switches? Did you look at the sign of delta psi? So for example, if a transcript is showing delta psi of -0.2, and another transcript (of the same gene) is showing delta psi of 0.2 (both of which have significant p value), do you define that as a switch? In other words, the absolute value of delta psi needs to be equal (even if approximately), but the sign is different? I have noticed many transcript with opposite signs, which obviously has to be a switch. Now, there will be obviously cases where the switch isn't exactly -0.2 to +0.2. So perhaps I should look at difference in sign of dPSI and whether the transcripts are significantly spliced or not based on the p value. Second, for figuring out the actual splicing events responsible for these switches, do I do an overlap analysis with the files generated by using the splicing command? I am decided on using SUPPA2, but still trying to figure out the type of analysis I should conduct (whether I should look at DTU as well as splicing, figure out the switches, along with the splicing events, etc). Thank you for all your help. Nirad

EduEyras commented 4 years ago

Hi Nirad,

thanks for the email.

On Sun, 19 Jan 2020 at 01:18, niradsp notifications@github.com wrote:

Thanks. I am in the process of regenerating the BAM files. For identifying isoform switches with SUPPA2, do I use the DTU option?

Yes, that is one option. How many samples do you have per condition?

It appears that DTU seems to have identified more transcripts than the splicing option. In your cancer splicing paper, you mention that you define switch as a pair of transcripts. How do I figure out these switches? Did you look at the sign of delta psi? So for example, if a transcript is showing delta psi of -0.2, and another transcript (of the same gene) is showing delta psi of 0.2 (both of which have significant p value), do you define that as a switch? In other words, the absolute value of delta psi needs to be equal (even if approximately), but the sign is different?

If you have only a few samples per condition, that would be a way to look for switches. The |deltaPSI| does not need to be equal. As you mention, there are many transcripts with DTU, so often the signal gets split across multiple transcripts.

Second, for figuring out the actual splicing events responsible for these switches, do I do an overlap analysis with the files generated by using the splicing command?

In (Climente-Gonzalez et al. 2017) we showed that with isoforms you find more possible variations than those that can be described in terms of the events. So it may not always be that there is an event associated with a switch.

To check the events associated with switches you can check the last two columns in .ioe file and look for events that have only one of the isoforms in the switch appearing in the 1st of those columns, and the two isoforms in the 2nd column.

I am decided on using SUPPA2, but still trying to figure out the type of analysis I should conduct (whether I should look at DTU as well as splicing, figure out the switches, along with the splicing events, etc). Thank you for all your help. Nirad

You can do both, and then see whether you find something in DTU that is not in the events. To give you an example, in (Sebestyen et al. NAR 2015) we describe a switch in QKI that happens in lung adenocarcinoma. This switch is important and has been validated experimentally validated here https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5423218/ but it cannot be described in terms of a standard event. In fact, it was not observed here https://www.ncbi.nlm.nih.gov/pubmed/21041478 because they used microarrays with the standard events.

I hope it helps

E.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/76?email_source=notifications&email_token=ADCZKB3QI5ELIHM46TQWMWDQ6MFSRA5CNFSM4KIIUZVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJJZEOI#issuecomment-575902265, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB73AFOFCRMFK22YFQTQ6MFSRANCNFSM4KIIUZVA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

niradsp commented 4 years ago

Hello Eduardo, Thank you for the response. Right now, we are testing on a pilot study. So currently we have 2 samples per condition. However, the actual study I think may possibly have 60 samples. Furthermore, I am going to try to incorporate DTU/Splicing analysis on any RNA-Seq data that I receive from various projects. Do you also think that Suppa2 works fine with Oxford Nanopore? In this case, I will probably use Salmon with long read option, and then run SUPPA2 on the TPM data. You mentioned that with a few samples, the above strategy works. I will look for opposite signs and see if there is significance in terms of p-value. What would the strategy be, for say, 100s of samples per condition?
Also, let's suppose that I find two or more pairs of isoform switches in a gene. What would you do in that case? Discard the pair(s) with lower significance and keep the other? I think in your 2017 cancer paper, you kept just the most significant pair?

Thanks in advance, Nirad

EduEyras commented 4 years ago

Hi,

sorry for the late reply. Your plan sounds really cool.

For quantification with Nanopore you can also use now our new method: https://github.com/comprna/RATTLE

We have not adapted it yet to work with multiple samples, i.e., defining the same transcript or gene variables across samples, but will do so soon, and then it should be easy to link to SUPPA.

For reference-based studies with Nanopore, rather than using Salmon-LR, you could try FLAIR or StringTie2, they work reasonably well, and will give you GTFs and TPMs for known and novel transcripts.

Then you can apply directly SUPPA for events or for transcripts.

You could compare the two sample groups as usual with the Wilcoxon test.

You could also use the SPADA software (https://github.com/hclimente/spada) which is the reincarnation of smartas, optimized and streamlined. It should allow you to define the groups, and calculate a test per sample, e.g. each tumor vs the pool of controls. If that makes sense in your setting. That would be really interesting to do.

Best

Eduardo

On Mon, 20 Jan 2020 at 01:28, niradsp notifications@github.com wrote:

Hello Eduardo, Thank you for the response. Right now, we are testing on a pilot study. So currently we have 2 samples per condition. However, the actual study I think may possibly have 60 samples. Furthermore, I am going to try to incorporate DTU/Splicing analysis on any RNA-Seq data that I receive from various projects. Do you also think that Suppa2 works fine with Oxford Nanopore? In this case, I will probably use Salmon with long read option, and then run SUPPA2 on the TPM data. You mentioned that with a few samples, the above strategy works. I will look for opposite signs and see if there is significance in terms of p-value. What would the strategy be, for say, 100s of samples per condition? Also, let's suppose that I find two or more pairs of isoform switches in a gene. What would you do in that case? Discard the pair(s) with lower significance and keep the other? I think in your 2017 cancer paper, you kept just the most significant pair?

Thanks in advance, Nirad

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/76?email_source=notifications&email_token=ADCZKB2G5GOEB5BHIQE4PPDQ6RPPJA5CNFSM4KIIUZVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJKTPMQ#issuecomment-576010162, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB44SUF4NLVL7L2NVC3Q6RPPJANCNFSM4KIIUZVA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

niradsp commented 4 years ago

Hello Eduardo, Thank you for your help. I have around 60 samples: ~30 control and 30 case. Each sample is 400 million reads each--so these are high read samples. I ran Suppa2 by dividing the tpm data into control and case. The total number of significant events that I am getting is only in the 30's with p-value adjusted of 0.05 and dPSI>0.05. This is essentially the command I used (i.e. the one in the tutorial). Anything you would change below?

python ~/SUPPA/suppa.py diffSplice -m empirical -gc -i ./annotation/ensembl_hg19.events.ioe -p TRA2_KD_events.psi TRA2_NC_events.psi -e TRA2_KD_iso.tpm TRA2_NC_iso.tpm -o TRA2_diffSplice

For isoform: python ~/SUPPA/suppa.py diffSplice -m empirical -gc -i ./ensembl_hg19.isoforms.ioi -p ./TRA2_KD_iso.psi ./TRA2_NC_iso.psi -e ./TRA2_KD_iso.tpm ./TRA2_NC_iso.tpm -o ./TRA2_diffSplice_iso

The parameters above may be too strict, so I am wondering what other parameter I should tweak. Should I use the -l option, and use something like -l 0.05? Furthermore, can this parameter be used with both isoform and event commands I highlighted above?

EduEyras commented 4 years ago

Hi,

There are very few significant cases because the -m empirical will find events that are significantly above the observed variability, which is often a lot. You could try -m classical, which performs a Mann-Whitney test

The -l by default is 0, which will test all deltaPSIs. What you could do is to increase this threshold, e.g. -l 0.05 so that every event / DTU below thiis deltaPSI will be thrown away. Then you have less tests. When you correct for multiple testing the -m classical results, then there are fewer tests to correct, and you may improve your detection power since you are basically throwing away things that you know a priori can never be significant.

Let me know if this helps

cheers

Eduardo

On Tue, 17 Mar 2020 at 09:47, niradsp notifications@github.com wrote:

Hello Eduardo, Thank you for your help. I have around 60 samples ~30 control and 30 case. Each sample is 400 million reads each--so these are high read samples. I ran Suppa2 by dividing the tpm data into control and case. I used the following parameters. The total number of significant events that I am getting is only in the 30's with p-value adjusted of 0.05 and dPSI>0.05. This is essentially the command I used (i.e. the one in the tutorial). Anything you would change below?

python ~/SUPPA/suppa.py diffSplice -m empirical -gc -i ./annotation/ensembl_hg19.events.ioe -p TRA2_KD_events.psi TRA2_NC_events.psi -e TRA2_KD_iso.tpm TRA2_NC_iso.tpm -o TRA2_diffSplice

For isoform: python ~/SUPPA/suppa.py diffSplice -m empirical -gc -i ./ensembl_hg19.isoforms.ioi -p ./TRA2_KD_iso.psi ./TRA2_NC_iso.psi -e ./TRA2_KD_iso.tpm ./TRA2_NC_iso.tpm -o ./TRA2_diffSplice_iso

The parameters above may be too strict, so I am wondering what other parameter I should tweak. Should I use the -l option, and use something like -l 0.05? Furthermore, can this parameter be used with both isoform and event commands I highlighted above?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/76#issuecomment-599791941, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB7VKZZY7EU4SVDG4FDRH2T7BANCNFSM4KIIUZVA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

niradsp commented 4 years ago

Hello Eduardo, I am now getting the following error message. Any idea why?
Calculating differential analysis between conditions: non_PAD_events and PAD_events /usr/local/Anaconda/envs_app/suppa/2.3/lib/python3.6/site-packages/scipy/stats/stats.py:5700: RuntimeWarning: divide by zero encountered in double_scalars z = (bigu - meanrank) / sd ERROR:main:Unknown error: (<class 'TypeError'>, TypeError("calculate_delta_psi() missing 1 required positional argument: 'nan_th'",), <traceback object at 0x2aaad1438c48>)

My command is as follows: suppa.py diffSplice --method classical -i /data/banskotan2/Annotation/ensembl_hg19.events.ioe -p /data/banskotan2/suppa/events/non_P_events.psi /data/banskotan2/suppa/events/P_events.psi -e /data/banskotan2/suppa/non_P_tpm.txt /data/banskotan2/suppa/P_tpm.txt --lower-bound 0.05 -gc -o classical_P_events_diffsplice

The empirical command ran with no problem: suppa.py diffSplice -m empirical -gc -i ../Annotation/ensembl_hg19.events.ioe -p events/non_P_events.psi events/P_events.psi -e non_P_tpm.txt P_tpm.txt -o P_events_diffspliace --save_tpm_event

I tried changing a few things in the first command (for example -m to --method and -l instead of lower-bound) but for some reason, the program halts due to the divide by zero error.

Thanks, Nirad

niradsp commented 4 years ago

Actually, when I downloaded the github version, the program ran with no problem (the classical appears to be very fast--done in less than 5 minutes). This warning still appears. Is that okay? /home/banskotan2/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py:5700: RuntimeWarning: divide by zero encountered in double_scalars z = (bigu - meanrank) / sd

Nirad

EduEyras commented 4 years ago

Hi,

there might be events with sd = 0, and the stats.py package is complaining

This means that there may be events with low variation that you could still skip

Have you tried increasing the threshold for diffSplice --lower-bound ?

cheers

E.

On Wed, 18 Mar 2020 at 01:30, niradsp notifications@github.com wrote:

Actually, when I downloaded the github version, the program ran with no problem (the classical appears to be very fast--done in less than 5 minutes). This warning still appears. Is that okay? /home/banskotan2/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py:5700: RuntimeWarning: divide by zero encountered in double_scalars z = (bigu - meanrank) / sd

Nirad

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/76#issuecomment-600102532, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKBZHERUOJZHINHNWDILRH6CPJANCNFSM4KIIUZVA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

niradsp commented 4 years ago

I had the threshold set to 0.05. Should I increase to 0.10?

Thanks, Nirad

EduEyras commented 4 years ago

Ah, then it's strange that is complaining, as I would not expect to see sigma=0. Maybe you can try increasing the threshold, or taking a look first at those cases with no apparent variation. E.

On Wed, 18 Mar 2020 at 23:42, niradsp notifications@github.com wrote:

I had the threshold set to 0.05. Should I increase to 0.10?

Thanks, Nirad

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/76#issuecomment-600600761, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKBYJ62AHGM4SOZDFG6TRIC6SLANCNFSM4KIIUZVA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

niradsp commented 4 years ago

I filtered TPM data with sd at 0. So looking at the filtered TPM data with sd 0, there are cases where the TPM value is flat out zero across all columns. The software must be complaining about these. So I think that we can safely ignore these. There are 34,000 of these out of 190,000, so these must be transcripts that could not be detected. Thank you for all your help.

Nirad

EduEyras commented 4 years ago

great! no problem E.

On Thu, 19 Mar 2020 at 00:13, niradsp notifications@github.com wrote:

I filtered TPM data with sd at 0. So looking at the filtered TPM data with sd 0, there are cases where the TPM value is flat out zero across all columns. The software must be complaining about these. So I think that we can safely ignore these. There are 34,000 of these out of 190,000, so these must be transcripts that could not be detected. Thank you for all your help.

Nirad

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/76#issuecomment-600614716, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCZKB5KF6RM3M6BBZWILP3RIDCGPANCNFSM4KIIUZVA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ