comprna / SUPPA

SUPPA: Fast quantification of splicing and differential splicing
MIT License

Multiple Comparisons and Clustering #113

Closed danphillips28 closed 2 years ago

danphillips28 commented 3 years ago

Dear SUPPA2 developers,

I am trying to use SUPPA to compare multiple groups to the same control group, i.e.:

Group1 - Control
Group2 - Control
Group3 - Control
Group4 - Control
etc...

Ideally, I would like to do this all at once, i.e. in a single diffSplice run. However, if I understand correctly (and this is my question: is my understanding correct?), the format of the .tpm and (event).psi inputs limits me to the following options, neither of which is quite what I need (using .psi to illustrate):

1.         Control Group1 Group2 Group3 Group4
   event1  ## This will test Group1 - Control, Group2 - Group1, ...
   event2  ## or with --combination
   event3  ## Round-robin comparisons, including e.g. Group1 - Group3

2.         Control Group1 Control Group2
   event1  ## This will test Group1 - Control, Control - Group1, ...
   event2  ## or with --combination
   event3  ## Round-robin, including e.g. Control - Control, Control - Group1...

Is there a way to do pairwise tests as listed at the start of my question? I have considered doing e.g. Method [1] above with --combination, and then removing columns such that only the ones I want remain, but this seems a little "hacky", and I anticipate might cause issues downstream.

I know I can do these pairwise tests individually to avoid this issue, but I want the results in single dpsi and psivec files, so that I can cluster events across:

Control Group1 Group2 Group3 ...

FYI: Groups are not strictly time-series, hence why I don't want to test e.g. Group3-Group2-Group1, but they represent gradations of a treatment, so it would be valuable to look for clustering trends across the levels.

Please let me know if there is any solution to this. Apologies if I have overlooked something glaringly obvious - I am a beginner!

Thank you, Daniel

EduEyras commented 3 years ago

Hi,

thanks for your email.

It's true that sometimes you want to do a comparison to a common condition. Although we did that in other methods, it is not implemented like that in SUPPA.

The combination flag will perform all comparisons, including the ones you're after. Alternatively, if you do the comparisons separately, you will only need to join together the results. I think the latter will be simpler and easier to troubleshoot.

The linking is easy because all files will have the same set of event ids, so you can merge them by ids. And then you can use this as psivec for clustering.
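For readers who want to script the merge described above, here is a minimal sketch using only the Python standard library. It assumes each per-comparison result is a tab-separated table whose first field in every data row is the event ID, and whose header row may lack a label for that ID column. The function names `read_table` and `merge_by_event_id` and the sample column names are hypothetical, not part of SUPPA.

```python
import csv
import io

def read_table(text):
    """Parse a tab-separated table into (column_names, {event_id: values}).

    Assumption: the header row may omit the event-ID column, so it can have
    one fewer field than the data rows.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, data = rows[0], rows[1:]
    if data and len(header) == len(data[0]):
        header = header[1:]  # header carries an ID label; drop it
    return header, {r[0]: r[1:] for r in data}

def merge_by_event_id(tables, fill="NA"):
    """Join parsed tables on event ID, padding absent events with `fill`."""
    all_ids = sorted(set().union(*(values.keys() for _, values in tables)))
    header = [name for cols, _ in tables for name in cols]
    merged = {}
    for event_id in all_ids:
        row = []
        for cols, values in tables:
            row.extend(values.get(event_id, [fill] * len(cols)))
        merged[event_id] = row
    return header, merged

# Hypothetical toy inputs: two psivec-like tables from separate runs.
run1 = "Group1_1\tGroup1_2\nev1\t0.1\t0.2\nev2\t0.5\t0.6\n"
run2 = "Group2_1\nev1\t0.9\n"
header, merged = merge_by_event_id([read_table(run1), read_table(run2)])
print("\t".join(header))
for event_id, row in merged.items():
    print("\t".join([event_id] + row))
```

If every run was computed from the same event file, all tables share the same IDs and the padding never triggers. Control columns that appear in more than one run would be repeated in the merged table and may need to be de-duplicated by hand.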

I hope this helps

Eduardo


danphillips28 commented 3 years ago

Thanks Eduardo, I'll give that a go.

The psivec part makes sense, because the files have the same number of rows. However, it seems clustering also requires a single dpsi file, and my dpsi files have a different number of rows for each pairwise comparison. I am at a loss as to how to get around this so that I can cluster across my conditions. Any suggestions on this dpsi part?

Thanks again, Dan

EduEyras commented 3 years ago

Hi Dan,

thanks for the email.

I guess those cases are missing because you filtered them out after the deltaPSI calculation?

You only need to have the same event ids, so you can fill the missing values with NAs, or with values that will never be picked up, like p-value = 1, deltaPSI = 0.

The dpsi file is only used to select which events to include in the clustering: an event is included if it passes the deltaPSI and p-value thresholds between at least one pair of conditions. You can also vary those thresholds to include fewer or more events.
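As an illustration of the two points above, the sketch below pads missing events with the neutral values deltaPSI = 0 and p-value = 1, then selects events that pass thresholds in at least one comparison. The data layout, the function names, and the threshold values 0.1 and 0.05 are assumptions made for this example. This is not SUPPA's own selection code.

```python
# Neutral (deltaPSI, p-value) pair that will not pass any sensible threshold.
NEUTRAL = (0.0, 1.0)

def fill_missing(results, event_ids):
    """Give every comparison an entry for every event ID.

    results: {comparison_name: {event_id: (dpsi, pval)}}
    """
    return {
        comparison: {ev: values.get(ev, NEUTRAL) for ev in event_ids}
        for comparison, values in results.items()
    }

def select_events(filled, min_abs_dpsi=0.1, max_pval=0.05):
    """Return events passing both thresholds in at least one comparison."""
    selected = set()
    for values in filled.values():
        for ev, (dpsi, pval) in values.items():
            if abs(dpsi) >= min_abs_dpsi and pval <= max_pval:
                selected.add(ev)
    return sorted(selected)

# Hypothetical toy results from two separate pairwise runs.
results = {
    "Group1-Control": {"ev1": (0.30, 0.01), "ev2": (0.02, 0.80)},
    "Group2-Control": {"ev1": (0.05, 0.40)},  # ev2 and ev3 were filtered out
}
event_ids = ["ev1", "ev2", "ev3"]
filled = fill_missing(results, event_ids)
print(select_events(filled))  # prints ['ev1']
```

Loosening the thresholds admits more events into the clustering and tightening them admits fewer, which is the knob mentioned above.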

I hope this helps

Eduardo


danphillips28 commented 3 years ago

Hi Eduardo,

Long story short - I got suppa2 to accept my reformatted input files and cluster my results!

We discussed combining the files by event ID, and then making some modifications to account for the differences in the dpsi files for each of my separate pairwise comparisons (adding NAs, etc.). Due to my inexperience with coding, this all felt slightly too much for me. For anyone reading in the future who might want to do something similar:

I decided to rerun diffSplice using the --combination flag, with my conditions ordered condition1 condition2 .... control. I moved my dpsi file into Excel, where I removed the columns for the unwanted tests (e.g. condition1 vs condition2). [I did this in Excel because I noticed that in SUPPA2 outputs the header above the first column, which is the event ID column, is actually the header for the first pairwise comparison, i.e. the headers are shifted to the left by one column. This stopped me using something simple like cut -f.] I passed this modified dpsi file, together with the unaltered psivec file from the --combination run, to cluster and it worked. What a relief!
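For anyone who would rather script the column removal than use Excel, here is a minimal Python sketch that copes with the shifted header. It assumes the header row has no entry for the event ID column, and that each comparison contributes columns named like `<comparison>_dPSI` and `<comparison>_p-val`. Both the naming pattern and the function name `keep_comparisons` are assumptions for illustration, so check them against your own file.

```python
def keep_comparisons(lines, wanted):
    """Keep only the columns that belong to the wanted comparisons.

    lines:  tab-separated rows; the first is a header WITHOUT an event-ID
            entry, so header field i labels data field i + 1.
    wanted: set of comparison labels, e.g. {"Group1-Control"}.
    """
    header = lines[0].rstrip("\n").split("\t")
    # Strip the trailing "_dPSI" / "_p-val" suffix to get the comparison label.
    keep = [i for i, name in enumerate(header)
            if name.rsplit("_", 1)[0] in wanted]
    out = ["\t".join(header[i] for i in keep)]
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        event_id, values = fields[0], fields[1:]
        out.append("\t".join([event_id] + [values[i] for i in keep]))
    return out

# Hypothetical toy dpsi content with two comparisons.
dpsi = [
    "G1-G2_dPSI\tG1-G2_p-val\tG1-Ctrl_dPSI\tG1-Ctrl_p-val",
    "ev1\t0.30\t0.01\t0.20\t0.04",
    "ev2\t0.00\t1.00\t-0.15\t0.03",
]
for row in keep_comparisons(dpsi, {"G1-Ctrl"}):
    print(row)
```

The output keeps the same shifted-header layout as the input, so it should have the same shape as the hand-edited Excel file described above.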

I'd like to finish by sneaking in a quick question about silhouette scores. I know that closer to 1 is best, and in the SUPPA2 tutorial you report a score of just under 0.6 as well-differentiated. How low can one reasonably go? Tweaking the clustering parameters makes it clear that there is a trade-off between the silhouette score and the number of events in clusters. Many of my parameter settings leave me with a good number of events (50-100ish) in a fair number of clusters (5ish), but a silhouette score of around 0.4. Is this acceptable? It is more likely to allow me to continue on to motif analysis, but would fewer events (20-30) in fewer clusters (2-3) with a much higher silhouette score (e.g. 0.75-0.8) generally be preferable? Your perspective here would be appreciated. I have looked at papers referencing SUPPA2 but very rarely find information about clustering.

Thanks again, Daniel

EduEyras commented 3 years ago

Hi Daniel,

I'm glad that it worked!

About clustering, there is no clear-cut answer. It is rather a matter of building evidence for your derived clusters.

With ideal data it might be possible to find an optimal configuration, but PSI data is quite noisy and the silhouette score may vary a lot with the number of clusters.

A silhouette score of 0.8 is clearly better than 0.4, but as you saw, it may produce fewer and smaller clusters. Bigger clusters might be more useful for biological interpretation down the line, with motif enrichment, pathway enrichment, etc. On the other hand, having more clusters (with a lower score) might mean that you're over-splitting, and that perhaps two of those clusters are very similar to each other and belong to the same class.
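To make the score itself concrete, here is a minimal sketch of the standard silhouette definition with Euclidean distance, written in plain Python (3.8+ for `math.dist`). For each point, a is the mean distance to the other members of its own cluster, b is the lowest mean distance to any other cluster, and the point's score is (b - a) / max(a, b). This only illustrates what the number measures. It is not SUPPA's clustering code, and the toy points are made up.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two tight, well-separated groups of made-up 2D points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # high: clusters match the groups
print(silhouette_score(points, [0, 1, 0, 1]))  # negative: clusters cut across them
```

Because the score is an average over events, adding noisier events to the clustering tends to pull it down, which matches the trade-off described in this thread.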

My suggestion would be to pick two or three configurations and run further analyses on those clusters (e.g. PSI profiles, enriched motifs, gene pathway analysis) to see if they give you consistent answers and to estimate what a reliable number of clusters would be.

I hope this helps

cheers

Eduardo
