comprna / SUPPA

SUPPA: Fast quantification of splicing and differential splicing
MIT License

diffSplice returns more condition pairings than expected, with extensions _x, _y #97

Closed victorperis closed 4 years ago

victorperis commented 4 years ago

Hi, I was running the diffSplice function on some data grouped into 4 conditions (A,B,C,D for simplicity), and the .dpsi output files have more columns than you would expect. Instead of something like:

A_B_dPSI A_B_p-val A_C_dPSI A_C_p-val A_D_dPSI A_D_p-val B_C_dPSI B_C_p-val B_D_dPSI B_D_p-val C_D_dPSI C_D_p-val

the output columns might be something like:

A_B_dPSI_x A_B_p-val_x A_B_dPSI_y A_B_p-val_y A_B_dPSI A_B_p-val A_C_dPSI A_C_p-val A_D_dPSI A_D_p-val B_C_dPSI B_C_p-val B_D_dPSI B_D_p-val C_D_dPSI C_D_p-val

The dPSI values and p-values differ between these extra columns; they are not copies of each other.

I have also noticed that this happens only for some event types. For AL, MX and RI events you get only the expected columns. For all other event types, though, you get these extra columns (for different condition pairings each time), along with many extra rows for events that do not belong to the type being analysed. For example, when analysing A3 events, and even though the .ioe file passed contains only A3 events, the output dPSI file looks like this (first 5 lines):

african-european_dPSI african-european_p-val african-east_asian_dPSI_x african-east_asian_p-val_x african-east_asian_dPSI_y african-east_asian_p-val_y european-south_asian_dPSI_x european-south_asian_p-val_x european-south_asian_dPSI_y european-south_asian_p-val_y european-east_asian_dPSI_x european-east_asian_p-val_x european-east_asian_dPSI_y european-east_asian_p-val_y south_asian-east_asian_dPSI south_asian-east_asian_p-val
ENSG00000000003;AF:X:100635746-100636191:100636689:100635746-100636793:100637104:- nan nan nan nan nan 1.0 nan nan nan 1.0 nan nan nan 1.0 nan nan
ENSG00000000003;SE:X:100630866-100632485:100632568-100633405:- 0.0148107549 0.25874125870000003 nan nan nan nan nan nan nan nan nan nan nan nan nan nan
ENSG00000000419;A3:20:50940955-50941105:50940933-50941105:- nan nan nan 1.0 nan nan nan 1.0 nan nan nan 1.0 nan nan nan 1.0
ENSG00000000419;A3:20:50940955-50942031:50940933-50942031:- nan nan -0.0200415926 0.2782217782 nan nan -0.0157351289 0.2812187812 nan nan -0.026643510099999997 0.2562437562 nan nan -0.010908381200000001 0.29620379620000004
ENSG00000000419;SE:20:50940933-50941105:50941209-50942031:- -0.0236211806 0.2447552448 nan nan nan nan nan nan nan nan nan nan nan nan nan nan

Notice both the extra columns and the events that are not A3.
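For readers wondering where `_x` and `_y` can come from: these are the default suffixes that `pandas.DataFrame.merge` appends when both tables carry columns with the same name. The sketch below is only an illustration of that pandas behaviour; whether SUPPA reaches this code path is an assumption, and the event IDs and values are made up. An outer merge of an A3 table with an SE table that share a comparison label yields suffixed columns and extra rows padded with NaN, similar to the output above.

```python
import pandas as pd

# Two hypothetical per-event-type result tables with identical column names.
a3 = pd.DataFrame(
    {"A-B_dPSI": [-0.02], "A-B_p-val": [0.28]},
    index=["GENE1;A3:chr1:100-200:90-200:-"],
)
se = pd.DataFrame(
    {"A-B_dPSI": [0.015], "A-B_p-val": [0.26]},
    index=["GENE1;SE:chr1:100-200:300-400:-"],
)

# Outer merge on the index: overlapping column names get the default
# suffixes "_x" (left table) and "_y" (right table), and rows missing
# from one side are filled with NaN.
merged = pd.merge(a3, se, left_index=True, right_index=True, how="outer")

print(list(merged.columns))
# ['A-B_dPSI_x', 'A-B_p-val_x', 'A-B_dPSI_y', 'A-B_p-val_y']
print(merged)
```

Each row carries values only in the columns from its own table, which matches the pattern of alternating `nan` blocks in the rows shown above.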

Is this normal output? Is it to be expected?

Any help will be welcome, thanks Víctor

EduEyras commented 4 years ago

Hi Victor,

thanks for your message.

That behaviour is unexpected because suppa cannot produce event types that are not already in the .ioe file. Also, the comparisons and the labels are extracted directly from the input files, so the labels you show are unexpected as well.
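A quick way to confirm whether an output file shows this unexpected behaviour is to scan it for suffixed columns and for event types other than the one requested. The helper below is a minimal sketch, not part of SUPPA: `check_dpsi` is a hypothetical name, and it assumes a whitespace-separated file whose first line holds the column names and whose rows start with an event ID of the form `<gene>;<type>:<coordinates>`, as in the rows quoted above.

```python
def check_dpsi(lines, expected_event_type):
    """Return (suffixed_columns, foreign_event_ids) for a .dpsi-like table.

    `lines` is any iterable of text lines (for example an open file).
    """
    lines = iter(lines)
    header = next(lines).split()
    # Columns ending in _x / _y hint at an unintended merge of two tables.
    suffixed = [c for c in header if c.endswith(("_x", "_y"))]

    foreign = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        event_id = fields[0]
        # Assumed ID layout: <gene>;<type>:<coordinates...>
        _, _, rest = event_id.partition(";")
        event_type = rest.split(":", 1)[0]
        if event_type != expected_event_type:
            foreign.append(event_id)
    return suffixed, foreign


# Illustrative input: IDs taken from the thread, column labels shortened.
sample = [
    "A-B_dPSI_x A-B_p-val_x A-B_dPSI_y A-B_p-val_y",
    "ENSG00000000003;SE:X:100630866-100632485:100632568-100633405:- nan nan nan nan",
    "ENSG00000000419;A3:20:50940955-50941105:50940933-50941105:- nan 1.0 nan nan",
]
suffixed, foreign = check_dpsi(sample, "A3")
print(suffixed)
print(foreign)
```

With a real file, the call would be `check_dpsi(open(path), "A3")`; two empty lists mean neither symptom is present.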

Could you provide more info on the commands that you're running and the headers of your input files?

Thanks

Eduardo

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

victorperis commented 4 years ago

I have rerun diffSplice for each type of event, one at a time, and this seems to have solved the problem. I was running the command as follows:

python3 suppa.py diffSplice \
    -m empirical \
    --psi $disk/$path/$subpath/african.placenta.RI.african.psi \
          $disk/$path/$subpath/european.placenta.RI.european.psi \
          $disk/$path/$subpath/south_asian.placenta.RI.south_asian.psi \
          $disk/$path/$subpath/east_asian.placenta.RI.east_asian.psi \
    --tpm $disk/$path/$subpath/african.placenta.african.transcript.TPM.tpm \
          $disk/$path/$subpath/european.placenta.european.transcript.TPM.tpm \
          $disk/$path/$subpath/south_asian.placenta.south_asian.transcript.TPM.tpm \
          $disk/$path/$subpath/east_asian.placenta.east_asian.transcript.TPM.tpm \
    --input $disk/$path/05_SUPPA_event_generation/event_list_exons_RI_strict.ioe \
    --save_tpm_events \
    --combination \
    --output placenta.RI.diffSplice

I changed the event type from RI to each of the other types as needed. I think the problem came from running the above command for more than one event type in parallel: waiting for diffSplice to finish with one event type before starting the next solved it. Could it be that diffSplice uses temporary files with a generic name (not dependent on the event type being analysed), so that when you run diffSplice on several event types at once, the runs access each other's files and data from one event type 'contaminates' another?
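One way to apply this workaround in a script is to run the event types one after another, each in its own working directory. The sketch below assumes that any temporary files are written relative to the working directory, which the thread does not confirm; `diffsplice_cmd` and `run_isolated` are hypothetical helper names, and only flags from the command above are used. Because each run changes directory, the path to `suppa.py` and all input paths would need to be absolute in practice.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def diffsplice_cmd(event, psi_files, tpm_files, ioe_file, out_prefix):
    """Assemble the diffSplice call from the thread for one event type."""
    return [
        "python3", "suppa.py", "diffSplice",
        "-m", "empirical",
        "--psi", *psi_files,
        "--tpm", *tpm_files,
        "--input", ioe_file,
        "--save_tpm_events",
        "--combination",
        "--output", f"{out_prefix}.{event}.diffSplice",
    ]


def run_isolated(commands, base_dir):
    """Run each command to completion, in order, in its own directory.

    `commands` maps a label (e.g. the event type) to an argument list.
    Returns a dict mapping each label to the directory it ran in.
    """
    base = Path(base_dir)
    workdirs = {}
    for label, cmd in commands.items():
        workdir = base / label
        workdir.mkdir(parents=True, exist_ok=True)
        # check=True stops the loop if a run fails; run() blocks until
        # the command finishes, so runs never overlap.
        subprocess.run(cmd, cwd=workdir, check=True)
        workdirs[label] = workdir
    return workdirs


def fake_cmd(event):
    """Stand-in for suppa.py: writes a scratch file with a fixed name."""
    code = f"open('scratch.txt', 'w').write('{event}')"
    return [sys.executable, "-c", code]


# Demo with the stand-in command: the fixed file name does not collide
# because every event type gets its own directory.
with tempfile.TemporaryDirectory() as base:
    dirs = run_isolated({e: fake_cmd(e) for e in ("A3", "SE")}, base)
    contents = {e: (d / "scratch.txt").read_text() for e, d in dirs.items()}
print(contents)
```

For a real run, the dictionary passed to `run_isolated` would be built with `diffsplice_cmd(...)` for each event type instead of `fake_cmd`.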

Thanks, Víctor

EduEyras commented 4 years ago

Hi Victor,

Yes, diffSplice will keep some calculations in temporary files that are afterward concatenated.

I'm cc'ing JC, who can tell you more specifically whether running things in parallel may affect the analysis.

best

E.
