comprna / SUPPA

SUPPA: Fast quantification of splicing and differential splicing
MIT License
259 stars 61 forks source link

ERROR:lib.tools:29755, in line 29756. Skipping line... #59

Closed ashleymaeconard closed 5 years ago

ashleymaeconard commented 5 years ago

Hi - I am interested to use SUPPA to do differential splicing analysis on drosophila melanogaster and I have been working through the tutorial. I am getting this error when I run

$ suppa.py psiPerEvent -i ../flybase_dm6_events/flybase_dm6.events.ioe -e ../merged/iso_tpm_merged.csv -o ee_events 2> errors.txt

$ head -30 errors.txt INFO:lib.tools:File merged/iso_tpm_merged_FBgn_only.txt opened in reading mode. INFO:psiCalculator:Buffering transcript expression levels. ERROR:lib.tools:1, in line 2. Skipping line... ERROR:lib.tools:2, in line 3. Skipping line... ERROR:lib.tools:3, in line 4. Skipping line... ERROR:lib.tools:4, in line 5. Skipping line... ERROR:lib.tools:5, in line 6. Skipping line... ERROR:lib.tools:6, in line 7. Skipping line... ERROR:lib.tools:7, in line 8. Skipping line... ERROR:lib.tools:8, in line 9. Skipping line... ... ERROR:lib.tools:27008, in line 27009. Skipping line... ERROR:lib.tools:27009, in line 27010. Skipping line... INFO:lib.tools:File merged/iso_tpm_merged_FBgn_only.txt closed. ERROR:psiCalculator:No expression values have been buffered. ERROR:psiCalculator:Unknown error: 1

The two files that are needed I generated using the tutorial. Namely, flybase_dm6.events.ioe $ head flybase_dm6.events.ioe seqname gene_id event_id alternative_transcripts total_transcripts 2L FBgn0031208 FBgn0031208;A3:2L:8116-8193:8116-8229:+ FBtr0300689,FBtr0300690 FBtr0330654,FBtr0300689,FBtr0300690 2L FBgn0031217 FBgn0031217;A3:2L:103434-103516:103434-103878:+ FBtr0078104 FBtr0078104,FBtr0330636 2L FBgn0031228 FBgn0031228;A3:2L:114991-155546:114991-155638:+ FBtr0330638 FBtr0330638,FBtr0330639 ...

iso_tpm_merged.csv 0-2_f_ci.1 0-2_f_ci.2 0-2_f_ci.3 0-2_f_ci.4 0-2_f_cc.1 0-2_f_cc.2 0-2_f_cc.3 0-2_f_cc.4 0-2_m_ci.1 0-2_m_ci.2 0-2_m_ci.3 0-2_m_ci.4 0-2_m_cc.1 0-2_m_cc.2 0-2_m_cc.3 0-2_m_cc.4 2-4_f_ci.1 2-4_f_ci.2 2-4_f_ci.3 2-4_f_ci.4 2-4_f_cc.1 2-4_f_cc.2 2-4_f_cc.3 2-4_f_cc.4 2-4_m_ci.1 2-4_m_ci.2 2-4_m_ci.3 2-4_m_ci.4 2-4_m_cc.1 2-4_m_cc.2 2-4_m_cc.3 2-4_m_cc.4 FBgn0261360 0.595611 0.382288 0.079516 0.49717700000000004 0.607855 0.519555 0.43858 0.7293810000000001 0.793188 0.46586400000000006 0.0 0.6440640000000001 0.359166 0.385799 0.372081 1.237212 1.2781870000000002 1.203019 0.6785399999999999 0.8769399999999999 1.256039 0.796237 0.797081 0.956275 0.46267700000000006 0.78455 1.2332319999999999 0.381865 0.6048939999999999 0.688028 0.662562 0.7031470000000001

Thank you in advance for helping me. I am really trying to make this work:(

EduEyras commented 5 years ago

Hi Ashley,

thanks for your message. Is your expression file tab-separated? That could be the issue. That's the first thing that springs to mind. Please, let me know if that works.

Thanks

Eduardo

ashleymaeconard commented 5 years ago

Yes, it is tab separated. I am sorry I wrote .csv but it is iso_tpm_merged.txt in the actual code I run. I see a couple hundred lines that are filled out but the majority are not, and are instead nan. I added examples on the post above of the files.

I am not sure how to proceed. I can send you the full output files to your email if that is useful.

ashleymaeconard commented 5 years ago

Is this a normal error message that you have seen before? I begin with NM_ IDs which I then convert into Flybase IDs (similarly to how you use ~/Rscript ~/scripts/format_Ensembl_ids.R iso_tpm.txt). There are Flybase IDs in the file (iso_tpm_merged above).

Now, for example when you see "ERROR:lib.tools:1, in line 2. Skipping line..." in the error message above, that is gene FBgn0261360 in the iso_tpm_merged file, which has multiple instances in the genes.gtf file because there are multiple exons. Is that the problem? $ grep FBgn0261360 genes.gtf 3L FlyBase UTR 19719527 19720443 . - . gene_biotype "protein_coding"; gene_id "FBgn0261360"; gene_name "CG42637"; gene_source "FlyBase"; gene_version "1"; p_id "P21983"; transcript_biotype "protein_coding"; transcript_id "FBtr0302279"; transcript_name "CG42637-RC"; transc ript_source "FlyBase"; transcript_version "1"; tss_id "TSS3435"; 3L FlyBase UTR 19719527 19720443 . - . gene_biotype "protein_coding"; gene_id "FBgn0261360"; gene_name "CG42637"; gene_source "FlyBase"; gene_version "1"; p_id "P21983"; transcript_biotype "protein_coding"; transcript_id "FBtr0302278"; transcript_name "CG42637-RB"; transc ript_source "FlyBase"; transcript_version "1"; tss_id "TSS3435"; 3L FlyBase UTR 19719527 19720443 . - . gene_biotype "protein_coding"; gene_id "FBgn0261360"; gene_name "CG42637"; gene_source "FlyBase"; gene_version "1"; p_id "P21983"; transcript_biotype "protein_coding"; transcript_id "FBtr0302277"; transcript_name "CG42637-RA"; transc ript_source "FlyBase"; transcript_version "1"; tss_id "TSS792"; 3L FlyBase exon 19719527 19722174 . - . exon_id "FBtr0302277-E18"; exon_number "19"; exon_version "1"; gene_biotype "protein_coding"; gene_id "FBgn0261360"; gene_name "CG42637"; gene_source "FlyBase"; gene_version "1"; p_id "P21983"; transcript_biotype "protein_coding"; tr anscript_id "FBtr0302279"; transcript_name "CG42637-RC"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS3435";

Another idea is that when converting from NM_ (refseq) to Flybase IDs, some of the gene level Flybase IDs (FBgn...) are redundant, which could have caused the problem. But this idea is incorrect because when I just have unique flybase IDs I still have the same type of error.

In fact I have tried the following versions of this suppa.py command. There are FBgn (gene ids) and FBtr (transcript ids). When converting from NM refseq IDs, there are also some NR refseq IDs which refer to the FBtr ids. I tried the following 4 combinations of input iso_tpm_merged files:

In each case I am getting the same types of errors, except at the end of the file for 2,3, and 4 I also get this error, which does not print an output file... so I have no idea what is happening.

ERROR:lib.tools:27007, in line 27008. Skipping line... ERROR:lib.tools:27008, in line 27009. Skipping line... ERROR:lib.tools:27009, in line 27010. Skipping line... INFO:lib.tools:File merged/iso_tpm_merged_FBgn_only_noDups.txt closed. ERROR:psiCalculator:No expression values have been buffered. ERROR:psiCalculator:Unknown error: 1

Thank you for your prompt response yesterday! I look forward to hearing from you.

EduEyras commented 5 years ago

Hi Ashley,

There might be some issues after the conversion. I know that in human different NM_ IDs (transcripts) could map to multiple Ensembl transcript and/or gene IDs. This is because RefSeq annotates molecules (RNA sequences), whereas Ensembl annotates loci, and multiple loci may code for the same molecule. I wonder whether a similar thing would happen with Flybase.

Another thing is that SUPPA only uses the "exon" lines from the GTF (or the "CDS" lines if you only wanted to calculated events in CDS regions). Having extra lines should not matter in any case.

If there is an ID missing from the expression file, it may throw an error but continue happily and provide an input. Do you get any input?

It might be worthwhile trying with an individual event to try to see what could be the problem.

I hope this helps

cheers

E.

On Mon, 16 Sep 2019 at 02:51, Ashley Conard notifications@github.com wrote:

Is this a normal error message that you have seen before? I begin with NM_ IDs which I then convert into Flybase IDs (similarly to how you use ~/Rscript ~/scripts/format_Ensembl_ids.R iso_tpm.txt). There are Flybase IDs in the file (iso_tpm_merged above).

Now, for example when you see "ERROR:lib.tools:1, in line 2. Skipping line..." in the error message above, that is gene FBgn0261360 in the iso_tpm_merged file, which has multiple instances in the genes.gtf file because there are multiple exons. Is that the problem? $ grep FBgn0261360 genes.gtf 3L FlyBase UTR 19719527 19720443 . - . gene_biotype "protein_coding"; gene_id "FBgn0261360"; gene_name "CG42637"; gene_source "FlyBase"; gene_version "1"; p_id "P21983"; transcript_biotype "protein_coding"; transcript_id "FBtr0302279"; transcript_name "CG42637-RC"; transc ript_source "FlyBase"; transcript_version "1"; tss_id "TSS3435"; 3L FlyBase UTR 19719527 19720443 . - . gene_biotype "protein_coding"; gene_id "FBgn0261360"; gene_name "CG42637"; gene_source "FlyBase"; gene_version "1"; p_id "P21983"; transcript_biotype "protein_coding"; transcript_id "FBtr0302278"; transcript_name "CG42637-RB"; transc ript_source "FlyBase"; transcript_version "1"; tss_id "TSS3435"; 3L FlyBase UTR 19719527 19720443 . - . gene_biotype "protein_coding"; gene_id "FBgn0261360"; gene_name "CG42637"; gene_source "FlyBase"; gene_version "1"; p_id "P21983"; transcript_biotype "protein_coding"; transcript_id "FBtr0302277"; transcript_name "CG42637-RA"; transc ript_source "FlyBase"; transcript_version "1"; tss_id "TSS792"; 3L FlyBase exon 19719527 19722174 . - . exon_id "FBtr0302277-E18"; exon_number "19"; exon_version "1"; gene_biotype "protein_coding"; gene_id "FBgn0261360"; gene_name "CG42637"; gene_source "FlyBase"; gene_version "1"; p_id "P21983"; transcript_biotype "protein_coding"; tr anscript_id "FBtr0302279"; transcript_name "CG42637-RC"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS3435";

Another idea is that when converting from NM_ (refseq) to Flybase IDs, some of the gene level Flybase IDs (FBgn...) are redundant, which could have caused the problem. But this idea is incorrect because when I just have unique flybase IDs I still have the same type of error.

In fact I have tried the following versions of this suppa.py command. There are FBgn (gene ids) and FBtr (transcript ids). When converting from NM refseq IDs, there are also some NR refseq IDs which refer to the FBtr ids. I tried the following 4 combinations of input iso_tpm_merged files:

  • FBgn IDs and FBtr IDs allowed
  • FBgn IDs and FBtr IDs allowed but no duplicate ids in rows (so only the unique IDs are used)
  • FBgn IDs only allowed
  • FBgn IDs and no duplicate ids in rows

In each case I am getting the same types of errors... so I have no idea what is happening.

Thank you for your prompt response yesterday! I look forward to hearing from you.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/59?email_source=notifications&email_token=ADCZKB7V7WNRFQUPODZQBXLQJZRX7A5CNFSM4IWYRELKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6XURDA#issuecomment-531581068, or mute the thread https://github.com/notifications/unsubscribe-auth/ADCZKB6MW2XA6CFOLPZRGILQJZRX7ANCNFSM4IWYRELA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

ashleymaeconard commented 5 years ago

In response to your comment "It might be worthwhile trying with an individual event to try to see what could be the problem." I have certainly used small examples to get at this issue.

Have you run SUPPA using drosophila melanogaster inputs (i.e. this .gtf file (ftp://ftp.flybase.org/releases/FB2019_04/precomputed_files/genes/fbgn_NAseq_Uniprot_fb_2019_04.tsv.gz) and the Flybase gene ids?

For each flybase gene id (FBgn ...) there are also several flybase transcript ids (FBtr ...).

Is the output from the events.ioe file normal? This is part of the output I shared in my original post. flybase_dm6.events.ioe $ head flybase_dm6.events.ioe seqname gene_id event_id alternative_transcripts total_transcripts 2L FBgn0031208 FBgn0031208;A3:2L:8116-8193:8116-8229:+ FBtr0300689,FBtr0300690 FBtr0330654,FBtr0300689,FBtr0300690 2L FBgn0031217 FBgn0031217;A3:2L:103434-103516:103434-103878:+ FBtr0078104 FBtr0078104,FBtr0330636 2L FBgn0031228 FBgn0031228;A3:2L:114991-155546:114991-155638:+ FBtr0330638 FBtr0330638,FBtr0330639

ashleymaeconard commented 5 years ago

I have decided to proceed even though there are these errors. I have several hundered events in the ee_events.psi file. What does this error mean for my input. I formatted my iso_tpm.txt file to look like yours (as I showed in my first post).

$ Rscript ../scripts/SUPPA-2.3/scripts/split_file.R iso_tpm_merged.txt 0-2_f_RNAi.1,0-2_f_RNAi.2,0-2_f_RNAi.3,0-2_f_RNAi.4 0-2_f_control.1,0-2_f_control.2,0-2_f_control.3,0-2_f_control.4 RNAi_iso.tpm control_iso.tpm -i [1] "Parsing samples..." [1] "Loading iso_tpm_merged.txt..." Error in [.data.frame(input_file, first_condition) : undefined columns selected Calls: [ -> [.data.frame Execution halted

The columns are named the same as in the list here and they are in the same order.

ashleymaeconard commented 5 years ago

I went into your code and fixed it. Thank you for nicely commented code!

One caution: you may want to change your tutorial to highlight that you need a filepath before the iso_tpm_formatted.txt file or change your code. Without a filepath now it does not run.

EduEyras commented 5 years ago

Hi,

yes, that looks fine

E.

On Tue, 17 Sep 2019 at 02:26, Ashley Conard notifications@github.com wrote:

In response to your comment "It might be worthwhile trying with an individual event to try to see what could be the problem." I have certainly used small examples to get at this issue. Have you run SUPPA using drosophila melanogaster inputs (i.e. this .gtf file ( ftp://ftp.flybase.org/releases/FB2019_04/precomputed_files/genes/fbgn_NAseq_Uniprot_fb_2019_04.tsv.gz) and the Flybase gene ids?)

For each flybase gene id (FBgn ...) there are also several flybase transcript ids (FBtr ...).

Is the output from the events.ioe file normal? This is part of the output I shared in my original post. flybase_dm6.events.ioe $ head flybase_dm6.events.ioe seqname gene_id event_id alternative_transcripts total_transcripts 2L FBgn0031208 FBgn0031208;A3:2L:8116-8193:8116-8229:+ FBtr0300689,FBtr0300690 FBtr0330654,FBtr0300689,FBtr0300690 2L FBgn0031217 FBgn0031217;A3:2L:103434-103516:103434-103878:+ FBtr0078104 FBtr0078104,FBtr0330636 2L FBgn0031228 FBgn0031228;A3:2L:114991-155546:114991-155638:+ FBtr0330638 FBtr0330638,FBtr0330639

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/59?email_source=notifications&email_token=ADCZKB5SUJ7QQDBENOWKZCLQJ6XVBA5CNFSM4IWYRELKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6ZW7EQ#issuecomment-531853202, or mute the thread https://github.com/notifications/unsubscribe-auth/ADCZKBZUROPSYV5PKIAJ4S3QJ6XVBANCNFSM4IWYRELA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

EduEyras commented 5 years ago

I cc Juan Luis, who wrote that R script. It may just be a formatting issue E.

On Tue, 17 Sep 2019 at 06:41, Ashley Conard notifications@github.com wrote:

I have decided to proceed even though there are these errors. I have several hundered events in the ee_events.psi file. What does this error mean for my input. I formatted my iso_tpm.txt file to look like yours (as I showed in my first post).

$ Rscript ../scripts/SUPPA-2.3/scripts/split_file.R iso_tpm_merged.txt 0-2_f_RNAi.1,0-2_f_RNAi.2,0-2_f_RNAi.3,0-2_f_RNAi.4 0-2_f_control.1,0-2_f_control.2,0-2_f_control.3,0-2_f_control.4 RNAi_iso.tpm control_iso.tpm -i [1] "Parsing samples..." [1] "Loading iso_tpm_merged.txt..." Error in [.data.frame(input_file, first_condition) : undefined columns selected Calls: [ -> [.data.frame Execution halted

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/59?email_source=notifications&email_token=ADCZKB3T4B7YIP4DRP4WMO3QJ7VP5A5CNFSM4IWYRELKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD62OP3Q#issuecomment-531949550, or mute the thread https://github.com/notifications/unsubscribe-auth/ADCZKB6CUK4DGSI2HHZXLPTQJ7VP5ANCNFSM4IWYRELA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

EduEyras commented 5 years ago

Thanks,

string2 is defined after string, if you use string and not string2, I wonder why string2 is then defined.

string <- unlist(strsplit(CHARACTER_command_args[1],"/")) string2 <- paste(string[-length(string)],collapse = "/") path1 <- paste0(string2,"/",CHARACTER_command_args[4]) write.table(first_output,file=path1,quote=FALSE,sep="\t") print(paste0("Saved ",path1))

Could it be something specific of the format of the IDs? I cc Juan Luis in case he can clarify.

Will that fix work with the examples in the wiki?

Thanks

E.

On Tue, 17 Sep 2019 at 06:55, Ashley Conard notifications@github.com wrote:

I went into your code and fixed it. Thank you for nicely commented code!

One error that you may choose to fix is that your naming for path1 should assign string and not string2 to the name

path1 <- paste0(string2,"/",CHARACTER_command_args[4]) write.table(first_output,file=path1,quote=FALSE,sep="\t") print(paste0("Saved ",path1))

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/59?email_source=notifications&email_token=ADCZKBZFH3SLXVM5KS6RWRLQJ7XDVA5CNFSM4IWYRELKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD62PWUA#issuecomment-531954512, or mute the thread https://github.com/notifications/unsubscribe-auth/ADCZKBZWBQ2BJJSH4UDZ7WTQJ7XDVANCNFSM4IWYRELA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

EduEyras commented 5 years ago

Hi Ashley,

some of the errors could be because there is expression value for some of the transcript IDs. Is that the case?

Did you also check that there are no versions in the IDs? (e.g. id3.1 vs id3.2)

cheers

E.

On Tue, 17 Sep 2019 at 06:41, Ashley Conard notifications@github.com wrote:

I have decided to proceed even though there are these errors. I have several hundered events in the ee_events.psi file. What does this error mean for my input. I formatted my iso_tpm.txt file to look like yours (as I showed in my first post).

$ Rscript ../scripts/SUPPA-2.3/scripts/split_file.R iso_tpm_merged.txt 0-2_f_RNAi.1,0-2_f_RNAi.2,0-2_f_RNAi.3,0-2_f_RNAi.4 0-2_f_control.1,0-2_f_control.2,0-2_f_control.3,0-2_f_control.4 RNAi_iso.tpm control_iso.tpm -i [1] "Parsing samples..." [1] "Loading iso_tpm_merged.txt..." Error in [.data.frame(input_file, first_condition) : undefined columns selected Calls: [ -> [.data.frame Execution halted

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/59?email_source=notifications&email_token=ADCZKB3T4B7YIP4DRP4WMO3QJ7VP5A5CNFSM4IWYRELKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD62OP3Q#issuecomment-531949550, or mute the thread https://github.com/notifications/unsubscribe-auth/ADCZKB6CUK4DGSI2HHZXLPTQJ7VP5ANCNFSM4IWYRELA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

ashleymaeconard commented 5 years ago

Hi again,

Yes the error is due to unmapped or unknown transcript IDs. Yes I had already removed the version IDs. In the end I only lose 0.005 of them so I moved on.

In the output from PSI - how am I supposed to map back IDs like this? FBgn0000259;A5:X:11793279-11793443:11792914-11793443:+ In the GTF file there are many possibilities: X FlyBase UTR 11792879 11793279 . + . gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P20174"; transcript_biotype "protein_coding"; transcript_id "FBtr0073560"; transcript_name "CkIIbeta-RC"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140"; X FlyBase UTR 11792879 11793279 . + . gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P18733"; transcript_biotype "protein_coding"; transcript_id "FBtr0333204"; transcript_name "CkIIbeta-RK"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140"; X FlyBase UTR 11792879 11793279 . + . gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P8682"; transcript_biotype "protein_coding"; transcript_id "FBtr0308084"; transcript_name "CkIIbeta-RI"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140"; X FlyBase exon 11792879 11793279 . + . exon_id "FBtr0073560-E1"; exon_number "1"; exon_version "1"; gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P20174"; transcript_biotype "protein_coding"; transcript_id "FBtr0073560"; transcript_name "CkIIbeta-RC"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140"; X FlyBase exon 11792879 11793279 . + . exon_id "FBtr0073560-E1"; exon_number "1"; exon_version "1"; gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P18733"; transcript_biotype "protein_coding"; transcript_id "FBtr0333204"; transcript_name "CkIIbeta-RK"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140"; X FlyBase exon 11792879 11793279 . + . exon_id "FBtr0073560-E1"; exon_number "1"; exon_version "1"; gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P8682"; transcript_biotype "protein_coding"; transcript_id "FBtr0308084"; transcript_name "CkIIbeta-RI"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140";

EduEyras commented 5 years ago

Hi Ashley,

SUPPA also produces a gtf that you can use for visualization purposes, where each id corresponds to an event. Is that what you meant?

E

On Sat, 21 Sep 2019 at 09:46, Ashley Conard notifications@github.com wrote:

Hi again,

Yes the error is due to unmapped or unknown transcript IDs. Yes I had already removed the version IDs. In the end I only lose 0.005 of them so I moved on.

In the output from PSI - how am I supposed to map back IDs like this? FBgn0000259;A5:X:11793279-11793443:11792914-11793443:+ In the GTF file there are many possibilities: X FlyBase UTR 11792879 11793279 . + . gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P20174"; transcript_biotype "protein_coding"; transcript_id "FBtr0073560"; transcript_name "CkIIbeta-RC"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140"; X FlyBase UTR 11792879 11793279 . + . gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P18733"; transcript_biotype "protein_coding"; transcript_id "FBtr0333204"; transcript_name "CkIIbeta-RK"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140"; X FlyBase UTR 11792879 11793279 . + . gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P8682"; transcript_biotype "protein_coding"; transcript_id "FBtr0308084"; transcript_name "CkIIbeta-RI"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140"; X FlyBase exon 11792879 11793279 . + . exon_id "FBtr0073560-E1"; exon_number "1"; exon_version "1"; gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P20174"; transcript_biotype "protein_coding"; transcript_id "FBtr0073560"; transcript_name "CkIIbeta-RC"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140"; X FlyBase exon 11792879 11793279 . + . exon_id "FBtr0073560-E1"; exon_number "1"; exon_version "1"; gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P18733"; transcript_biotype "protein_coding"; transcript_id "FBtr0333204"; transcript_name "CkIIbeta-RK"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140"; X FlyBase exon 11792879 11793279 . + . exon_id "FBtr0073560-E1"; exon_number "1"; exon_version "1"; gene_biotype "protein_coding"; gene_id "FBgn0000259"; gene_name "CkIIbeta"; gene_source "FlyBase"; gene_version "1"; p_id "P8682"; transcript_biotype "protein_coding"; transcript_id "FBtr0308084"; transcript_name "CkIIbeta-RI"; transcript_source "FlyBase"; transcript_version "1"; tss_id "TSS13140";

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/59?email_source=notifications&email_token=ADCZKB77JMMGWLINFZILUG3QKVOD5A5CNFSM4IWYRELKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7IFBFA#issuecomment-533745812, or mute the thread https://github.com/notifications/unsubscribe-auth/ADCZKB7F2ESDRNMZMGTQCM3QKVOD5ANCNFSM4IWYRELA .

--

ashleymaeconard commented 5 years ago

Hi! Thank you so much for your quick response.

Indeed - In the tutorial (https://github.com/comprna/SUPPA/wiki/SUPPA2-tutorial) the file where I extracted this event (FBgn0000259;A5:X:11793279-11793443:11792914-11793443:+) is the file TRA2_diffSplice.dpsi, which as you write "gives the DeltaPSI as the difference of the mean PSI between conditions, and the p_value of this difference." I have the same file for my results. The gtf file I am referring to is equivalent to this one in your tutorial: ~/annotation/Homo_sapiens.GRCh37.75.formatted.gtf.

I have been using my analysis version of TRA2_diffSplice.dpsi for downstream analysis plots for deltaPSI. I really like how the ID has the type of splicing event it is (e.g. A5, A3, SE, ...) but I want to now add a transcript ID that is interpretable by my collaborators (biologists) and they need the gene and ideally the transcript ID. I realize there might not be a 1-1 mapping because the splicing events are sometimes unique (such as the example here). So with that in mind what is your suggestion for the best way to add a transcript ID? At the moment I just pick the one where the start and end overlap the most of all the transcript ID options (an example above where I have to choose 1 event and map to 1 of 6 possible FBtr... IDs.

Thank you! So far our results make sense and our collaborators are happy to have both DTU and PSI values as output.

EduEyras commented 5 years ago

Hi Ashley,

I think I now understand what you mean. From the .ioe file, you can read for each event the transcript ID (or IDs) that would correspond to the inclusion of the event. That is, the last two columns have the (comma separated) transcript IDs that describe the inclusion form, and the last column corresponds to the (comma separated) IDs that participate in the event, i.e. either inclusion or exclusion. That should help you to select the transcripts of interest. Generally, there is one that is most abundant that contributes to the event. If the gene has many transcripts, this may be more spread across transcripts, and you may need to take several of them. I hope this helps Best

Eduardo

On Sun, 22 Sep 2019 at 01:37, Ashley Conard notifications@github.com wrote:

Hi! Thank you so much for your quick response.

Indeed - In the tutorial ( https://github.com/comprna/SUPPA/wiki/SUPPA2-tutorial) the file where I extracted this event is the file TRA2_diffSplice.dpsi, which as you write "gives the DeltaPSI as the difference of the mean PSI between conditions, and the p_value of this difference." I have the same file for my results.

Is this the file you are referring to as the gtf used to visualize results? This is the file I have been using to produce downstream analysis plots for deltaPSI. I really like how the ID has the type of splicing event it is (e.g. A5, A3, SE, ...) but I want to now add a transcript ID that is interpretable by my collaborators (biologists) and they need the gene and ideally the transcript ID. I realize there might not be a 1-1 mapping because the splicing events are sometimes unique (such as the example here). So with that in mind what is your suggestion for the best way to add a transcript ID? At the moment I just pick the one where the start and end overlap the most of all the transcript ID options (an example above where I have to choose 1 event and map to 1 of 6 possible FBtr... IDs.

Thank you! So far our results make sense and our collaborators are happy to have both DTU and PSI values as output.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/comprna/SUPPA/issues/59?email_source=notifications&email_token=ADCZKB7M2UBOVCO35KSMWK3QKY5UNA5CNFSM4IWYRELKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7IUBUQ#issuecomment-533807314, or mute the thread https://github.com/notifications/unsubscribe-auth/ADCZKBZ7HE3A5VJUIRLZRITQKY5UNANCNFSM4IWYRELA .

-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ