davidaknowles / leafcutter

Annotation-free quantification of RNA splicing. Yang I. Li, David A. Knowles, Jack Humphrey, Alvaro N. Barbeira, Scott P. Dickinson, Hae Kyung Im, Jonathan K. Pritchard
http://davidaknowles.github.io/leafcutter/
Apache License 2.0
208 stars 115 forks source link

[help] leafviz gene name annotation #98

Closed jungminchoilab closed 5 years ago

jungminchoilab commented 5 years ago

Dear developers,

Thanks for sharing this software! I was exploring LeafViz and noticed that I cannot get gene name displayed properly... (please see below).

image

I also checked the *significance.txt file and it is annotated fine.

vpn172022154153:leafcutter jungminchoi$ head leafcutter_ds_cluster_significance.txt
cluster status  loglr   df  p   p.adjust    genes
chr1_KI270706v1_random:clu_18729_NA Not enough valid samples    NA  NA  NA  NA  NA
chr1:clu_1565_NA    <=1 sample with coverage>min_coverage   NA  NA  NA  NA  WASH7P
chr1:clu_1566_NA    Success 6.60150581222533    2   0.00135832112024933 0.0673325185026096  RP11-206L10.2,RP11-206L10.8
chr1:clu_1567_NA    Not enough valid samples    NA  NA  NA  NA  RP11-206L10.2
chr1:clu_1568_NA    Success 0.11468855232124    1   0.631986081094336   0.805375740752997   LINC01128
chr1:clu_1569_NA    Success 4.79372357199281    4   0.0479810862042172  0.276247488541036   LINC01128
chr1:clu_1570_NA    Success 0.0342841268193297  1   0.793432921115082   0.892415183369777   NOC2L
chr1:clu_1571_NA    Success 0.414883379105731   1   0.362339981964399   0.633420992758252   NOC2L
chr1:clu_1572_NA    Success 0.0550435978989299  1   0.740044886476189   0.864967197334483   NOC2L

I prepared the RData file as below.

vpn172022154153:leafcutter jungminchoi$ /Users/jungminchoi/Programs/leafcutter/leafviz/prepare_results.R --meta_data_file groups_file.txt --code leafcutter ec_perind_numers.counts.gz leafcutter_ds_cluster_significance.txt leafcutter_ds_effect_sizes.txt /Users/jungminchoi/Programs/leafcutter/leafviz/annotation_codes/gencode_hg38/gencode_hg38
Warning message:
package ‘optparse’ was built under R version 3.5.2 
Loading required package: leafcutter
Loading required package: Rcpp
Warning message:
package ‘data.table’ was built under R version 3.5.2 
Warning message:
package ‘stringr’ was built under R version 3.5.2 

Attaching package: ‘dplyr’

The following objects are masked from ‘package:data.table’:

    between, first, last

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Warning message:
package ‘dplyr’ was built under R version 3.5.2 
Preparing for visualisation
Results to be saved in: leafviz.RData 
Using annotation at: /Users/jungminchoi/Programs/leafcutter/leafviz/annotation_codes/gencode_hg38/gencode_hg38 
Loading counts from ec_perind_numers.counts.gz 
Loading metadata from groups_file.txt 
Loading exons from /Users/jungminchoi/Programs/leafcutter/leafviz/annotation_codes/gencode_hg38/gencode_hg38_all_exons.txt.gz 
Taking input= as a system command ('zless /Users/jungminchoi/Programs/leafcutter/leafviz/annotation_codes/gencode_hg38/gencode_hg38_all_exons.txt.gz') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure envionment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.
Taking input= as a system command ('zcat < /Users/jungminchoi/Programs/leafcutter/leafviz/annotation_codes/gencode_hg38/gencode_hg38_all_introns.bed.gz') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure envionment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.
Taking input= as a system command ('zcat < /Users/jungminchoi/Programs/leafcutter/leafviz/annotation_codes/gencode_hg38/gencode_hg38_threeprime.bed.gz') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure envionment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.
Taking input= as a system command ('zcat < /Users/jungminchoi/Programs/leafcutter/leafviz/annotation_codes/gencode_hg38/gencode_hg38_fiveprime.bed.gz') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure envionment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.
[1] "Annotating junctions"
[1] "Preparing results"
[1] "converting counts to ratios"
`mutate_all()` ignored the following grouping variables:
Column `clu`
Use `mutate_at(df, vars(-group_cols()), myoperation)` to silence the message.
Warning message:
funs() is soft deprecated as of dplyr 0.8.0
please use list() instead

# Before:
funs(name = f(.)

# After: 
list(name = ~f(.))
This warning is displayed once per session. 
[1] "creating PCA"

and visualized as below

vpn172022154153:leafviz jungminchoi$ ./run_leafviz.R ~/Google\ Drive/EC/rnaseq/leafcutter/leafviz.RData 

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Warning message:
package ‘dplyr’ was built under R version 3.5.2 

Attaching package: ‘DT’

The following objects are masked from ‘package:shiny’:

    dataTableOutput, renderDataTable

Attaching package: ‘gridExtra’

The following object is masked from ‘package:dplyr’:

    combine

Warning message:
package ‘optparse’ was built under R version 3.5.2 
[1] "Loading results from /Users/jungminchoi/Google Drive/EC/rnaseq/leafcutter/leafviz.RData"

Listening on http://127.0.0.1:7044

Attaching package: ‘shinyjs’

The following object is masked from ‘package:intervals’:

    show

The following object is masked from ‘package:Rcpp’:

    show

The following object is masked from ‘package:shiny’:

    runExample

The following objects are masked from ‘package:methods’:

    removeClass, show

Warning in min(exons$start) :
  no non-missing arguments to min; returning Inf
Warning in max(exons$end) :
  no non-missing arguments to max; returning -Inf
TableGrob (2 x 1) "arrange": 2 grobs
  z     cells    name           grob
1 1 (1-1,1-1) arrange gtable[layout]
2 2 (2-2,1-1) arrange gtable[layout]
Warning: Error in make_gene_plot: length(unique(exons$chr)) == 1 is not TRUE
  174: <Anonymous>

Any of your inputs will be appreciated. Looking forward to hearing from you soon and please let me know if you need other information from me.

Best, Jungmin

jackhump commented 5 years ago

After discussing privately, it would appear that this users bug originated from using HISAT to align to the data. Switching to STAR fixed this.

afadda91 commented 4 years ago

i have the same issue. i'm using star ERROR: length(unique(exons$chr)) == 1 is not TRUE

jackhump commented 4 years ago

Hello! Can you elaborate on how you prepared the leafviz RData file like the user above did?

A lot of leafviz errors arise from mismatches between the RNA-seq BAM files provided by the user and the GENCODE annotation files provided by us. Can you verify that your BAM files are aligned to the same genome build as the annotation files (eg hg38 or hg19) and that the chromosome naming scheme is [chr1,chr2,chr3,...] rather than [1,2,3] ?

afadda91 commented 4 years ago

Hi Jack, the bam files were generated in a two step STAR run, using hg38. I guess you will have to take my word for it, since it was done in a script with many variables.

then converting bams to junc: image

clustering: image

differential splicing image

visualization image an example of the plot generated here: Screen Shot 2020-01-12 at 5 42 51 PM

visualize with LeafViz image

this is the result: Screen Shot 2020-01-12 at 5 47 54 PM

you can see that while a gene name appeared in the first shot, no gene names showed up by LeafViz.

thanks

jackhump commented 4 years ago

Curious. Can you paste over the first 10 lines of the perind_numers.counts.gz and the gencode_hg38.exons.txt files inside the annotation_codes/gencode_hg38 / folder please?

afadda91 commented 4 years ago

sure.. gencode_hg38_all_exons.txt chr start end strand gene_name chr1 11869 12227 + processed_transcript chr1 12613 12721 + processed_transcript chr1 13221 14409 + processed_transcript chr1 12010 12057 + transcribed_unprocessed_pseudogene chr1 12179 12227 + transcribed_unprocessed_pseudogene chr1 12613 12697 + transcribed_unprocessed_pseudogene chr1 12975 13052 + transcribed_unprocessed_pseudogene chr1 13221 13374 + transcribed_unprocessed_pseudogene chr1 13453 13670 + transcribed_unprocessed_pseudogene chr1 29534 29570 - unprocessed_pseudogene chr1 24738 24891 - unprocessed_pseudogene chr1 18268 18366 - unprocessed_pseudogene chr1 17915 18061 - unprocessed_pseudogene chr1 17606 17742 - unprocessed_pseudogene chr1 17233 17368 - unprocessed_pseudogene chr1 16858 17055 - unprocessed_pseudogene chr1 16607 16765 - unprocessed_pseudogene chr1 15796 15947 - unprocessed_pseudogene chr1 15005 15038 - unprocessed_pseudogene chr1 14404 14501 - unprocessed_pseudogene

perind_numers.counts.gz 0700356301 0700356291 0700356331 0700356311 0700356321 0700356341 chr2:219001:221464:clu_1_NA 5 0 0 0 0 1 chr2:219001:224864:clu_1_NA 11 7 6 16 6 4 chr2:219001:229966:clu_1_NA 19 12 13 10 16 6 chr2:224920:229966:clu_1_NA 15 7 6 14 10 2 chr2:231191:233101:clu_2_NA 15 13 10 14 5 13 chr2:231191:234160:clu_2_NA 1 0 1 1 1 1 chr2:233229:234160:clu_2_NA 27 13 10 11 16 20 chr2:253115:256207:clu_3_NA 2 1 0 0 0 2 chr2:253115:260085:clu_3_NA 7 2 5 5 1 6 chr2:253115:263984:clu_3_NA 1 2 2 0 2 1 chr2:271939:272037:clu_4_NA 39 81 50 40 38 35 chr2:271939:272192:clu_4_NA 13 2 4 5 8 7 chr2:272065:272192:clu_4_NA 2 8 2 7 7 2 chr2:272150:275140:clu_5_NA 23 39 30 28 26 19 chr2:272150:276980:clu_5_NA 1 3 1 1 1 1

afadda91 commented 4 years ago

Hi Jack, did you figure out the issue?

thanks

jackhump commented 4 years ago

Hi afadda91,

It looks like there's something wrong with your gencode_hg38_all_exons.txt file. Did you construct it yourself or did you download it? David hasn't been maintaining the downloaded example files very well 😛 . Try making it yourself from a GENCODE GTF file (here: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_33/gencode.v33.annotation.gtf.gz) and using our included perl script (gtf2leafcutter.pl) to create it again.

Let me know how you get on. Sorry for the delay!

afadda91 commented 4 years ago

thanks a lot Jack. it worked to an extent. i still get some unlabeled genes.

Screen Shot 2020-01-28 at 12 58 11 PM
afadda91 commented 4 years ago

one more question, what do you consider a differentially spliced site in terms of delta PSI?

jackhump commented 4 years ago

You always get some unlabelled genes. Those in your screen shot look like artifacts with so many junctions.

For dPSI, the standard is a 10% change as that's about the limit for how small a change you can reliably validate with RT-PCR. But it depends on your dataset and what you'd expect to see. If you have a splicing factor knockdown in a cell line or you're comparing two very different tissues you may see dPSI of 50%-100%. But if you're looking in a big human cohort comparing some disease where splicing is indirectly involved you may only find changes of a few %, even if they have very low adjusted P values.

eternalapple commented 4 years ago

Hi, I have got the *significance.txt file, but it is not annotated(column "gene" is missing). Could you tell me how to annotate the intron cluster? Thanks! @jackhump @afadda91 @jungminchoilab

jackhump commented 4 years ago

Hi! You can use prepare_results.R as above to annotate each cluster with a gene name. Make sure to create the annotation file yourself using our perl script.