Lack of functionality for many gene sets

andreyurch commented 5 years ago

Dear singscore team,

Thank you very much for this very useful tool. I found what I was looking for.

The only big problems I see now: 1) No examples of how to work with many gene signatures. For example, I want to score and statistically test 9get p-values) for 100 samples and 100 of gene sets from MGIB. How can I do this efficiently? Without many loops and code writing? 2) The examples are not very good - they are embedded in the R package, so it is not clear which format I need to use for the gene sets. Would be great to provide them as separate tables which user need to download and process.

Best regards, Andrey

MomenehForoutan commented 5 years ago

Hi Andrey, Thanks for trying out the singscore, and thank you so much for the feedback!

Are you using the Bioconductor version of the package and its associated documentation http://bioconductor.org/packages/release/bioc/html/singscore.html ? We will add more examples in the new release to address the difficulties you mentioned.

One important thing to consider when scoring samples against many gene signatures is that not all signatures can be used always in the same way. For example,(a) we may have a signature whose genes are known to be highly expressed in highly a certain samples (e.g. high expression of proliferation genes in highly proliferative samples), or (b) a signature may composed of both up and down regulated gene sets which must be used together as a single signature, or (c) in some cases we may have a list of genes without having any clue of whether they are expected to be up or down regulated in a sample with high evidence of the signature... so this is important to make sure that we do not run all these signatures with the same argument setting when we score samples against many different signatures. Regarding your second question, signatures can simply be a character vector of Gene Symbols, Entrez Gene IDs, or any other gene ID; however, they must be in the same format as the row names of the ranked expression matrix that you use to score the samples, and they should be unique IDs.

Regarding your first question, if you have several gene signatures, and for example they are all expected to be up-regulated, you could run the below code to get a matrix of gene sets (in row) and samples (in column) back; here, each of the sig1, sig2 and sig3 are a vector of gene symbols and rankedData is the output of the rankGene()function and has gene symbols as row names.

sigList <- list (sig1, sig2, sig3)

scoreData <- plyr::ldply(sigList, function(s){
  sc <- simpleScore(rankData = rankedData, 
    upSet = s, 
    centerScore = T, 
    knownDirection = T)
  return(sc$TotalScore)
})

You could use relatively similar code for p-value calculation. Hope this is useful, and thanks again for your feedback! Please let us know if you have further questions, comments, or feedback. Cheers, Sepideh

P.S. This is now published in BMC Bioinformatics if you are interested to read more about this method https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2435-4

andreyurch commented 5 years ago

Dear Sepideh,

Thank you very much for the detailed explanation. Yes, I use the Bioconductor version.

Regarding the Up and Down regulated genes: in the classic GSEA we have NS score and associated P-value. So if I have Ns score = -5 and P-val = 5E-20 I can conclude that the set of genes is significantly downregulated (or upregulated if NS >0 and significance is high). What is analogue of this option for singscore, "C)" as I understand?

best regards, Andrey

MomenehForoutan commented 5 years ago

Hi Andrey, Thank you for your question!

Please note that the classic GSEA method is a supervised method where you compare two groups of samples, while singscore is an unsupervised method and you do not need to know any sample labels. The GSEA method provides scores and p-values for each gene set comparing the two groups, however, singscore gives scores and its associated p-values per sample. However, it is also possible to score all samples using singscore and then perform t-test between the two groups’ scores. If you are interested to perform statistical tests comparing the groups’ scores, then I would recommend that you see the types of simulations that we provided in the paper; however, if you are interested to have scores and p-values per sample, then:

If you have situation (a) as explained above; you score your samples by passing the signature to the upSet argument and leave downSet to be NULL when running simpleScore() function. You need to make sure that you set knownDirection = TRUE. This will give you scores for your samples against that specific signature. In this case scores are around zero, with a significant positive score for a sample representing up regulation of the signature genes in that sample, while a significant negative score represents down-regulation of genes in that sample. The significance of the observed scores in each sample can be calculated using generateNull() and getPvals() functions. The situation would be very similar if you have known up- and down-regulated genes in your signature (situation b), whereby you additionally specify the downSet argument accordingly in the above functions.

If you have situation (c), then it means that you do not know the direction of gene sets (i.e. your gene-set is possibly composed of both up- and down-regulated genes). In this case, you only provide upSet, set knowDirection = FALSE, and set centerScore = FALSE. Again, you could use generateNull() and getPvals() functions to calculate significance level of the scores.

In general, based on our experience, it is much better to know the direction of the gene-set or infer it if possible. I would only go with situation (c) if I do not have any information about the gene set at all (which is very rare).

Hope these are useful. Please let us know if you have further question.

Cheers, Sepideh

jcursons commented 5 years ago

Hi Andrey, thanks for your interest in the package and for getting in touch.

I haven't used MGIB myself, but I have had reasonable success in taking signatures directly from mSigDB (http://software.broadinstitute.org/gsea/msigdb) and using them with singscore. In many cases the mSigDB signatures are paired, with "_UP" and "_DN" signatures that can be used in the approach Sepideh outlines for (b). If paired signatures are not available, then I usually work with an assumption (which should be checked with the barcode plots etc) that these genes can be expected as upregulated, especially if they are considered as "cell type marker genes" or something similar.

With respect to your question about doing this efficiently across large numbers of gene sets, the gene ranking procedure is probably the most computationally expensive step of the process, and the ldply/function call that Sepideh suggests above uses pre-calculated "rankedData", so this should run relatively quickly over a large number of gene sets. The initial ranking step can take some time with large data sets - this will scale linearly with the number of samples and in a non-linear fashion (depending on the efficiency of the sorting algorithm) with the number of genes present in the transcriptomic data, but it needs to only be run once.

As a final note, you do need to be careful about directly comparing scores from different signatures (particularly those with large numbers of highly-expressed ribosomal genes/cytoskeletal components against something enriched for TFs etc) or signatures across different measurement platforms (total RNA vs poly-A enriched RNA vs various microarrays), however, with appropriate filtering (e.g. to take only genes which are in the union across two different platforms) we have found this approach to be relatively robust. Hopefully this is demonstrated to some extent in Sepideh & Dharmesh's BMC Bioinformatics manuscript.

Thanks, Joe

bhuvad commented 5 years ago

Hi Andrey,

The new update for singscore (April 2019) will have a function multiScore to apply singscore to many signatures. This should be more efficient than applying the simpleScore on each of the signatures/gene-sets independently as all the warnings and outputs are suppressed. You can download the developer version from our repo to have a go at it.

I'm not sure whether you still need extra documentation for the package but if you do, we are in the process of releasing an R/Bioconductor workflow package showing how a real, large scale dataset can be processed using singscore and signatures from the MSigDB. We aim to publish this in April too so keep an eye out if you are still interested. I am closing this issue.

Cheers, Dharmesh

andreyurch commented 5 years ago

Hello Dharmesh,

Thank you very much for this update! I believe it will be very useful for all the users!

Best regards, Andrey

guidohooiveld commented 5 years ago

Hi, I saw your paper in BMC Bioinformatics, and then this thread.

Basically, the analysis of one of our data sets resembles situation c) mentioned above (thus aiming to test in a set of samples enrichment of multiple gene sets (MSigDB) for which the direction of regulation is unknown).

I first ran the example in the vignette, and that worked fine: scoredf <- simpleScore(rankData, upSet = tgfb_gs_up, downSet = tgfb_gs_dn)

Next I would like to apply/extent the analysis for a single gene set using the suggestions from Sepideh (i.e.: only provide upSet, setknowDirection = FALSE, and set centerScore = FALSE). However, I can not get this to work... Using/adapting the example from the vignette:

> scoredf2 <- simpleScore(rankData, upSet = tgfb_gs_up, downSet = NULL, knownDirection = FALSE, centerScore = FALSE)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘simpleScore’ for signature ‘"matrix", "GeneSet", "NULL"’
>

Am I doing something fundamentally wrong and is this to be expected, or is something else going on?

Finally, is the workflow Dharmesh mentioned above regarding the function multiScore already published? If so, a link to it would be appreciated.

Thanks for your nice work! Guido

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Thirty)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] singscore_1.4.0      GSEABase_1.46.0      graph_1.62.0        
 [4] annotate_1.62.0      XML_3.98-1.20        AnnotationDbi_1.46.0
 [7] IRanges_2.18.1       S4Vectors_0.22.0     Biobase_2.44.0      
[10] BiocGenerics_0.30.0

MomenehForoutan commented 5 years ago

Hi Guido, Thanks for using the singscore method! I guess you have just found a bug! It will actually work if you remove downSet = NULL from the function. Thanks for letting us know about the issue! We will look into this. Cheers, Sepideh

dodoflyy commented 5 years ago

Hi Guido, Thanks for using the singscore method! I guess you have just found a bug! It will actually work if you remove downSet = NULL from the function. Thanks for letting us know about the issue! We will look into this. Cheers, Sepideh

Hello, may I know when this bug will fixed? I also met exactly same bug. I will try remove downSet = NULL for now. Thanks.

NikolayAlabi commented 4 years ago

Hello, I have been trying to use SingScore, but have run into some problems with the simpleScore function. Firstly, my data matrix has over 20,000 rows of gene symbols, with the columns being TCGA patients. I tried both scaled data, and raw gene level counts for each gene, however, neither worked. I also tried making my geneset as a string vector or as a GeneSet object, however, the simplesScore function still returned absolutely nothing. I have uploaded my raw gene counts matrix, as well as my files that I read into character vectors for both the up and down gene sets. I am hoping you can test this out yourself and let me know where the issue may lie. I do receive a warning message that one of my genes in the upSet is not in the matrix, however, I do not believe that should cause an issue.

Here is the code that I have been trying:

validation_transposed <- read.csv("validation_transposed.csv", row.names=1) up = readLines("upset.csv") down = readLines("downset.csv") library(singscore) ranked <- rankGenes(validation_scaled_transposed) scoredf <- simpleScore(ranked, upSet = up, downSet = down)

scoredf [1] TotalScore TotalDispersion UpScore UpDispersion DownScore
[6] DownDispersion
<0 rows> (or 0-length row.names)

Here is also my session info:

sessionInfo() R version 4.0.1 (2020-06-06) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale: [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
[3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
[5] LC_TIME=English_Canada.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] singscore_1.8.0

loaded via a namespace (and not attached): [1] SummarizedExperiment_1.18.1 locfit_1.5-9.4 tidyselect_1.1.0
[4] purrr_0.3.4 reshape2_1.4.4 lattice_0.20-41
[7] colorspace_1.4-1 vctrs_0.3.1 generics_0.0.2
[10] stats4_4.0.1 blob_1.2.1 XML_3.99-0.4
[13] rlang_0.4.6 pillar_1.4.4 glue_1.4.1
[16] DBI_1.1.0 BiocGenerics_0.34.0 bit64_0.9-7
[19] matrixStats_0.56.0 GenomeInfoDbData_1.2.3 lifecycle_0.2.0
[22] plyr_1.8.6 stringr_1.4.0 zlibbioc_1.34.0
[25] munsell_0.5.0 gtable_0.3.0 memoise_1.1.0
[28] Biobase_2.48.0 IRanges_2.22.2 GenomeInfoDb_1.24.2
[31] parallel_4.0.1 AnnotationDbi_1.50.1 GSEABase_1.50.1
[34] Rcpp_1.0.5 xtable_1.8-4 edgeR_3.30.3
[37] scales_1.1.1 limma_3.44.3 DelayedArray_0.14.0
[40] S4Vectors_0.26.1 graph_1.66.0 annotate_1.66.0
[43] XVector_0.28.0 bit_1.1-15.2 ggplot2_3.3.2
[46] digest_0.6.25 stringi_1.4.6 dplyr_1.0.0
[49] GenomicRanges_1.40.0 grid_4.0.1 tools_4.0.1
[52] bitops_1.0-6 magrittr_1.5 RCurl_1.98-1.2
[55] RSQLite_2.2.0 tibble_3.0.1 crayon_1.3.4
[58] tidyr_1.1.0 pkgconfig_2.0.3 ellipsis_0.3.1
[61] Matrix_1.2-18 rstudioapi_0.11 R6_2.4.1
[64] compiler_4.0.1 ` genesets.zip

validation_transposed.zip

Look forward to hearing from you, Nikolay

bhuvad commented 4 years ago

Hi Nikolay,

I had a look at the code and data you sent over and have located the issue. The problem is in the way you read in the genesets. The input should be a character vector, however, the way you are reading in the gene list is not producing this. In fact, it is reading in the comma-separated list as a single string (i.e. "GENE1, GENE2, GENE3", instead of "GENE1", "GENE2", "GENE3"). The code below should correct the import:

up = as.character(read.csv("upset.csv", header = FALSE))
down = as.character(read.csv("downset.csv", header = FALSE))

The other thing I'd like to point out is that you should not be using raw counts or log CPM. These summaries do not account for gene length biases which is really important to adjust for when using singscore. We suggest using RPKM/FPKM, TPM or RSEM values (or their log-transformations) as these adjust for gene length biases. Our F1000 paper shows how you can compute gene lengths for genes using biomart and use these to compute RPKM values (https://f1000research.com/articles/8-776).

Please feel free to ask further questions if necessary.

Cheers, Dharmesh

DavisLaboratory / singscore

Lack of functionality for many gene sets #7