RGLab / MAST

Tools and methods for analysis of single cell assay data in R

gseaAfterBoot enriched genes #154

Open EBosi opened 3 years ago

EBosi commented 3 years ago

Dear all, thanks for the terrific tool. I've been analysing my dataset following the excellent MAIT tutorial (https://www.bioconductor.org/packages/release/bioc/vignettes/MAST/inst/doc/MAITAnalysis.html). Regarding the gene set enrichment analysis: methods from the GSEA family (GSEA, fgsea, etc.) report, in addition to the enrichment score and its significance, the list of genes contributing to the enrichment (leadingEdge). I was wondering whether this kind of information can be derived using gseaAfterBoot.

I've been trying to run the source of gseaAfterBoot line by line, but I ran into a number of errors about internal functions of the library that could not be found (e.g. functions from GSEA-by-boot.R). Is there a better way to tinker with the MAST functions and objects?

I hope I was clear, I'm looking forward to your reply. Emanuele

amcdavid commented 3 years ago

The function name is probably a misnomer, because it performs a competitive test like camera in the limma package (edgeR provides a DGEList method for it). So there's not a direct way to perform the leading edge analysis as done in the GSEA of Subramanian et al. (2005). But I imagine you could do something reasonable that replicates this by intersecting genes in a set with the ranks from the bootstrap (signed log10 p-values or Z scores). Sounds like an interesting and useful addition.

To step through gseaAfterBoot, you might clone this repo from GitHub, then run devtools::load_all() to import its internal functions.
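A minimal sketch of that workflow, assuming the repository has already been cloned into a local directory. The path in mast_dir is hypothetical, and the guard simply skips everything if the clone or devtools is missing:

```r
# Assumes the repo was cloned beforehand in a shell, e.g.
#   git clone https://github.com/RGLab/MAST.git
# `mast_dir` is a hypothetical path to that clone; adjust it to your setup.
mast_dir <- "MAST"

if (dir.exists(mast_dir) && requireNamespace("devtools", quietly = TRUE)) {
  # Loads the package from source, making non-exported helpers visible too
  devtools::load_all(mast_dir)
  # The next call to gseaAfterBoot() will open the browser for stepping
  debugonce(gseaAfterBoot)
}
```

With the package loaded this way, the helpers defined in GSEA-by-boot.R should be found when stepping through the function.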

EBosi commented 3 years ago

Hi Andrew, sorry for coming back to the discussion after such a long time. Thank you very much for your reply. I wanted to work a bit on this issue; could you please clarify what you mean by intersecting the genes in a set with the ranks from the bootstrap? Thanks again, Emanuele

amcdavid commented 3 years ago

As a proof-of-concept, you might use the signed hurdle p-values, e.g. from the summary method, multiplying the logFC by the -log10(p.value), before worrying about bootstrapping, which is only important if you want to deal with gene-gene correlations. There would be no complication in applying the typical leading edge analysis of Subramanian et al. (2005) to this quantity. I know clusterProfiler has this analysis implemented.
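A rough sketch of that proof-of-concept in base R, under stated assumptions. The data frame hurdle is a toy stand-in for the per-gene table you would assemble from the summary method (its column names logFC and p.value are hypothetical), and leading_edge() is an illustrative helper, not a MAST function. It computes the weighted running-sum statistic of Subramanian et al. (2005) and returns the set members up to the peak (or from the trough onward when the score is negative):

```r
# Toy stand-in for a per-gene table of hurdle results (hypothetical columns).
hurdle <- data.frame(
  primerid = paste0("gene", 1:10),
  logFC    = c(2.1, 1.8, 1.5, 0.9, 0.4, -0.2, -0.5, -1.0, -1.4, -2.0),
  p.value  = c(1e-8, 1e-6, 1e-5, 1e-3, 0.2, 0.6, 0.1, 1e-2, 1e-4, 1e-7)
)

# Ranking statistic suggested above: logFC times -log10(p-value).
stats <- setNames(hurdle$logFC * -log10(hurdle$p.value), hurdle$primerid)

# Illustrative helper (not part of MAST): enrichment score and leading edge.
leading_edge <- function(stats, gene_set, weight = 1) {
  stats  <- sort(stats, decreasing = TRUE)
  in_set <- names(stats) %in% gene_set
  stopifnot(any(in_set), !all(in_set))
  hit <- ifelse(in_set, abs(stats)^weight, 0)
  stopifnot(sum(hit) > 0)
  # Step up (weighted) at set members, step down (uniform) elsewhere.
  step    <- ifelse(in_set, hit / sum(hit), -1 / sum(!in_set))
  running <- cumsum(step)
  peak    <- which.max(abs(running))   # position of the maximum deviation
  es      <- running[[peak]]
  idx     <- seq_along(stats)
  # Positive score: members before/at the peak; negative: members after it.
  edge <- if (es >= 0) in_set & idx <= peak else in_set & idx >= peak
  list(es = es, leading_edge = names(stats)[edge])
}

res <- leading_edge(stats, gene_set = c("gene1", "gene2", "gene4", "gene8"))
```

In practice it is probably simpler to pass the same named vector to an existing implementation: fgsea reports a leadingEdge column and clusterProfiler's GSEA output a core_enrichment column. Note that this ignores gene-gene correlations, which is exactly what the bootstrap is meant to address.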

Working with the bootstraps, which would only be necessary if you are worried about gene-gene correlations, would be much more complicated--definitely beyond the scope of random thoughts on a GitHub issues page. You would need to modify the code to work with bootstat, the matrix of bootstrapped coefficients. Most of the rest of the code in this function is specialized to the case where we are summing across genes in the set rather than looking at individual genes. Deriving the impact of gene-gene correlations on the variance of the rank order of the genes in the set sounds pretty complicated!

EBosi commented 3 years ago

Hi Andrew, thanks for the clarification, I will do as you suggest. I tried to tackle the bootstrap array object, but with little knowledge of the embedded items I was really struggling. It would be nice to have this addition, though, if it's something that can be done with (relative) ease, since without it the set enrichment method of MAST is less advantageous than the others available (GSEA, clusterProfiler, etc.). Thank you so much for your support! Best, Emanuele

amcdavid commented 3 years ago

Unfortunately, this would not be easy -- the competitive gene set test is really a much different beast than the GSEA of Subramanian. More like a project for a PhD student than a couple hours' work.