FertigLab / CoGAPS

Bayesian MCMC matrix factorization algorithm
https://www.bioconductor.org/packages/release/bioc/html/CoGAPS.html
BSD 3-Clause "New" or "Revised" License
61 stars 17 forks source link

GSEA algorithms for the function getPatternHallmarks #91

Closed LiuCanidk closed 4 months ago

LiuCanidk commented 5 months ago

I found the plot generated by plotPatternHallmarks is less satisfactory. So I planned to perform the GSEA additionally to draw the enrichment plot individually and also plot the GSEA enrichment curve for specific interesting pathways. Also so that I can perform GO GSEA or KEGG GSEA with more freely.

But I found non-repeatable results of Hallmark pathway GSEA, using other GSEA tools like the R package BioEnricher. So I guess there are some paramters I did not know that caused this non-repeatbale result. In the tutorial, it said the input for getPatternHallmarks was the result of patternMarkers. I have confusions as follows:

dimalvovs commented 5 months ago

@LiuCanidk thanks for your questions. Please note that getPatternHallmarks is discontinued in the current version (see #92) and replaced with getPatternGeneSet method to aid stability, eliminate dependency on possibly outdated packages, and give users more freedom regarding the choice of hallmarks. Do you want to check that out?

LiuCanidk commented 5 months ago

@dimalvovs of course! Should I remove the old package and install the developmental version on the github? Then I would check out the new function and provide feedback.

dimalvovs commented 5 months ago

yes, that's the way

LiuCanidk commented 5 months ago

@dimalvovs , I try to use the new function. But first I need an order ranked gene list. So I plan to obtain this from the result of patternMarkers. But I found strange results that the rank and the score of genes has a positive correlation. As stated in the tutorial, lower rank represent higher association with specific patterns and higher score represent higher association. They should be negatively correlated.

code

#already with a cogaps result
pm <- patternMarkers(cogapsresult, threshold="cut")
save(pm, file='05.NMF_expression_program/NMF_CoGAPS_marker.rda')
#check marker rank
head(pm$PatternMarkers[[1]])
head(pm$PatternMarkerRanks[pm$PatternMarkers[[1]],])
#check marker score
head(pm$PatternMarkerScores[pm$PatternMarkers[[1]],])
#check the correlation between rank and score
cor(pm$PatternMarkerRanks[,1], pm$PatternMarkerScores[,1])

result:

image

Also, I did find that the rank of marker genes in pm$PatternMarkerRanks were not corresponding to the rank in the pm$PatternMarkers one by one. As you can see in the screenshot, gene PNMA8B is not NO.2 but NO.3.

So this made me confusing about how to choose the gene rank when perform the enrichement test in the getPatternGeneSet function.

LiuCanidk commented 5 months ago

@dimalvovs sorry, I may misunderstand the use of getPatternGeneSet, the geneset parameter should be some pathway gene list, such as Hallmark, KEGG GO etc. This was a little bit misleading.... And also, I recommend to provide additional choice for gmt file input, which can be easily downloaded from MsigDB.

LiuCanidk commented 4 months ago

@dimalvovs the new function is great. Only note that avoid the misleading use of geneset parameter as I mentioned above. Maybe providing the parameter choice of KEGG GO or Hallmark pathway list is more straightforward and user-friendly. In addition, for GSEA, the GSEA curve plot is necessary for specific pathway visualization (see below). Considering the usage of clusterProfiler function gseaplot and database R package msigdbr may help.

GSEA curve plot

image

Without provided list, my code was as follows:

#Enrichment analysis
library(clusterProfiler)
#readin the hallmark pathway gmt file
hallmark=read.gmt('F:/MSigDB/h.all.v2023.2.Hs.symbols.gmt')
#convert this dataframe to a gene list with names corresponding to each pathway
hallmark.list=list()
for(i in unique(hallmark$term)){
  term=tolower(gsub('HALLMARK_','',i))
  hallmark.list[[term]]=hallmark$gene[hallmark$term==i]
}
#perform the GSEA
ORA.hallmark=getPatternGeneSet(cogapsresult,
                      hallmark.list,
                      'enrichment')
#draw plot based on padj & NES, top10 pathways
for(i in 1:7){
  ORA.hallmark[[i]]=ORA.hallmark[[i]][order(ORA.hallmark[[i]]$padj),][1:10,]
  ORA.hallmark[[i]]$pathway=factor(ORA.hallmark[[i]]$pathway,
                                   levels = ORA.hallmark[[i]]$pathway)
  assign(paste0('p',i),
         ggplot(ORA.hallmark[[i]], aes(-log10(padj),pathway,fill=NES))+
           geom_col()+theme_bw(base_size = 12)+
           xlab(expression(-log[10])~padj)+ylab(NULL)+
           geom_vline(xintercept = -log10(0.0005), lty=2)+
           scale_x_continuous(expand = c(0,0))+
           scale_fill_continuous(low='lightblue',high='indianred')+
           geom_text(data = ORA.hallmark[[i]],
                     aes(x = 0.05, 
                         y = pathway,
                         label = pathway),
                     size = 4,
                     hjust = 0)+
           theme(axis.text.y = element_blank()))
}
#arrange and save PDF
library(ggpubr)
ggsave('05.NMF_expression_program/GSEA_hallmark.pdf',
       ggarrange(p1,p2,p3,p4,p5,p6,p7,
                    nrow=3, ncol=3, common.legend = T),
       width = 32, height = 24, units = 'cm')

plot result

image

plotPatternGeneSet result

image

Obviously, maybe providing a function that has better visualization than plotPatternGeneSet is highly recommended.

dimalvovs commented 4 months ago

adding @jmitchell81

jmitchell81 commented 4 months ago

Hello @LiuCanidk. Thank you for your feedback on the gene set plotting functions. I've created a branch implementing your suggestion for scaling the color of the bars by enrichment score and moving the gene set names to overlays on the bars on my own fork of the package that we may be able to implement soon.

In regards to your question about implementation of the GSEA curve plot, you could generate an enrichment plot by using the column for your pattern of interest as your test statistic vector passed to the plotting function in place of a t-statistic or logFoldChange derived from a differential expression test. For the interpretation of these pattern "amplitude" values, higher values indicate a greater association of the gene with the given pattern

cogapsresult@featureLoadings[["Pattern_1"]]

LiuCanidk commented 4 months ago

@jmitchell81 Thanks for your reply!