ctlab / fgsea

Fast Gene Set Enrichment Analysis
Other
379 stars 67 forks source link

Duplicated genes in $leadingEdge output from fgsea() #99

Closed mgrantpeters closed 3 years ago

mgrantpeters commented 3 years ago

Hi, I am having an issue with the output for the fgsea function where a row in the leadingEdge column will sometimes have the same gene multiple times (e.g.: Pathway 1 - HLA-E HLA-E HLA-E HLA-E HLA-E IL6ST VCAM1 ADCYAP1). I have checked my input tables and do not have any duplicate input entries. Is this something that has been previously observed/is there a reason for it?

Thank you,

Melissa

assaron commented 3 years ago

That's strange. Can you share the code how you do it?

mgrantpeters commented 3 years ago

Thank you for the quick response.

I'm sending in attachment a summarised version of the code and a print screen of the sgseares$leadingEdge output. I have been working with gene symbols rather than EntrezIDs or EnsemblIDs since each gene has multiple possible EntrezIDs or EnsemblIDs and I do not have enough data to infer which ID is correct in this case. Please let me know if further information about file inputs would be helpful.

All the best,

Melissa

On Fri, Jul 16, 2021 at 4:43 PM Alexey Sergushichev < @.***> wrote:

That's strange. Can you share the code how you do it?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ctlab/fgsea/issues/99#issuecomment-881542209, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ47ZJIIJEZNBS543TAEEJ3TYBHTNANCNFSM5APXB6SA .

library(msigdbr) library(ggplot2) library(fgsea) library(data.table) library(dplyr) library(tidyverse)

generef = msigdbr(species = "human")

Include only DBs of interest

generef<-generef[(generef$gs_subcat=="CP:KEGG" | generef$gs_subcat=="CP:REACTOME" | generef$gs_subcat=="IMMUNESIGDB" | generef$gs_subcat=="GO:BP"| generef$gs_subcat=="GO:MF"| generef$gs_cat=="H"),]

Keep gene_symbols

generef = split(x = generef$gene_symbol, f = generef$gs_name)

out_dir = sprintf("path/%s/results", bucket[m]) files <- list.files(path=sprintf("path/%s/", bucket[m]), pattern = "scores*") rangef = 1:length(files) for (n in rangef) {

Load file

cluster_counts = read.csv(sprintf("path/%s/%s", bucket[m], files[n]), sep="\t", row.names = 1, header = TRUE)

#Rank to deal with ties in scores
genesTables <- cluster_counts %>% mutate(rank = rank(scores,  ties.method = "random")) 

#Check no duplicated genes
print(sum(duplicated(row.names(cluster_counts))))

stats = genesTables$rank
names(stats) = row.names(cluster_counts)

#Run fgsea
fgseares = fgsea(generef, stats, nperm=10000, maxSize=500)
topPathwaysUp <- fgseares[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseares[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))

#Save figure top10 up and top10down
pdf(file.path(out_dir, sprintf("topUpTopDownPathway_%s.pdf", tools::file_path_sans_ext(files[n]))), height = 10, width = 22)
plotGseaTable(generef[topPathways], stats, fgseares, 
                gseaParam=0.5)
dev.off()

#save.csv of results
fwrite(fgseares, file = sprintf("/well/dendrou/projects/snRNA-seq_analyses/data/meta_jsd/pipe_int_postfilt_demult_sample_id_20210616/fgsea_analysis/%s/results/fgsearesults_%s.csv", bucket[m], tools::file_path_sans_ext(files[n])))
lapply(fgseares[ES > 0][head(order(pval), n=10), leadingEdge], write, sprintf("/well/dendrou/projects/snRNA-seq_analyses/data/meta_jsd/pipe_int_postfilt_demult_sample_id_20210616/fgsea_analysis/%s/results/fgsearesults_top10genesUp_%s.csv", bucket[m], tools::file_path_sans_ext(files[n])), append=TRUE, ncolumns=10)
lapply(fgseares[ES < 0][head(order(pval), n=10), leadingEdge], write, sprintf("/well/dendrou/projects/snRNA-seq_analyses/data/meta_jsd/pipe_int_postfilt_demult_sample_id_20210616/fgsea_analysis/%s/results/fgsearesults_top10genesDown_%s.csv", bucket[m], tools::file_path_sans_ext(files[n])), append=TRUE, ncolumns=10)

}

assaron commented 3 years ago

Can't see what could be wrong from the code alone. Please, make a small reproducible example: select the particular fgsea call and save the inputs into and .rda file with save(). Also please report your fgsea installed version. Actually, you can try to install the latest version first from github with "devtools::install_github('ctlab/fgsea')" and check that the problem persists.

mgrantpeters commented 3 years ago

Hi Alexey,

Thank you very much for the help - it seems like it was a version issue. Following updating with "devtools::install_github('ctlab/fgsea')" I went from version 1.12.0 to 1.19.2. Now I both do not have duplicated gene and have much more extensive gene lists per pathway. This makes more sense based on what I expected. Thank you very much for your help and taking the time to reply.

All the best,

Melissa

On Fri, Jul 16, 2021 at 7:19 PM Alexey Sergushichev < @.***> wrote:

Can't see what could be wrong from the code alone. Please, make a small reproducible example: select the particular fgsea call and save the inputs into and .rda file with save(). Also please report your fgsea installed version. Actually, you can try to install the latest version first from github with "devtools::install_github('ctlab/fgsea')" and check that the problem persists.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ctlab/fgsea/issues/99#issuecomment-881632896, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQ47ZJKVXRSKODT6PQCAD6DTYBZ45ANCNFSM5APXB6SA .