egeulgen / pathfindR

pathfindR: Enrichment Analysis Utilizing Active Subnetworks
https://egeulgen.github.io/pathfindR/
Other
178 stars 25 forks source link

Ability to set Background geneset #137

Closed Rohit-Satyam closed 2 years ago

Rohit-Satyam commented 2 years ago

Hi

I was trying out your package and I found that the run_pathfindR function don't have an argument to set background /Universe geneset. Since in experiments like scRNASeq there is a fallout of large number of genes, I believe it is necessary to limit background gene set size before enrichment analysis. How does pathfindR takes care of that. Should I remove the genes that were not captured/dropouts or not expressed in my experiment from custom_genes = pfa_kegg_genes

Besides, I wish to understand the rationale of this step, where we filter interactions based on combined_score when using non-human organism.

## filter using combined_score cut-off value of 800
mmu_string_df <- mmu_string_df[mmu_string_df$combined_score >= 800, ]

This is because using this threshold leads to lot of genes being absent from the PIN. When I retain all the interactions, this percentage reduced to 11%

Could not find any interactions for 74 (34%%) genes in the PIN
egeulgen commented 2 years ago

Hello,

Yes, I would suggest removing dropouts from custom_genes.

The rationale for filtering STRING based on the combined score is to have high-confidence interactions, you may choose to use a lower threshold to have 'medium' or better confidence (suggested to be 400, see this article)

Rohit-Satyam commented 2 years ago

I wrote this function. It's not neat but could be used to subset the gsets_list given a vector of genes that you intend to keep in your background. Please include it in your next release if you find it useful and valid.

retainExpressedOnly <- function(gsets,univ){
  ## Tidy Up Gene Names
  gset.tidy <- lapply(1:length(gsets$gene_sets), function(x){
    gsub("\\..*","", gsets$gene_sets[[x]])
  })

  univ.tidy <-gsub("\\..*","", univ)

  ## Subsetting the gset based on background list provided
  gset.subset <- lapply(1:length(gset.tidy),function(x){
    gset.tidy[[x]][gset.tidy[[x]] %in% univ.tidy]
  })

  names(gset.subset) <- names(gsets$gene_sets)

  ## Some pathways might be empty so remove them
  vec <- lengths(gset.subset)!=0

  gset.subset <- gset.subset[vec]
  kegg_descriptions <- gsets$descriptions[vec]
  res <- list(gene_sets=gset.subset,descriptions=kegg_descriptions)
  return(res)
}

Usage:

retainExpressedOnly(gsets = gsets_list, univ = p9_bk)

> p9_bk %>% head()
[1] "PF3D7_0102200" "PF3D7_0104300" "PF3D7_0104500" "PF3D7_0106400" "PF3D7_0106700" "PF3D7_0107500"