egeulgen / pathfindR

pathfindR: Enrichment Analysis Utilizing Active Subnetworks
https://egeulgen.github.io/pathfindR/

pathfindR get_gene_sets_list() gives wrongly formatted results for non-model organism #205

Closed Rohit-Satyam closed 2 months ago

Rohit-Satyam commented 2 months ago

Describe the bug

When I run the code given under "To Reproduce" for P. falciparum, I get a list of pathways containing gene descriptions rather than gene IDs:

> gsets_list$gene_sets
$pfa01100
 [1] "aminomethyltransferase" "hexokinase"             "arginase"               "phosphoglucomutase"    
 [5] "ferrochelatase"         "chitinase"              "allantoicase"           "phosphomannomutase"    
 [9] "dihydroorotase"         "acylphosphatase"        "enolase"                "transketolase"         
[13] "adenosylhomocysteinase" "CYTB"                   "coxIII"                 "coi"                   
[17] "ferredoxin"            

$pfa01110
[1] "transketolase"          "hexokinase"             "phosphoglucomutase"     "ferrochelatase"         "phosphomannomutase"    
[6] "aminomethyltransferase" "enolase"                "arginase"          

rather than returning PF3D7 IDs. This organism lacks proper gene symbols, so PF3D7 Ensembl IDs are used most of the time.

To Reproduce

gsets_list <- get_gene_sets_list(source = "KEGG",
                                 org_code = "pfa")

Expected behavior

The expected behavior is a list of gene sets containing the Ensembl gene IDs:

$pfa00010
 [1] "PF3D7_0624000" "PF3D7_1436000" "PF3D7_0915400" "PF3D7_1444800" "PF3D7_1439900" "PF3D7_0318800" "PF3D7_1462800"
 [8] "PF3D7_0922500" "PF3D7_1120100" "PF3D7_1015900" "PF3D7_1037100" "PF3D7_0626800" "PF3D7_1124500" "PF3D7_1446400"
[15] "PF3D7_1020800" "PF3D7_1232200" "PF3D7_0815900" "PF3D7_0627800" "PF3D7_1012500" "PF3D7_1342800"

$pfa00020
 [1] "PF3D7_1022500" "PF3D7_1342100" "PF3D7_1345700" "PF3D7_0820700" "PF3D7_1320800" "PF3D7_1232200" "PF3D7_0815900"
 [8] "PF3D7_1108500" "PF3D7_1431600" "PF3D7_1034400" "PF3D7_1212800" "PF3D7_0927300" "PF3D7_0616800" "PF3D7_1342800"
[15] "PF3D7_1124500" "PF3D7_1446400" "PF3D7_1020800"
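A quick way to tell the two output shapes apart is to check whether every entry carries the expected ID prefix. The following is a minimal sketch in base R; `find_non_id_entries()` is a hypothetical helper (not part of pathfindR), and the `"PF3D7_"` prefix is an assumption based on the IDs shown above.

```r
# Hypothetical helper (not part of pathfindR): returns, per gene set, the
# entries that do NOT start with the expected ID prefix.
find_non_id_entries <- function(gene_sets, prefix = "PF3D7_") {
  bad <- lapply(gene_sets, function(ids) ids[!startsWith(ids, prefix)])
  # Keep only the gene sets that contain at least one offending entry
  bad[lengths(bad) > 0]
}

# Toy input mixing the two shapes shown in this report
toy_sets <- list(
  pfa00010 = c("PF3D7_0624000", "PF3D7_1436000"),
  pfa01110 = c("transketolase", "hexokinase")
)

flagged <- find_non_id_entries(toy_sets)
print(names(flagged))
```

With the real object, the same call would be `find_non_id_entries(gsets_list$gene_sets)`; an empty result would indicate that all entries look like PF3D7 IDs.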

R Session Information:

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8    LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
[5] LC_TIME=English_India.utf8    

time zone: Asia/Riyadh
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pathfindR_2.4.0      pathfindR.data_2.1.0

loaded via a namespace (and not attached):
 [1] viridis_0.6.5      utf8_1.2.4         generics_0.1.3     tidyr_1.3.1        digest_0.6.35      magrittr_2.0.3    
 [7] evaluate_0.23      grid_4.3.1         iterators_1.0.14   fastmap_1.1.1      foreach_1.5.2      doParallel_1.0.17 
[13] ggrepel_0.9.5      gridExtra_2.3      purrr_1.0.2        fansi_1.0.6        viridisLite_0.4.2  scales_1.3.0      
[19] tweenr_2.0.3       codetools_0.2-20   cli_3.6.2          rlang_1.1.3        graphlayouts_1.1.1 polyclip_1.10-6   
[25] tidygraph_1.3.1    munsell_0.5.1      withr_3.0.0        cachem_1.0.8       tools_4.3.1        parallel_4.3.1    
[31] memoise_2.0.1      dplyr_1.1.4        colorspace_2.1-0   ggplot2_3.5.1      vctrs_0.6.5        R6_2.5.1          
[37] lifecycle_1.0.4    MASS_7.3-60.0.1    ggraph_2.2.1       pkgconfig_2.0.3    pillar_1.9.0       gtable_0.3.5      
[43] glue_1.7.0         Rcpp_1.0.12        ggforce_0.4.2      xfun_0.43          tibble_3.2.1       tidyselect_1.2.1  
[49] rstudioapi_0.16.0  knitr_1.46         farver_2.1.1       htmltools_0.5.8.1  igraph_2.0.3       rmarkdown_2.26    
[55] compiler_4.3.1

Rohit-Satyam commented 2 months ago

Kindly note that this issue is limited to the current release, which you pushed 4 days ago, and is resolved by downgrading the packages to pathfindR.data_2.1.0 and pathfindR_2.3.1. The old function was faster as well.

Rohit-Satyam commented 2 months ago

Besides, I observed that pathfindR is discarding many genes in my analysis. We know from experience that the PPI data in STRING for Plasmodium are sparse and mostly based on coexpression (also see this discussion). As a result, most interactions are filtered out even when I use a lower combined_score cut-off of 400, and nearly 43% of my genes are discarded when I run pathfindR (see log below). Do the genes that are not found in the PIN still undergo enrichment analysis, or are they discarded?

Number of genes in input after p-value filtering: 1412
pathfindR cannot handle p values < 1e-13. These were changed to 1e-13
Could not find any interactions for 604 (42.78%) genes in the PIN
Final number of genes in input: 808

So is pathfindR not useful for organisms where the interaction data are not well established?
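For reference, the figures in the log are consistent with the 604 genes being dropped from the input: 1412 minus 604 gives the reported final count of 808. The sketch below reproduces that arithmetic and then shows, on toy data, how one might list the input genes that are absent from a PIN. The gene names and the `pin_genes` vector are hypothetical placeholders, not pathfindR objects.

```r
# Figures copied from the log above
n_after_filter <- 1412
n_not_in_pin   <- 604

pct_not_in_pin <- round(100 * n_not_in_pin / n_after_filter, 2)
n_final        <- n_after_filter - n_not_in_pin

# Toy illustration (hypothetical identifiers): which input genes have no
# node in the PIN, and which remain
input_genes <- c("gene_a", "gene_b", "gene_c", "gene_d")
pin_genes   <- c("gene_a", "gene_c", "gene_x")

missing_from_pin <- setdiff(input_genes, pin_genes)
kept_genes       <- intersect(input_genes, pin_genes)

print(pct_not_in_pin)
print(n_final)
print(missing_from_pin)
```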

egeulgen commented 2 months ago

Hey @Rohit-Satyam.

Regarding the bug, I might have introduced it when I updated the relevant function in the last release. I will investigate and try to resolve it.

Regarding your second comment, pathfindR is a tool for active-subnetwork search followed by enrichment. As such, it has the limitation that it will not perform as well when the protein interaction network contains fewer interactions, as in your case for Plasmodium. However, the results should nonetheless be reliable.

egeulgen commented 2 months ago

This is a case of over-engineering a function: it first fetches the KEGG IDs for pathway genes from KEGG, then tries to convert the KEGG IDs to gene names using other data from KEGG. The conversion is not its responsibility. Therefore, I will remove the conversion functionality and return KEGG IDs (as in the previous behavior), so that users can convert the identifiers with a more appropriate tool (e.g. biomart) if they wish. I'll update here once the implementation is finished.
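As a rough illustration of what user-side conversion could look like once KEGG IDs are returned, here is a minimal base R sketch. `convert_gene_sets()` and `id_map` are hypothetical (not part of pathfindR), and the target labels are placeholders; in practice the mapping would come from whichever annotation tool the user prefers.

```r
# Hypothetical helper: map every ID in each gene set through a named
# character vector, dropping IDs that have no mapping.
convert_gene_sets <- function(gene_sets, id_map) {
  lapply(gene_sets, function(ids) {
    mapped <- unname(id_map[ids])   # NA where an ID is not in id_map
    unique(mapped[!is.na(mapped)])
  })
}

# Placeholder mapping (names = KEGG IDs, values = made-up target labels)
id_map <- c(
  PF3D7_0624000 = "label_1",
  PF3D7_1436000 = "label_2"
)

gene_sets <- list(
  pfa00010 = c("PF3D7_0624000", "PF3D7_1436000", "PF3D7_0915400")
)

converted <- convert_gene_sets(gene_sets, id_map)
print(converted)
```

Unmapped IDs are silently dropped here; keeping them unchanged instead would be an equally reasonable choice, depending on the downstream analysis.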

egeulgen commented 2 months ago

The fixed function is now available in the development version of pathfindR. You can install it via:

install.packages("devtools") # if you have not installed "devtools"
devtools::install_github("egeulgen/pathfindR")

I will try to release a patch version (i.e. pathfindR 2.4.1) on CRAN soon.