egeulgen / pathfindR

pathfindR: Enrichment Analysis Utilizing Active Subnetworks
https://egeulgen.github.io/pathfindR/

Issue with procuring MSigDB gene lists for mice #192

Closed t3h4nt1chr15t closed 7 months ago

t3h4nt1chr15t commented 8 months ago

Describe the bug According to the vignette at https://cran.r-project.org/web/packages/pathfindR/vignettes/obtain_data.html, it is possible to bring in the mouse MSigDB gene lists for nonhuman use of pathfindR. The issue with the script recommended there is that it asks for a species identifier as well as a collection, yet the human and mouse collections on MSigDB are distinctly different and have different names. All the mouse collections start with 'M', so giving the function an 'H' or 'C' identifier, as suggested for humans, would supposedly pull the wrong gene lists.

The obvious thing to do would be to supply the mouse collection identifier, but the function then raises an error saying that only collections starting with 'H' or 'C' are accepted. It's therefore unclear whether this is based on an earlier MSigDB release in which the collections perhaps didn't have unique names, or whether it would successfully pull only the mouse gene lists even when given the 'H' or 'C' identifier. I would assume it's based on an older version of MSigDB, but only because it doesn't include C8 as a collection to pull from, suggesting C8 didn't exist in earlier versions. It would be a big shame for C8/M8 not to be allowed as a gene list, as it's one of the great newer resources for deconvoluting cell types in bulk RNA-seq.

What's also very strange is that there is no 'M7' for mice, while there is a 'C7' for humans. So if you tell it your species is mouse but ask it to collect 'C7', trusting it to pull the mouse version, it will in fact pull a unique gene list with mouse gene identifiers. I have no idea where it's getting this from, though, as MSigDB states there is no 'M7'.

Comparing collections that are shared between mouse and human, like C2/M2, the number of gene sets pulled by pathfindR is much closer to the number listed on the MSigDB site for the human collection than for the mouse collection. So it almost looks like it might be converting human gene names to their mouse variants rather than pulling the actual mouse gene sets. This suspicion is further supported by the pulled gene lists carrying descriptions found in the human sets but not in the mouse sets.
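One way to check this suspicion directly is to count the distinct gene sets that come back and to look for a human-ortholog column in the result. The following is only a rough sketch: `summarise_gene_sets` is a hypothetical helper (not part of pathfindR or msigdbr), and the column names `gs_name` and `human_gene_symbol` as well as the `category` argument are assumptions based on the msigdbr release that was current at the time of this thread; other versions may differ.

```r
# Summarise a long-format gene set table (one row per gene set / gene pair).
# `summarise_gene_sets` is an illustrative helper, not a pathfindR or msigdbr function.
summarise_gene_sets <- function(df) {
  list(
    # number of distinct gene sets in the table
    n_gene_sets = length(unique(df$gs_name)),
    # a human-ortholog column hints that the genes were mapped from human sets
    # (column name is an assumption about the msigdbr output)
    has_human_ortholog_col = "human_gene_symbol" %in% names(df)
  )
}

# Toy table so the helper can be tried without any package installed
toy <- data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B"),
  gene_symbol = c("Trp53", "Brca1", "Myc"),
  human_gene_symbol = c("TP53", "BRCA1", "MYC")
)
toy_summary <- summarise_gene_sets(toy)

# The real check, run only if msigdbr is installed; errors are swallowed
# because argument names may differ between msigdbr versions
if (requireNamespace("msigdbr", quietly = TRUE)) {
  mouse_c2 <- tryCatch(
    msigdbr::msigdbr(species = "Mus musculus", category = "C2"),
    error = function(e) NULL
  )
  if (!is.null(mouse_c2)) print(summarise_gene_sets(mouse_c2))
}
```

If the count is close to the human C2 figure on the MSigDB site and the ortholog column is present, that would be consistent with human gene sets being mapped to mouse genes rather than the M2 collection being retrieved.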

To Reproduce Steps to reproduce the behavior:

  1. Run the following function: 'gsets_list <- get_gene_sets_list(source = "MSigDB", species = "Mus musculus", collection = "MH")'
  2. See error 'Error in get_mgsigdb_gsets(species = species, collection = collection, : collection should be one of “H”, “C1”, “C2”, “C3”, “C4”, “C5”, “C6”, “C7”'

Expected behavior I would expect it to pull the mouse collection by giving it the mouse collection identifier.

Desktop (please complete the following information):

R Session Information:
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: America/Chicago tzcode source: internal

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] fastcluster_1.2.3 corrplot_0.92 Hmisc_5.1-1 rgl_1.2.8
[5] biomaRt_2.58.0 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[9] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[13] ggplot2_3.4.4 tidyverse_2.0.0 dplyr_1.1.4 pathfindR_2.3.0
[17] pathfindR.data_2.0.0

loaded via a namespace (and not attached): [1] rstudioapi_0.15.0 jsonlite_1.8.8 magrittr_2.0.3
[4] magick_2.8.2 modeltools_0.2-23 farver_2.1.1
[7] rmarkdown_2.25 zlibbioc_1.48.0 vctrs_0.6.5
[10] memoise_2.0.1 RCurl_1.98-1.13 base64enc_0.1-3
[13] htmltools_0.5.7 progress_1.2.3 curl_5.2.0
[16] broom_1.0.5 Formula_1.2-5 htmlwidgets_1.6.4
[19] cachem_1.0.8 igraph_1.6.0 lifecycle_1.0.4
[22] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.6-4
[25] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.11
[28] digest_0.6.33 colorspace_2.1-0 AnnotationDbi_1.64.1
[31] S4Vectors_0.40.2 RSQLite_2.3.4 filelock_1.0.3
[34] labeling_0.4.3 fansi_1.0.6 timechange_0.2.0
[37] httr_1.4.7 polyclip_1.10-6 compiler_4.3.2
[40] bit64_4.0.5 withr_2.5.2 doParallel_1.0.17
[43] htmlTable_2.4.2 backports_1.4.1 viridis_0.6.4
[46] DBI_1.2.0 ggforce_0.4.1 R.utils_2.12.3
[49] MASS_7.3-60 rappdirs_0.3.3 tools_4.3.2
[52] foreign_0.8-85 prabclus_2.3-3 nnet_7.3-19
[55] R.oo_1.25.0 glue_1.6.2 grid_4.3.2
[58] checkmate_2.3.1 cluster_2.1.4 generics_0.1.3
[61] gtable_0.3.4 tzdb_0.4.0 R.methodsS3_1.8.2
[64] class_7.3-22 data.table_1.14.10 hms_1.1.3
[67] tidygraph_1.3.0 xml2_1.3.6 utf8_1.2.4
[70] XVector_0.42.0 flexmix_2.3-19 BiocGenerics_0.48.1
[73] ggrepel_0.9.4 foreach_1.5.2 pillar_1.9.0
[76] vroom_1.6.5 robustbase_0.99-1 tweenr_2.0.2
[79] BiocFileCache_2.10.1 lattice_0.22-5 bit_4.0.5
[82] tidyselect_1.2.0 Biostrings_2.70.1 knitr_1.45
[85] gridExtra_2.3 IRanges_2.36.0 stats4_4.3.2
[88] xfun_0.41 graphlayouts_1.0.2 Biobase_2.62.0
[91] diptest_0.77-0 DEoptimR_1.1-3 stringi_1.8.3
[94] evaluate_0.23 codetools_0.2-19 kernlab_0.9-32
[97] ggraph_2.1.0 BiocManager_1.30.22 cli_3.6.2
[100] rpart_4.1.21 munsell_0.5.0 Rsubread_2.16.0
[103] Rcpp_1.0.11 GenomeInfoDb_1.38.5 dbplyr_2.4.0
[106] png_0.1-8 XML_3.99-0.16 parallel_4.3.2
[109] blob_1.2.4 prettyunits_1.2.0 mclust_6.0.1
[112] bitops_1.0-7 viridisLite_0.4.2 scales_1.3.0
[115] crayon_1.5.2 fpc_2.2-11 rlang_1.1.2
[118] KEGGREST_1.42.0

egeulgen commented 8 months ago

Thank you for raising this! It seems that the function is a bit too stringent in validating the input. I'll try to revise the behaviour.

t3h4nt1chr15t commented 8 months ago

Opening that up would be nice, but I'm also very concerned: I'm fairly certain it isn't actually pulling the mouse gene lists at all, but just pulling the human gene lists and renaming the genes to their mouse variants. There are numerous examples of gene sets that are unique to the human collections, yet somehow appear in the pulled mouse gene lists.

egeulgen commented 8 months ago

I'll investigate and keep you updated

egeulgen commented 7 months ago

Investigated this and there's no need for any change at the moment. pathfindR uses the msigdbr R package internally (and further processes its output):

    msig_df <- msigdbr::msigdbr(species = species, category = collection, subcategory = subcollection)

msigdbr expects categories such as "H", and rejects the mouse identifiers:

    > msig_df <- msigdbr::msigdbr(species = "Mus musculus", category = "MH", subcategory = NULL)
    Error in msigdbr::msigdbr(species = "Mus musculus", category = "MH", subcategory = NULL) : 
      unknown category

This does return human gene sets, but it's best to contact the maintainers of msigdbr about it. My thinking is that they somehow do not have support for these mouse gene sets (and simply provide mouse-equivalent genes for the human sets).

let me know if I can help further.

egeulgen commented 7 months ago

see https://github.com/igordot/msigdbr/issues/32

t3h4nt1chr15t commented 7 months ago

Sorry, I'm a bit confused by this result. MSigDB does in fact have fully curated gene sets for mice that are not the same as they are for humans. Using the ones for humans isn't an acceptable thing to do if I'm going to publish my data in mice.

I'm not as technically inclined in computer science as you are, but it looks as if you're saying that some connecting library or package that allows pathfindR to do what it does might be outdated and doesn't allow for collecting the mouse gene sets yet, but you do recognize it is collecting the human genes, not the mouse genes, as a result of this.

The mouse gene sets aren't mouse equivalents of the human genes. So are you saying this is just a msigdbr package issue where they haven't updated their own package for the new database yet? Or are you saying they don't lend enough trust to the validity of their mouse gene sets to allow them to be collected yet?

egeulgen commented 7 months ago

I understand your frustration and agree with you that this is very "improper" for msigdbr. You can open another issue in the msigdbr repo, linked above, and raise your rightful concern. From my understanding, they just haven't had the resources to update the package to support the mouse-specific gene sets, which are readily available on MSigDB. Sorry I couldn't help further, but this falls outside the main responsibilities of maintaining pathfindR.
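A possible workaround in the meantime, offered only as a sketch: since the mouse-specific collections can be downloaded from MSigDB as GMT files, they could be parsed manually and supplied to pathfindR as custom gene sets. The example below assumes the standard tab-separated GMT layout (set name, description, then genes); `read_gmt` is a hypothetical helper written for illustration, and the toy file stands in for a real MSigDB download.

```r
# Parse a GMT file: each line is <set name> TAB <description> TAB <gene1> TAB <gene2> ...
# `read_gmt` is an illustrative helper, not a pathfindR function.
read_gmt <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(lines)]                      # drop empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  set_names <- vapply(fields, `[`, character(1), 1)  # first field: gene set name
  descriptions <- vapply(fields, `[`, character(1), 2)
  genes <- lapply(fields, function(x) unique(x[-(1:2)]))  # remaining fields: genes
  names(genes) <- set_names
  names(descriptions) <- set_names
  list(genes = genes, descriptions = descriptions)
}

# Toy GMT file standing in for a mouse collection downloaded from MSigDB
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c(
  "TOY_SET_1\ttoy description 1\tTrp53\tBrca1\tMyc",
  "TOY_SET_2\ttoy description 2\tCd4\tCd8a"
), gmt_path)

toy_sets <- read_gmt(gmt_path)
str(toy_sets$genes)
```

The resulting named list and named vector have the shape that `run_pathfindR()` takes through `gene_sets = "Custom"`, `custom_genes` and `custom_descriptions`. A mouse protein interaction network and matching gene symbols would still be needed, as described in the vignette linked at the top of this issue.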