egeulgen / pathfindR

pathfindR: Enrichment Analysis Utilizing Active Subnetworks
https://egeulgen.github.io/pathfindR/

smaller number of pathways calculated with score_terms than identified with cluster_enriched_terms #47

Closed EmmaRuiz closed 4 years ago

EmmaRuiz commented 4 years ago

Describe the bug

Good morning,

Thank you for all your tutorials. With my data, run_pathfindR() and cluster_enriched_terms() identified 98 KEGG signaling pathways. However, when I try to obtain the aggregated scores for each pathway with score_terms(), the resulting matrix contains scores for only 53 pathways.

I compared the two results to see whether the difference depended on the p-values or on the cluster assignments in RA_clustered, but I found nothing that explains why score_terms() keeps only a subset of the pathways even though I do not specify any filter.

Here is my code :

RA_output <- run_pathfindR(PROGvsPRE.GSE99898, output="C:/Users/remmanuelle/Documents/Bioinformatics/Signaling.pathway.HMOX1.melanoma/output_PROGvsPRE.GSE99898" )

RA_clustered <- cluster_enriched_terms(RA_output, method="hierarchical", use_description=TRUE)

score_matrix <- score_terms(enrichment_table = RA_clustered, exp_mat = RA_exp_mat, use_description = TRUE)

Thank you for your help,

Sincerely

R Session Information:

sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] pathfindR_1.5.0.9010

loaded via a namespace (and not attached):
[1] colorspace_1.4-1 ellipsis_0.3.1 class_7.3-17 modeltools_0.2-23 mclust_5.4.6 rprojroot_1.3-2 XVector_0.28.0
[8] fs_1.4.1 farver_2.0.3 remotes_2.1.1 graphlayouts_0.7.0 ggrepel_0.8.2 flexmix_2.3-15 bit64_0.9-7
[15] AnnotationDbi_1.50.0 fansi_0.4.1 codetools_0.2-16 R.methodsS3_1.8.0 doParallel_1.0.15 robustbase_0.93-6 knitr_1.28
[22] polyclip_1.10-0 pkgload_1.1.0 cluster_2.1.0 kernlab_0.9-29 png_0.1-7 R.oo_1.23.0 graph_1.66.0
[29] ggforce_0.3.1 compiler_4.0.0 httr_1.4.1 backports_1.1.7 assertthat_0.2.1 cli_2.0.2 tweenr_1.0.1
[36] htmltools_0.4.0 prettyunits_1.1.1 tools_4.0.0 igraph_1.2.5 gtable_0.3.0 glue_1.4.1 dplyr_1.0.0
[43] Rcpp_1.0.4.6 Biobase_2.48.0 vctrs_0.3.0 Biostrings_2.56.0 iterators_1.0.12 fpc_2.2-5 ggraph_2.0.3
[50] xfun_0.14 stringr_1.4.0 ps_1.3.3 testthat_2.3.2 lifecycle_0.2.0 pak_0.1.2 devtools_2.3.0
[57] XML_3.99-0.3 org.Hs.eg.db_3.11.4 DEoptimR_1.0-8 MASS_7.3-51.6 zlibbioc_1.34.0 scales_1.1.1 tidygraph_1.2.0
[64] parallel_4.0.0 KEGGgraph_1.48.0 yaml_2.2.1 curl_4.3 memoise_1.1.0 gridExtra_2.3 ggplot2_3.3.1
[71] stringi_1.4.6 RSQLite_2.2.0 highr_0.8 S4Vectors_0.26.1 desc_1.2.0 foreach_1.5.0 BiocGenerics_0.34.0
[78] pkgbuild_1.0.8 rlang_0.4.6 pkgconfig_2.0.3 prabclus_2.3-2 bitops_1.0-6 evaluate_0.14 lattice_0.20-41
[85] purrr_0.3.4 labeling_0.3 bit_1.1-15.2 processx_3.4.2 tidyselect_1.1.0 magrittr_1.5 R6_2.4.1
[92] IRanges_2.22.2 magick_2.3 generics_0.0.2 DBI_1.1.0 pillar_1.4.4 withr_2.2.0 KEGGREST_1.28.0
[99] RCurl_1.98-1.2 nnet_7.3-14 tibble_3.0.1 crayon_1.3.4 rmarkdown_2.2 viridis_0.5.1 usethis_1.6.1
[106] grid_4.0.0 blob_1.2.1 callr_3.4.3 digest_0.6.25 diptest_0.75-7 tidyr_1.1.0 R.utils_2.9.2
[113] stats4_4.0.0 munsell_0.5.0 viridisLite_0.3.0 sessioninfo_1.1.1

egeulgen commented 4 years ago

hey @EmmaRuiz,

In score_terms(), you have to use the expression matrix from which you obtained PROGvsPRE.GSE99898, not the built-in RA_exp_mat, which contains the rheumatoid arthritis example data. Let me know if the problem persists when you use the correct expression matrix.
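One quick way to catch this kind of mismatch before scoring is to check how many of the input genes appear in the row names of the expression matrix. The sketch below uses toy data in base R only. The names `input_df`, `exp_mat_ok`, `exp_mat_wrong`, and `gene_coverage` are hypothetical, and the sketch assumes that gene symbols are in the first column of the input data frame and in the row names of the matrix.

```r
# Toy input data frame: gene symbols in the first column (assumed layout)
input_df <- data.frame(
  Gene.symbol = c("TP53", "BRAF", "HMOX1", "MITF"),
  logFC = c(1.2, -0.8, 2.1, 0.5),
  adj.P.Val = c(0.01, 0.03, 0.001, 0.04),
  stringsAsFactors = FALSE
)

# Hypothetical matrices: one from the same study, one from an unrelated dataset
exp_mat_ok <- matrix(
  0, nrow = 4, ncol = 3,
  dimnames = list(c("TP53", "BRAF", "HMOX1", "MITF"), paste0("S", 1:3))
)
exp_mat_wrong <- matrix(
  0, nrow = 3, ncol = 3,
  dimnames = list(c("TP53", "IL6", "TNF"), paste0("S", 1:3))
)

# Fraction of input genes that are present among the matrix row names
gene_coverage <- function(genes, exp_mat) {
  mean(genes %in% rownames(exp_mat))
}

coverage_ok <- gene_coverage(input_df$Gene.symbol, exp_mat_ok)
coverage_wrong <- gene_coverage(input_df$Gene.symbol, exp_mat_wrong)
```

A coverage far below 1 suggests that the matrix does not belong to the data the enrichment analysis was run on.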

EmmaRuiz commented 4 years ago

Good evening, and thank you for your answer.

Here are the RA_output as an .rds file and the expression matrix.

Thank you for your help,

Emmanuelle

On Wed, Jun 3, 2020 at 15:38, Ege Ulgen notifications@github.com wrote:

hey @EmmaRuiz https://github.com/EmmaRuiz,

Do you mind sharing RA_output as an RDS file so I can try to reproduce the issue

egeulgen commented 4 years ago

hey @EmmaRuiz,

can you share the attachments directly from GitHub?

EmmaRuiz commented 4 years ago

RA_output.rds.zip

Good morning, attached is the RA_output.

I used the correct expression matrix.

Thanks for your help,

Emmanuelle

egeulgen commented 4 years ago

can you also share the expression matrix you used?

EmmaRuiz commented 4 years ago

GSE99898.data.genes.SPA.zip

Attached is the expression matrix used.

Thanks

Emmanuelle

egeulgen commented 4 years ago

With the data you shared and the script below, the number of pathways appears to be the same in both results (as shown below).

Because of the way score_terms() works, some pathways may be discarded because there are few or no input genes involved. However, 53 out of 98 would be unusual.

I'm certain that you should be getting the same results as I do if you follow the script below. If not, let me know and I will see if this may be a Windows-related issue.

Best, -E

library(pathfindR)

# Load the expression matrix (first column holds the row names)
exp_mat <- read.delim("misc/issues/issue47/GSE99898.data.genes.SPA/GSE99898.data.genes.SPA.txt", row.names = 1)
exp_mat <- as.matrix(exp_mat)
# Load the run_pathfindR() output shared in this issue
output_df <- readRDS("misc/issues/issue47/GSE99898.data.genes.SPA/issue47_output.rds")

# Cluster the enriched terms
clustered_df <- cluster_enriched_terms(output_df, method = "hierarchical", use_description = TRUE)
dim(clustered_df)
#> [1] 100  10

# Score the clustered terms using the matching expression matrix
score_matrix <- score_terms(enrichment_table = clustered_df, exp_mat = exp_mat, use_description = TRUE)
dim(score_matrix)
#> [1] 100  30
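The discarding behaviour mentioned above (terms with few or no input genes in the expression matrix) can be checked before calling score_terms(). The sketch below counts, for each term, how many of its genes are present in the matrix; a term with a count of zero has nothing to score. It uses toy data in base R and assumes that the enrichment table stores comma-separated gene symbols in Up_regulated and Down_regulated columns. `genes_in_matrix` is a hypothetical helper, not a pathfindR function, and score_terms() itself may apply further steps.

```r
# Toy enrichment table (assumed column layout)
enrichment_table <- data.frame(
  Term_Description = c("Pathway A", "Pathway B", "Pathway C"),
  Up_regulated = c("TP53, BRAF", "IL6", ""),
  Down_regulated = c("MITF", "", "TNF, IL1B"),
  stringsAsFactors = FALSE
)

# Toy expression matrix: genes in rows, samples in columns
exp_mat <- matrix(
  0, nrow = 4, ncol = 2,
  dimnames = list(c("TP53", "BRAF", "MITF", "IL6"), c("S1", "S2"))
)

# Count the genes of one term that are present in the matrix row names
genes_in_matrix <- function(up, down, exp_mat) {
  genes <- trimws(unlist(strsplit(c(up, down), ",")))
  genes <- genes[nzchar(genes)]  # drop empty entries
  sum(genes %in% rownames(exp_mat))
}

n_found <- mapply(
  genes_in_matrix,
  enrichment_table$Up_regulated,
  enrichment_table$Down_regulated,
  MoreArgs = list(exp_mat = exp_mat)
)
names(n_found) <- enrichment_table$Term_Description

# Terms with no genes in the matrix cannot receive a score
dropped_terms <- names(n_found)[n_found == 0]
```

If `dropped_terms` covers a large share of the table, the expression matrix most likely does not match the enrichment results.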

EmmaRuiz commented 4 years ago

It worked. I think there was a typo in how I called the expression matrix. Because the number of pathways differed with every analysis, I didn't notice it.

Thank you for your answer and your help,

I appreciate it a lot.

Emmanuelle