egeulgen / pathfindR

pathfindR: Enrichment Analysis Utilizing Active Subnetworks
https://egeulgen.github.io/pathfindR/

dbl p-values result in error: "p values must all be numeric" #158

Closed johnsonra closed 1 year ago

johnsonra commented 1 year ago

Bug description

When p-values are formatted as <dbl> (specifically in the context of a tibble), run_pathfindR() generates the following error:

# Error in pathfindR::input_testing(input, p_val_threshold) : 
#   p values must all be numeric

To Reproduce

library(dplyr)
library(pathfindR)

dat <- tibble(gene = c("FAM110A", "RNASE2", "S100A8", "S100A9", "TEX261",
                       "ARHGAP17", "NUP62", "MYL6B", "BLOC1S1", "PCBP1" ),
              logFC = c( 0.073, 0.651, 0.620, -0.738, 0.509, 0.742, 0.615, -0.412, -0.538, 0.133),
              adj.P.Val = c(0.0915, 0.0385, 0.0400, 0.0344, 0.0448, 0.0325, 0.0400, 0.0513, 0.0420, 0.0836))

dat

# A tibble: 10 × 3
#    gene      logFC adj.P.Val
#    <chr>     <dbl>     <dbl>
#  1 FAM110A   0.073    0.0915
#  2 RNASE2    0.651    0.0385
#  3 S100A8    0.62     0.04  
#  4 S100A9   -0.738    0.0344
#  5 TEX261    0.509    0.0448
#  6 ARHGAP17  0.742    0.0325
#  7 NUP62     0.615    0.04  
#  8 MYL6B    -0.412    0.0513
#  9 BLOC1S1  -0.538    0.042 
# 10 PCBP1     0.133    0.0836

run_pathfindR(dat)

# `n_processes` is set to `iterations` because `iterations` < `n_processes`
# ## Testing input
# Error in pathfindR::input_testing(input, p_val_threshold) : 
#   p values must all be numeric

Work around

run_pathfindR(as.data.frame(dat))

# `n_processes` is set to `iterations` because `iterations` < `n_processes`
# There is already a directory named "pathfindR_Results".
# Writing the result to "pathfindR_Results(9)" not to overwrite any previous results.
# ## Testing input
# The input looks OK
# ## Processing input. Converting gene symbols,
# if necessary (and if human gene symbols provided)
# Number of genes provided in input: 10
# Number of genes in input after p-value filtering: 7
# Found interactions for all genes in the PIN
# Final number of genes in input: 7
# ## Performing Active Subnetwork Search and Enrichment
# ## Processing the enrichment results over all iterations
# ## Annotating involved genes and visualizing enriched terms
# .
# .
# .
# Output created: conversion_table.html
# Plotting the enrichment bubble chart
# Found 2 enriched terms
# 
# Enrichment results and table of converted genes 
# can be found in "results.html" 
# in the folder "/Users/johnsonra/Documents/Analysis/IDSS-23-Malaria-SomaScan/pathfindR_Results(9)"
# 
# Run cluster_enriched_terms() for clustering enriched terms
# 
# 
# ID        Term_Description Fold_Enrichment occurrence   support   lowest_p  highest_p
# 1 hsa04657 IL-17 signaling pathway        69.57764          7 0.4285714 0.01562062 0.03115759
# 2 hsa04530          Tight junction        20.00357          1 0.1111111 0.04737008 0.04737008
# Up_regulated Down_regulated
# 1       S100A8         S100A9
# 2     ARHGAP17 
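
If the input may arrive as either a tibble or a plain data.frame, the workaround above can be wrapped in a small guard. This is only a minimal sketch: `ensure_data_frame()` is a hypothetical helper name and is not part of pathfindR.

```r
# Hypothetical helper (not part of pathfindR):
# coerce a tibble to a plain data.frame, leave other inputs untouched.
ensure_data_frame <- function(x) {
  if (inherits(x, "tbl_df")) {
    x <- as.data.frame(x)
  }
  x
}

# Usage, assuming `dat` is the tibble from the reproduction above:
# run_pathfindR(ensure_data_frame(dat))
```

Checking for the `tbl_df` class keeps the helper a no-op for inputs that are already plain data.frames.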

Expected behavior

I would expect run_pathfindR() to treat <dbl> vectors the same as <num>.

Desktop:

R Session Information:

sessionInfo()

# R version 4.2.2 (2022-10-31)
# Platform: x86_64-apple-darwin17.0 (64-bit)
# Running under: macOS Big Sur ... 10.16
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] pathfindR_1.6.4      pathfindR.data_1.1.3 dplyr_1.1.0         
# 
# loaded via a namespace (and not attached):
#  [1] ggrepel_0.9.3          Rcpp_1.0.10            tidyr_1.3.0           
#  [4] Biostrings_2.66.0      png_0.1-8              digest_0.6.31         
#  [7] foreach_1.5.2          utf8_1.2.3             ggforce_0.4.1         
# [10] GenomeInfoDb_1.34.9    R6_2.5.1               stats4_4.2.2          
# [13] RSQLite_2.3.0          evaluate_0.20          httr_1.4.5            
# [16] ggplot2_3.4.1          pillar_1.8.1           zlibbioc_1.44.0       
# [19] rlang_1.1.0            rstudioapi_0.14        blob_1.2.4            
# [22] S4Vectors_0.36.2       rmarkdown_2.20         RCurl_1.98-1.10       
# [25] igraph_1.4.1           polyclip_1.10-4        bit_4.0.5             
# [28] munsell_0.5.0          compiler_4.2.2         xfun_0.37             
# [31] pkgconfig_2.0.3        BiocGenerics_0.44.0    htmltools_0.5.4       
# [34] tidyselect_1.2.0       KEGGREST_1.38.0        GenomeInfoDbData_1.2.9
# [37] tibble_3.2.1           gridExtra_2.3          IRanges_2.32.0        
# [40] codetools_0.2-18       graphlayouts_0.8.4     fansi_1.0.4           
# [43] viridisLite_0.4.1      crayon_1.5.2           withr_2.5.0           
# [46] bitops_1.0-7           MASS_7.3-58.1          grid_4.2.2            
# [49] gtable_0.3.3           lifecycle_1.0.3        DBI_1.1.3             
# [52] magrittr_2.0.3         scales_1.2.1           cli_3.6.0             
# [55] cachem_1.0.7           XVector_0.38.0         farver_2.1.1          
# [58] viridis_0.6.2          doParallel_1.0.17      generics_0.1.3        
# [61] vctrs_0.6.0            org.Hs.eg.db_3.16.0    iterators_1.0.14      
# [64] tools_4.2.2            bit64_4.0.5            Biobase_2.58.0        
# [67] glue_1.6.2             tweenr_2.0.2           purrr_1.0.1           
# [70] ggraph_2.1.0           parallel_4.2.2         fastmap_1.1.1         
# [73] AnnotationDbi_1.60.2   colorspace_2.1-0       tidygraph_1.2.3       
# [76] memoise_2.0.1          knitr_1.42

egeulgen commented 1 year ago

The issue lies in the check that verifies that all values in the appropriate column (either the 2nd or the 3rd) are numeric. <dbl> is still numeric, but the way we have to select the column returns another tbl object when the original table is a tibble. Because we do not want to include another dependency (i.e. tibble), we'd like to support only data.frame inputs at this time. As you point out, as.data.frame() resolves this and many other downstream issues that may arise.
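
The difference in column selection can be seen in a short, standalone sketch. It does not reproduce the actual code inside `input_testing()`, which is not shown in this thread. It only assumes that the check uses single-bracket `[, j]` indexing, and the toy `df` and `tb` objects are made up for illustration.

```r
library(tibble)

# Toy input, invented for illustration only
df <- data.frame(gene = c("A", "B"), p = c(0.01, 0.20))
tb <- as_tibble(df)

# data.frame: selecting a single column with [, j] drops to a plain vector
is.numeric(df[, 2])   # TRUE

# tibble: [, j] never drops, so the result is a one-column tibble
is.numeric(tb[, 2])   # FALSE

# [[ j ]] extracts the underlying vector for both classes
is.numeric(tb[[2]])   # TRUE

# Coercing first, as in the workaround, restores the data.frame behaviour
is.numeric(as.data.frame(tb)[, 2])   # TRUE
```

Under that assumption, the <dbl> values themselves are not the problem. The numeric check is applied to a one-column tibble rather than to the vector it contains.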