bwlewis / irlba

Fast truncated singular value decompositions

Different results obtained on the same input stored using dense or sparse matrices #61

Open AndiMunteanu opened 2 years ago

AndiMunteanu commented 2 years ago

I've been using the irlba package on the same input stored both as a dense and as a sparse matrix, and I noticed that the PCA output is influenced by the matrix storage format. Here is an example to illustrate the point, running irlba on the pbmc_small dataset (a toy dataset that ships with the Seurat package):

library(Seurat)   # provides the pbmc_small toy dataset
library(Matrix)

user.seed <- 2016

# The same data in two storage formats
dense_matrix  <- as.matrix(t(pbmc_small@assays[["RNA"]]@data))
sparse_matrix <- as(dense_matrix, "sparseMatrix")
print(sum(abs(as.matrix(sparse_matrix) - dense_matrix)))  # 0: the entries are identical

set.seed(user.seed)
pca.results <- irlba::irlba(A = dense_matrix, nv = 50)
dense_embedding <- pca.results$u %*% diag(pca.results$d)

set.seed(user.seed)
pca.results <- irlba::irlba(A = sparse_matrix, nv = 50)
sparse_embedding <- pca.results$u %*% diag(pca.results$d)

The dense and the sparse objects stem from the same initial matrix, the seed was set to the same value (2016) before each call, nv was set to 50 in both cases, and tol was left at its default of 1e-5.

Comparing the element-wise absolute values of dense_embedding and sparse_embedding (absolute values, because singular vectors are only determined up to sign), the maximum difference is 1.902228e-08; the code I used was max(abs(abs(dense_embedding) - abs(sparse_embedding))). I also plotted the distribution of these differences across each principal component.

[Figure: sparse_vs_dense_pbmc — per-PC distributions of the differences between the dense and sparse embeddings]
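For reference, a minimal sketch of how such a plot can be produced, assuming the dense_embedding and sparse_embedding objects from above (the column names and plot details are illustrative, not the exact code behind the figure):

library(ggplot2)
library(reshape2)

# Per-entry differences, one column per principal component
diff_df <- as.data.frame(abs(abs(dense_embedding) - abs(sparse_embedding)))
colnames(diff_df) <- paste0("PC", seq_len(ncol(diff_df)))

# Long format for ggplot: one row per (cell, PC) pair
diff_long <- reshape2::melt(diff_df, variable.name = "PC", value.name = "difference")

# Boxplot of differences per PC; exact zeros are dropped by the log scale
ggplot(diff_long, aes(x = PC, y = difference)) +
  geom_boxplot() +
  geom_hline(yintercept = 1e-5, linetype = "dotted") +  # irlba's default tol
  scale_y_log10() +
  theme(axis.text.x = element_text(angle = 90))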

Although 2e-08 might be considered a negligible value, working with larger datasets results in even larger differences (the following plot was created on a dataset with 1,880 points and 31,832 features, where the maximum difference between the two embeddings was 0.0014). The dotted line indicates the value of the tolerance parameter, which defaults to 1e-05.

[Figure: sparse_vs_dense_cuomo — per-PC difference distributions on the larger dataset, with a dotted line at tol = 1e-05]

I am happy to share the code used for generating these plots if required.

My question is: should changing the matrix storage format affect the irlba results in the way seen above? Below is the sessionInfo() output:

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20.1

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=ro_RO.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=ro_RO.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ro_RO.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4.4     irlba_2.3.3        Matrix_1.3-4       ggplot2_3.3.5      Seurat_4.0.4       SeuratObject_4.0.2

loaded via a namespace (and not attached):
  [1] Rtsne_0.15            colorspace_2.0-2      deldir_1.0-2          ellipsis_0.3.2        ggridges_0.5.3        spatstat.data_2.1-0   farver_2.1.0         
  [8] leiden_0.3.9          listenv_0.8.0         ggrepel_0.9.1         fansi_0.5.0           codetools_0.2-18      splines_4.1.2         knitr_1.36           
 [15] polyclip_1.10-0       jsonlite_1.7.2        ica_1.0-2             cluster_2.1.2         png_0.1-7             uwot_0.1.10           shiny_1.7.1          
 [22] sctransform_0.3.2     spatstat.sparse_2.0-0 readr_2.0.2           compiler_4.1.2        httr_1.4.2            assertthat_0.2.1      fastmap_1.1.0        
 [29] lazyeval_0.2.2        later_1.3.0           htmltools_0.5.2       tools_4.1.2           igraph_1.2.6          gtable_0.3.0          glue_1.4.2           
 [36] RANN_2.6.1            dplyr_1.0.7           Rcpp_1.0.7            scattermore_0.7       vctrs_0.3.8           nlme_3.1-152          lmtest_0.9-38        
 [43] xfun_0.26             stringr_1.4.0         globals_0.14.0        mime_0.12             miniUI_0.1.1.1        lifecycle_1.0.1       goftest_1.2-3        
 [50] future_1.22.1         MASS_7.3-54           zoo_1.8-9             scales_1.1.1          spatstat.core_2.3-0   hms_1.1.1             promises_1.2.0.1     
 [57] spatstat.utils_2.2-0  parallel_4.1.2        RColorBrewer_1.1-2    yaml_2.2.1            reticulate_1.22       pbapply_1.5-0         gridExtra_2.3        
 [64] rpart_4.1-15          stringi_1.7.5         rlang_0.4.11          pkgconfig_2.0.3       matrixStats_0.61.0    evaluate_0.14         lattice_0.20-45      
 [71] ROCR_1.0-11           purrr_0.3.4           tensor_1.5            patchwork_1.1.1       htmlwidgets_1.5.4     cowplot_1.1.1         tidyselect_1.1.1     
 [78] parallelly_1.28.1     RcppAnnoy_0.0.19      plyr_1.8.6            magrittr_2.0.1        R6_2.5.1              generics_0.1.0        DBI_1.1.1            
 [85] pillar_1.6.3          withr_2.4.2           mgcv_1.8-38           fitdistrplus_1.1-6    survival_3.2-13       abind_1.4-5           tibble_3.1.5         
 [92] future.apply_1.8.1    crayon_1.4.1          KernSmooth_2.23-20    utf8_1.2.2            spatstat.geom_2.2-2   plotly_4.9.4.1        tzdb_0.1.2           
 [99] rmarkdown_2.11        ClustAssess_0.2.0     grid_4.1.2            data.table_1.14.2     digest_0.6.28         xtable_1.8-4          tidyr_1.1.4          
[106] httpuv_1.6.3          munsell_0.5.0         viridisLite_0.4.0  
bwlewis commented 2 years ago

Thanks for this excellent issue!

I agree that the storage format should not alter the results beyond usual floating-point limits. I'm investigating this example carefully and trying to get to the bottom of it.

LTLA commented 2 years ago

My 2c: this doesn't seem particularly unusual for an iterative algorithm if there are differences in numerical precision between the sparse matrix multiplication operator (based on CHOLMOD, IIRC) and its dense counterpart (LAPACK's dgemm). From experience with other algorithms, namely the C++ code in Rtsne, I've noticed that very minor changes in precision, e.g. flipping the least significant bit of a double-precision value, can happily propagate into very large differences in the final result.
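A minimal sketch illustrating both points, assuming only the Matrix and irlba packages (the matrix and dimensions here are made up for illustration): the first comparison checks whether dense (BLAS) and sparse (CHOLMOD) products of the same matrix and vector agree bit-for-bit, and the second perturbs a single entry by one ulp and re-runs irlba with the same seed.

library(Matrix)

set.seed(1)
A <- matrix(rnorm(1e4), 100, 100)
S <- as(A, "sparseMatrix")   # same values, sparse (CHOLMOD) storage
x <- rnorm(100)

# Dense and sparse products can differ in the last bits because the
# summations are performed in different orders; this may be exactly
# zero on some BLAS builds.
max(abs(as.numeric(A %*% x) - as.numeric(S %*% x)))

# A one-ulp perturbation of a single entry...
B <- A
B[1, 1] <- B[1, 1] * (1 + .Machine$double.eps)

set.seed(1)
s1 <- irlba::irlba(A, nv = 5)
set.seed(1)
s2 <- irlba::irlba(B, nv = 5)

# ...can propagate into differences far larger than one ulp
# (absolute values, since singular vectors have arbitrary sign).
max(abs(abs(s1$u) - abs(s2$u)))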