Is scaleDims scaling along the wrong axis?

dannyconrad commented 9 months ago

I read the issue threads #323 & #447 because I've been working a lot with the LSI components of my datasets lately. I think I found the source of the confusion that lead to those posts.

When scaleDims is set to T (the default), the rowZscores function is invoked which scales the LSI component values of each individual cell by row, i.e. using the mean and SD of each individual cell's N components. Obviously based on the name of the function this seems by design.

However I also saw in the documentation of getReducedDims() that the idea to do so was based on the scaling Tim Stuart introduced in Signac::RunSVD(). I dug into the code of RunSVD() and there the scaling is not done by row, but rather by column, i.e. scaling by the mean and SD of each component instead.

The relevant code within RunSVD():

if (scale.embeddings) {
    if (verbose) {
      message("Scaling cell embeddings")
    }
    embed.mean <- apply(X = cell.embeddings, MARGIN = 2, 
      FUN = mean)
    embed.sd <- apply(X = cell.embeddings, MARGIN = 2, FUN = sd)
    norm.embeddings <- t((t(cell.embeddings) - embed.mean)/embed.sd)
    if (!is.null(x = scale.max)) {
      norm.embeddings[norm.embeddings > scale.max] <- scale.max
      norm.embeddings[norm.embeddings < -scale.max] <- -scale.max
    }
  }

As far as I can tell, in both cases the input matrix has components as columns and cells as rows.

This discrepancy is why the corToDepth vector of the scaled embeddings is so different and the scaled dimensions no longer really correlate with nFrags. Since the scaled values are used by default, LSI_1 is almost never filtered out by corCutOff even when it should be.

To reproduce/verify this, you can just check the ranked order of the values along each axis before and after the scaling is performed:

## RunSVD()
SeuratObject <- RunTFIDF(SeuratObject, assay = "peaks")
SeuratObject <- FindTopFeatures(SeuratObject, assay = "peaks", min.cutoff = 'q0')

svd_scale <- RunSVD(SeuratObject, assay = "peaks", scale.embeddings = TRUE)@reductions$svd@cell.embeddings
svd_raw <- RunSVD(SeuratObject, assay = "peaks", scale.embeddings = FALSE)@reductions$svd@cell.embeddings

# this is TRUE, i.e. cells are ranked the same within each component due to scaling by column
identical(svd_scale[,1] %>% order(),
          svd_raw[,1] %>% order()) 
# this is FALSE, i.e. the rank of each component per cell has changed
identical(svd_scale[1,] %>% order(),
          svd_raw[1,] %>% order())

## getReducedDims()
ArchRProject <- addIterativeLSI(ArchRProject)

# this is FALSE, i.e. the cells are ordered differently within each component
identical(getReducedDims(ArchRProject, scaleDims = T, corCutOff = 1)[,1] %>% order(),
          getReducedDims(ArchRProject, scaleDims = F, corCutOff = 1)[,1] %>% order())
# this is TRUE, i.e. the components of each cell are ranked the same due to scaling by row
identical(getReducedDims(ArchRProject, scaleDims = T, corCutOff = 1)[1,] %>% order(),
          getReducedDims(ArchRProject, scaleDims = F, corCutOff = 1)[1,] %>% order())

Because of the way it messes up the depth correlation and artificially rearranges the cells relative to one another in the lower dimensional space, I'm guessing this is not the correct way to scale the LSI dimensions, but maybe I'm wrong and this was done orthogonally to Signac on purpose? Or have I missed some key detail here?

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     L'Ecuyer-CMRG 
 Normal:  Inversion 
 Sample:  Rejection 

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Signac_0.2.5                          BSgenome.Hsapiens.UCSC.hg19_1.4.3     BSgenome.Hsapiens.UCSC.hg38_1.4.5    
 [4] Rsamtools_2.14.0                      BSgenome.Mmulatta.UCSC.rheMac10_1.4.2 BSgenome_1.66.2                      
 [7] rtracklayer_1.58.0                    Biostrings_2.66.0                     XVector_0.38.0                       
[10] uwot_0.1.14                           nabor_0.5.0                           SeuratObject_4.1.3                   
[13] Seurat_4.3.0                          cowplot_1.1.1                         RColorBrewer_1.1-3                   
[16] dplyr_1.1.0                           deMULTIplex2_1.0.1                    rhdf5_2.42.0                         
[19] SummarizedExperiment_1.28.0           Biobase_2.58.0                        MatrixGenerics_1.10.0                
[22] Rcpp_1.0.10                           Matrix_1.5-3                          GenomicRanges_1.50.2                 
[25] GenomeInfoDb_1.34.9                   IRanges_2.32.0                        S4Vectors_0.36.1                     
[28] BiocGenerics_0.44.0                   matrixStats_0.63.0                    data.table_1.14.6                    
[31] stringr_1.5.0                         plyr_1.8.8                            magrittr_2.0.3                       
[34] ggplot2_3.4.0                         gtable_0.3.1                          gtools_3.9.4                         
[37] gridExtra_2.3                         ArchR_1.0.2                          

loaded via a namespace (and not attached):
  [1] utf8_1.2.3               spatstat.explore_3.0-6   reticulate_1.28          R.utils_2.12.2           tidyselect_1.2.0        
  [6] htmlwidgets_1.6.1        BiocParallel_1.32.5      Rtsne_0.16               devtools_2.4.5           munsell_0.5.0           
 [11] codetools_0.2-19         ica_1.0-3                interp_1.1-3             future_1.31.0            miniUI_0.1.1.1          
 [16] withr_2.5.0              spatstat.random_3.1-3    colorspace_2.1-0         progressr_0.13.0         rstudioapi_0.14         
 [21] ROCR_1.0-11              tensor_1.5               listenv_0.9.0            labeling_0.4.2           GenomeInfoDbData_1.2.9  
 [26] hwriter_1.3.2.1          polyclip_1.10-4          farver_2.1.1             pheatmap_1.0.12          parallelly_1.34.0       
 [31] vctrs_0.5.2              generics_0.1.3           ggseqlogo_0.2            R6_2.5.1                 cachem_1.0.6            
 [36] bitops_1.0-7             rhdf5filters_1.10.0      spatstat.utils_3.0-1     DelayedArray_0.24.0      promises_1.2.0.1        
 [41] BiocIO_1.8.0             scales_1.2.1             ggExtra_0.10.0           Cairo_1.6-0              globals_0.16.2          
 [46] processx_3.8.0           goftest_1.2-3            rlang_1.0.6              gggenes_0.5.1            RcppRoll_0.3.0          
 [51] splines_4.2.2            lazyeval_0.2.2           spatstat.geom_3.0-6      BiocManager_1.30.19      yaml_2.3.7              
 [56] reshape2_1.4.4           abind_1.4-5              httpuv_1.6.8             usethis_2.1.6            tools_4.2.2             
 [61] ellipsis_0.3.2           sessioninfo_1.2.2        ggridges_0.5.4           zlibbioc_1.44.0          purrr_1.0.1             
 [66] RCurl_1.98-1.10          prettyunits_1.1.1        ps_1.7.2                 deldir_1.0-6             pbapply_1.7-0           
 [71] viridis_0.6.2            urlchecker_1.0.1         zoo_1.8-11               ggrepel_0.9.2            cluster_2.1.4           
 [76] fs_1.6.0                 scattermore_0.8          lmtest_0.9-40            RANN_2.6.1               fitdistrplus_1.1-8      
 [81] pkgload_1.3.2            patchwork_1.1.2          mime_0.12                xtable_1.8-4             XML_3.99-0.13           
 [86] jpeg_0.1-10              readxl_1.4.1             compiler_4.2.2           tibble_3.1.8             KernSmooth_2.23-20      
 [91] crayon_1.5.2             R.oo_1.25.0              htmltools_0.5.4          mgcv_1.8-41              later_1.3.0             
 [96] tidyr_1.3.0              DBI_1.1.3                MASS_7.3-58.2            ShortRead_1.56.1         cli_3.6.2               
[101] R.methodsS3_1.8.2        parallel_4.2.2           igraph_2.0.1.1           pkgconfig_2.0.3          GenomicAlignments_1.34.0
[106] sp_1.6-0                 plotly_4.10.1            spatstat.sparse_3.0-0    stringdist_0.9.10        callr_3.7.3             
[111] digest_0.6.31            sctransform_0.3.5        RcppAnnoy_0.0.20         spatstat.data_3.0-0      cellranger_1.1.0        
[116] leiden_0.4.3             restfulr_0.0.15          shiny_1.7.4              rjson_0.2.21             lifecycle_1.0.3         
[121] nlme_3.1-162             jsonlite_1.8.4           Rhdf5lib_1.20.0          viridisLite_0.4.1        fansi_1.0.4             
[126] pillar_1.8.1             lattice_0.20-45          pkgbuild_1.4.0           fastmap_1.1.0            httr_1.4.4              
[131] survival_3.5-0           remotes_2.4.2            glue_1.6.2               png_0.1-8                profvis_0.3.7           
[136] stringi_1.7.12           ggfittext_0.10.2         memoise_2.0.1            latticeExtra_0.6-30      irlba_2.3.5.1           
[141] future.apply_1.10.0

rcorces commented 9 months ago

Hi @dannyconrad! Thanks for using ArchR! Please make sure that your post belongs in the Issues section. Only bugs and error reports belong in the Issues section. Usage questions and feature requests should be posted in the Discussions section, not in Issues.
It is worth noting that there are very few actual bugs in ArchR. If you are getting an error, it is probably something specific to your dataset, usage, or computational environment, all of which are extremely challenging to troubleshoot. As such, we require reproducible examples (preferably using the tutorial dataset) from users who want assistance. If you cannot reproduce your error, we will not be able to help. Before going through the work of making a reproducible example, search the previous Issues, Discussions, function definitions, or the ArchR manual and you will likely find the answers you are looking for. If your post does not contain a reproducible example, it is unlikely to receive a response.
In addition to a reproducible example, you must do the following things before we help you, unless your original post already contained this information: 1. If you've encountered an error, have you already searched previous Issues to make sure that this hasn't already been solved? 2. Did you post your log file? If not, add it now. 3.__ Remove any screenshots that contain text and instead copy and paste the text using markdown's codeblock syntax (three consecutive backticks). You can do this by editing your original post.

dannyconrad commented 9 months ago

I didn't realize the Discussion section was separate and just noticed that someone brought up a similar concern at #1935, but it hasn't been answered yet. Since I consider this may be a "bug" even though it doesn't throw an error I'll leave this post in this section for now.

dannyconrad commented 9 months ago

For additional context, here's an example of the scaled vs non-scaled corToDepth vectors, which would be the same if the reduced embedding was getting scaled by component instead of by cell:

> lapply(proj_2@reducedDims$IterativeLSI$corToDepth, head)
$scaled
      LSI1       LSI2       LSI3       LSI4       LSI5       LSI6 
0.03715766 0.23018184 0.09827107 0.10180009 0.11730853 0.22932629 

$none
     LSI1      LSI2      LSI3      LSI4      LSI5      LSI6 
0.9068793 0.3038990 0.1302208 0.1026349 0.1169979 0.2494179

dannyconrad commented 8 months ago

To add to this, it seems (in my hands) that the only serious change that scaleDims brings about is whether or not that first dimension is excluded by the corCutOff parameter. I won't pretend to have a grasp of the underlying math but it seems the row-scaling of the LSI matrix does not have much of an impact on the final UMAP embedding. Based on my testing below, my best guess is this is a result of the use of cosine distance metric in running uwot::umap().

To try things out on a dataset I'm working on, I implemented my own column-based scaling and applied it to the non-scaled LSI embedding like so:

colZscores <- function (m = NULL) 
{
  z <- sweep(t(t(m) - colMeans(m)), 2, matrixStats::colSds(m), `/`)
  return(z)
}

proj@reducedDims$LSI_ATAC_ColScale <- proj@reducedDims$LSI_ATAC_Raw
proj@reducedDims$LSI_ATAC_ColScale$matSVD <- colZscores(proj@reducedDims$LSI_ATAC_ColScale$matSVD)

Here are the 2nd and 3rd LSI dimensions plotted raw, with row-scaling, and with column-scaling.

Here are the resulting UMAPs computed using cosine distance (default). I believe this shows that only column-scaling transforms the UMAP embedding in a meaningful way compared to using the raw LSI values.

And finally, just out of curiosity, I computed UMAPs using euclidean distance. Interestingly, the row-scaled LSI UMAP looks very similar to the cosine-computed raw and row-scaled UMAPs. Not really sure why but thought it was interesting.

All UMAPs were generated with dimsToUse=2:30.

Since the goal of the Z-scaling is to reduce bias from dominant early PCs, I think this demonstrates that the current implementation is not actually accomplishing that.

GreenleafLab / ArchR

Is scaleDims scaling along the wrong axis? #2120