MarioniLab / miloR

R package implementation of Milo for testing for differential abundance in KNN graphs
https://bioconductor.org/packages/release/bioc/html/miloR.html
GNU General Public License v3.0
343 stars 22 forks source link

buildFromAdjacency huge RAM usage #351

Closed pedriniedoardo closed 1 week ago

pedriniedoardo commented 1 week ago

Describe the bug I experience a huge amount of RAM usage when trying to add a custom graph object to the milo object using the recommended miloR::graph(milo) <- miloR::graph(buildFromAdjacency(sobj@graphs$RNA_snn, k=10))

Minimum code example Minimum example to reproduce the case

# ref <- readRDS(file = "../../data/harmonyHO.rds")
# graph_obj <- ref@graphs$RNA_snn
# saveRDS(graph_obj,"../../out/graph_obj.rds")

graph_obj <- readRDS("../../out/object/graph_obj.rds")
graph_obj.size <- object.size(graph_obj)
print(graph_obj.size,units = "Mb")
> 124.9 Mb

# build the adiacency matrix
test <- buildFromAdjacency(graph_obj, k=10,is.binary = F)
test.size <- object.size(test)
print(test.size,units = "Mb")
> 67.1 Mb

# miloR::graph(milo3) <- miloR::graph(test)

image

Session info Output of sessionInfo()

R version 4.4.1 (2024-06-14)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Rome
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] Seurat_5.1.0       miloR_2.2.0        edgeR_4.4.0        limma_3.62.1       SeuratObject_5.0.2 sp_2.1-4          

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22            splines_4.4.1               later_1.3.2                 tibble_3.2.1               
  [5] polyclip_1.10-7             fastDummies_1.7.4           lifecycle_1.0.4             globals_0.16.3             
  [9] lattice_0.22-6              MASS_7.3-60.2               magrittr_2.0.3              plotly_4.10.4              
 [13] httpuv_1.6.15               sctransform_0.4.1           spam_2.11-0                 spatstat.sparse_3.1-0      
 [17] reticulate_1.39.0           cowplot_1.1.3               pbapply_1.7-2               RColorBrewer_1.1-3         
 [21] abind_1.4-8                 zlibbioc_1.52.0             Rtsne_0.17                  GenomicRanges_1.58.0       
 [25] purrr_1.0.2                 ggraph_2.2.1                BiocGenerics_0.52.0         pracma_2.4.4               
 [29] tweenr_2.0.3                GenomeInfoDbData_1.2.13     IRanges_2.40.0              S4Vectors_0.44.0           
 [33] ggrepel_0.9.6               irlba_2.3.5.1               listenv_0.9.1               spatstat.utils_3.1-1       
 [37] goftest_1.2-3               RSpectra_0.16-2             spatstat.random_3.3-2       fitdistrplus_1.2-1         
 [41] parallelly_1.38.0           leiden_0.4.3.1              codetools_0.2-20            DelayedArray_0.32.0        
 [45] ggforce_0.4.2               tidyselect_1.2.1            UCSC.utils_1.2.0            farver_2.1.2               
 [49] ScaledMatrix_1.14.0         viridis_0.6.5               matrixStats_1.4.1           stats4_4.4.1               
 [53] spatstat.explore_3.3-3      jsonlite_1.8.9              BiocNeighbors_2.0.0         tidygraph_1.3.1            
 [57] progressr_0.15.0            ggridges_0.5.6              survival_3.6-4              tools_4.4.1                
 [61] ica_1.0-3                   Rcpp_1.0.13-1               glue_1.8.0                  gridExtra_2.3              
 [65] SparseArray_1.6.0           MatrixGenerics_1.18.0       GenomeInfoDb_1.42.0         dplyr_1.1.4                
 [69] withr_3.0.2                 numDeriv_2016.8-1.1         BiocManager_1.30.25         fastmap_1.2.0              
 [73] fansi_1.0.6                 digest_0.6.37               rsvd_1.0.5                  R6_2.5.1                   
 [77] mime_0.12                   colorspace_2.1-1            scattermore_1.2             gtools_3.9.5               
 [81] tensor_1.5                  spatstat.data_3.1-2         utf8_1.2.4                  tidyr_1.3.1                
 [85] generics_0.1.3              renv_1.0.10                 data.table_1.16.2           graphlayouts_1.2.0         
 [89] httr_1.4.7                  htmlwidgets_1.6.4           S4Arrays_1.6.0              uwot_0.2.2                 
 [93] pkgconfig_2.0.3             gtable_0.3.6                lmtest_0.9-40               SingleCellExperiment_1.28.0
 [97] XVector_0.46.0              htmltools_0.5.8.1           dotCall64_1.2               scales_1.3.0               
[101] Biobase_2.66.0              png_0.1-8                   spatstat.univar_3.1-1       rstudioapi_0.17.1          
[105] reshape2_1.4.4              nlme_3.1-164                cachem_1.1.0                zoo_1.8-12                 
[109] stringr_1.5.1               KernSmooth_2.23-24          parallel_4.4.1              miniUI_0.1.1.1             
[113] vipor_0.4.7                 pillar_1.9.0                grid_4.4.1                  vctrs_0.6.5                
[117] RANN_2.6.2                  promises_1.3.0              BiocSingular_1.22.0         beachmat_2.22.0            
[121] xtable_1.8-4                cluster_2.1.6               beeswarm_0.4.0              cli_3.6.3                  
[125] locfit_1.5-9.10             compiler_4.4.1              rlang_1.1.4                 crayon_1.5.3               
[129] future.apply_1.11.3         plyr_1.8.9                  ggbeeswarm_0.7.2            stringi_1.8.4              
[133] viridisLite_0.4.2           deldir_2.0-4                BiocParallel_1.40.0         munsell_0.5.1              
[137] lazyeval_0.2.2              spatstat.geom_3.3-3         Matrix_1.7-0                RcppHNSW_0.6.0             
[141] patchwork_1.3.0             future_1.34.0               ggplot2_3.5.1               statmod_1.5.0              
[145] shiny_1.9.1                 SummarizedExperiment_1.36.0 ROCR_1.0-11                 igraph_2.1.1               
[149] memoise_2.0.1              

I fear that if I use a bigger dataset (this was a ~130k cells dataset), the processing might fail. Is there a way to avoid this huge (intermediate, meaning the input is 100Mb and the output is 60Mb, but during the processing, it spikes to 256GB) RAM usage?

I have tried to run the buildFromAdjacency using is.binary = T. The RAM usage is very moderate, but the problem is downstream. I cannot get good Neighbour communities during the makeNhoods run.

milo3 <- makeNhoods(milo3, prop = 0.1, k = 10, d=30, refined = TRUE)
plotNhoodSizeHist(milo3)

image

This is the comparison of the Neighbour communities size using the is.binary = T parameter.

milo4 <- makeNhoods(milo4, prop = 0.1, k = 10, d=30, refined = TRUE)
plotNhoodSizeHist(milo4)

image

This is much better. Alternatively, do you have any suggestions for building communities of better size using the is.binary = T parameter?

here the sample graph_obj I have used. Thank you very much for maintaining the tool!

MikeDMorgan commented 1 week ago

Hi @pedriniedoardo can you double-check what the class of ref@graphs$RNA_snn is?

pedriniedoardo commented 1 week ago

hello @MikeDMorgan

> class(ref@graphs$RNA_snn)
[1] "Graph"
attr(,"package")
[1] "SeuratObject"
MikeDMorgan commented 1 week ago

buildFromAdjacency expects a matrix as input, i.e. an adjacency matrix. Without delving into Seurat, it's not clear what this SeuratObject/Graph object representation is. Try casting to a dense matrix format to see if you get the same memory spike, then compare to casting to a sparse matrix format.

pedriniedoardo commented 1 week ago

Thank you for the suggestion @MikeDMorgan !

When I look at the structure of the graph_obj extracted from seurat ref@graphs$RNA_snn it definitely looks like a sparse matrix to me...

> str(graph_obj)
Formal class 'Graph' [package "SeuratObject"] with 7 slots
  ..@ assay.used: chr "RNA"
  ..@ i         : int [1:8925812] 0 731 3101 3656 5920 7085 11523 12065 12548 16391 ...
  ..@ p         : int [1:132737] 0 35 62 157 202 276 362 433 488 560 ...
  ..@ Dim       : int [1:2] 132736 132736
  ..@ Dimnames  :List of 2
  .. ..$ : chr [1:132736] "108_AAACCCAAGATACCAA-1" "108_AAACCCAAGGCCCGTT-1" "108_AAACCCACACAGTACT-1" "108_AAACCCACAGGGACTA-1" ...
  .. ..$ : chr [1:132736] "108_AAACCCAAGATACCAA-1" "108_AAACCCAAGGCCCGTT-1" "108_AAACCCACACAGTACT-1" "108_AAACCCACAGGGACTA-1" ...
  ..@ x         : num [1:8925812] 1 0.0811 0.1111 0.0811 0.1429 ...
  ..@ factors   : list()

I therefore changed the class of the object using as.

> test_sparse <- as(graph_obj, "TsparseMatrix")
> str(test_sparse)
Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:8925812] 0 731 3101 3656 5920 7085 11523 12065 12548 16391 ...
  ..@ j       : int [1:8925812] 0 0 0 0 0 0 0 0 0 0 ...
  ..@ Dim     : int [1:2] 132736 132736
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:132736] "108_AAACCCAAGATACCAA-1" "108_AAACCCAAGGCCCGTT-1" "108_AAACCCACACAGTACT-1" "108_AAACCCACAGGGACTA-1" ...
  .. ..$ : chr [1:132736] "108_AAACCCAAGATACCAA-1" "108_AAACCCAAGGCCCGTT-1" "108_AAACCCACACAGTACT-1" "108_AAACCCACAGGGACTA-1" ...
  ..@ x       : num [1:8925812] 1 0.0811 0.1111 0.0811 0.1429 ...
  ..@ factors : list()

Now, if I print the object, it definitely is a sparse matrix non binary matrix with the expected dimensions.

> test_sparse
132736 x 132736 sparse Matrix of class "dgTMatrix"
  [[ suppressing 59 column names ‘108_AAACCCAAGATACCAA-1’, ‘108_AAACCCAAGGCCCGTT-1’, ‘108_AAACCCACACAGTACT-1’ ... ]]
  [[ suppressing 59 column names ‘108_AAACCCAAGATACCAA-1’, ‘108_AAACCCAAGGCCCGTT-1’, ‘108_AAACCCACACAGTACT-1’ ... ]]

108_AAACCCAAGATACCAA-1 1 . . . . . . . . . . .          . . . . . . . . . . . . . . . . . . .         . . . . . . . . . . . . . . . . . . . . . . . .
108_AAACCCAAGGCCCGTT-1 . 1 . . . . . . . . . .          . . . . . . . . . . . . . . . . . . .         . . . . . . . . . . . . . . . . . . . . . . . .
108_AAACCCACACAGTACT-1 . . 1 . . . . . . . . 0.08108108 . . . . . . . . . . . . . . . . . . .         . . . . . . . . . . . . . . . . . . . . . . . .
108_AAACCCACAGGGACTA-1 . . . 1 . . . . . . . .          . . . . . . . . . . . . . . . . . . .         . . . . . . . . . . . . . . . . . . . . . . . .
108_AAACCCACATAATCGC-1 . . . . 1 . . . . . . .          . . . . . . . . . . . . . . . . . . .         . . . . . . . . . . . . . . . . . . . . . . . .
108_AAACCCACATGTTACG-1 . . . . . 1 . . . . . .          . . . . . . . . . . . . . . . . . . 0.3333333 . . . . . . . . . . . . . . . . . . . . . . . .
108_AAACCCAGTAACTAAG-1 . . . . . . 1 . . . . .          . . . . . . . . . . . . . . . . . . .         . . . . . . . . . . . . . . . . . . . . . . . .
108_AAACCCAGTCCCTGTT-1 . . . . . . . 1 . . . .          . . . . . . . . . . . . . . . . . . .         . . . . . . . . . . . . . . . . . . . . . . . .

108_AAACCCAAGATACCAA-1 . . . . ......
108_AAACCCAAGGCCCGTT-1 . . . . ......
108_AAACCCACACAGTACT-1 . . . . ......
108_AAACCCACAGGGACTA-1 . . . . ......
108_AAACCCACATAATCGC-1 . . . . ......
108_AAACCCACATGTTACG-1 . . . . ......
108_AAACCCAGTAACTAAG-1 . . . . ......
108_AAACCCAGTCCCTGTT-1 . . . . ......

 ..............................
 ........suppressing 132677 columns and 132720 rows in show(); maybe adjust options(max.print=, width=)
 ..............................
  [[ suppressing 59 column names ‘108_AAACCCAAGATACCAA-1’, ‘108_AAACCCAAGGCCCGTT-1’, ‘108_AAACCCACACAGTACT-1’ ... ]]

95_TTTGTTGCATCTCATT-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95_TTTGTTGCATGAGTAA-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95_TTTGTTGGTCCCTCAT-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95_TTTGTTGGTGGATACG-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95_TTTGTTGGTTGTAAAG-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95_TTTGTTGTCACCATAG-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95_TTTGTTGTCAGGAACG-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......
95_TTTGTTGTCTACCAGA-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ......

The problem is that if I now use this matrix instead of the graph object, inside the buildFromAdjacency, I still observe the spike in RAM usage.

I therefore looked inside the implementation of buildFromAdjacency, and I was trying to generate the milo-compatible graph from scratch. I think the problem is at line 72 of the function implementation.

 # use igraph if it's square
    if(is.square){
        if(!is.binary){
            bin.x <- as(matrix(as.numeric(x > 0), nrow=nrow(x)), "dgCMatrix")
            nn.graph <- graph_from_adjacency_matrix(bin.x, mode="max",
                                                    weighted=NULL,
                                                    diag=FALSE)
        } 

The matrix is squared and not binary. here is the sample sparse matrix. If I run, bin.x <- as(matrix(as.numeric(x > 0), nrow=nrow(x)), "dgCMatrix") I get the RAM spike. Do you have any suggestion ?

MikeDMorgan commented 1 week ago

Yes, the issue is the coercion goes through a dense matrix because the object from Seurat can't be directly cast into a sparse matrix format. This is an issue with using weighted graphs - Milo assumes an unweighted graph, as you can see here because weighted=NULL.

Because Milo never uses the weights, I would input a sparse binary matrix yourself to save the memory spike, and pass that directly into buildFromAdjacency.

pedriniedoardo commented 1 week ago

Thank you very much @MikeDMorgan your suggestion has solved everything! here my solution:

# load the graph object extracted from seurat ref@graphs$RNA_snn
graph_obj <- readRDS("../../sharespace/graph_obj.rds")
str(graph_obj)

# make it a sparse matrix
graph_obj_fix <- as(graph_obj, "dgCMatrix")

# make it a binary sparse matrix
graph_obj_fix2 <- graph_obj_fix
graph_obj_fix2@x <- rep(x = 1,length(graph_obj_fix2@x))

# build the adiacency matrix using binary as T
test <- buildFromAdjacency(graph_obj_fix2, k=10,is.binary = T)

# load the original reference dataset
ref <- readRDS(file = "/beegfs/scratch/ric.cosr/ric.cosr/ric.brunelli/BrunelliS_1810_scRNA_wt_mut_mice/data/R_out/RDS/harmonyHO.rds")

# build the milo object
sce3 <- as.SingleCellExperiment(ref)
milo3 <- Milo(sce3)

# add the graph to the object
miloR::graph(milo3) <- miloR::graph(test)

# build communities
milo3 <- makeNhoods(milo3, prop = 0.1, k = 10, d=30, refined = TRUE)
plotNhoodSizeHist(milo3)

image

It is all working now! Thank you very much!