cole-trapnell-lab / cicero

MIT License

compare_connections() crashes R session #19

Closed cgoneill closed 1 year ago

cgoneill commented 1 year ago

Hello, and thank you for your work on this excellent software package. I've been trying to use compare_connections() to compare two Cicero connection datasets of 23192732 pairs each, but each time I try, my memory usage maxes out after a few minutes and crashes my session. At steady state, my session uses about 9.92 GB of memory, but even when I have 256 GB allocated on an HPC cluster, my memory usage gradually increases until my R session crashes. Here's my session info:

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/local/intel/compilers_and_libraries_2020.2.254/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] cowplot_1.1.1               openxlsx_4.2.5.1            rtracklayer_1.58.0          magrittr_2.0.3              cicero_1.3.8                Gviz_1.42.0                
 [7] monocle3_1.3.1              SingleCellExperiment_1.20.0 SummarizedExperiment_1.28.0 GenomicRanges_1.50.1        GenomeInfoDb_1.34.3         IRanges_2.32.0             
[13] S4Vectors_0.36.0            MatrixGenerics_1.10.0       matrixStats_0.63.0          Biobase_2.58.0              BiocGenerics_0.44.0         patchwork_1.1.2            
[19] ggplot2_3.4.0               SeuratWrappers_0.3.1        SeuratObject_4.1.3          Seurat_4.2.1                Signac_1.8.0               

loaded via a namespace (and not attached):
  [1] utf8_1.2.2               spatstat.explore_3.0-5   reticulate_1.26          R.utils_2.12.1           tidyselect_1.2.0         lme4_1.1-31              RSQLite_2.2.18          
  [8] AnnotationDbi_1.60.0     htmlwidgets_1.5.4        BiocParallel_1.32.1      Rtsne_0.16               munsell_0.5.0            codetools_0.2-18         ica_1.0-3               
 [15] interp_1.1-3             future_1.29.0            miniUI_0.1.1.1           withr_2.5.0              spatstat.random_3.0-1    colorspace_2.0-3         progressr_0.11.0        
 [22] filelock_1.0.2           knitr_1.41               rstudioapi_0.14          ROCR_1.0-11              tensor_1.5               listenv_0.8.0            GenomeInfoDbData_1.2.9  
 [29] polyclip_1.10-4          bit64_4.0.5              parallelly_1.32.1        vctrs_0.5.1              generics_0.1.3           xfun_0.34                biovizBase_1.46.0       
 [36] BiocFileCache_2.6.0      R6_2.5.1                 rsvd_1.0.5               VGAM_1.1-7               AnnotationFilter_1.22.0  bitops_1.0-7             spatstat.utils_3.0-1    
 [43] cachem_1.0.6             DelayedArray_0.24.0      assertthat_0.2.1         promises_1.2.0.1         BiocIO_1.8.0             scales_1.2.1             nnet_7.3-18             
 [50] gtable_0.3.1             globals_0.16.1           goftest_1.2-3            ensembldb_2.22.0         rlang_1.0.6              RcppRoll_0.3.0           splines_4.2.2           
 [57] lazyeval_0.2.2           dichromat_2.0-0.1        checkmate_2.1.0          spatstat.geom_3.0-3      BiocManager_1.30.19      yaml_2.3.6               reshape2_1.4.4          
 [64] abind_1.4-5              GenomicFeatures_1.50.2   backports_1.4.1          httpuv_1.6.6             Hmisc_4.7-2              tools_4.2.2              ellipsis_0.3.2          
 [71] RColorBrewer_1.1-3       ggridges_0.5.4           Rcpp_1.0.9               plyr_1.8.7               base64enc_0.1-3          progress_1.2.2           zlibbioc_1.44.0         
 [78] purrr_0.3.5              RCurl_1.98-1.9           prettyunits_1.1.1        rpart_4.1.19             deldir_1.0-6             pbapply_1.5-0            zoo_1.8-11              
 [85] ggrepel_0.9.2            cluster_2.1.4            data.table_1.14.4        scattermore_0.8          lmtest_0.9-40            RANN_2.6.1               ProtGenerics_1.30.0     
 [92] fitdistrplus_1.1-8       hms_1.1.2                mime_0.12                xtable_1.8-4             XML_3.99-0.12            jpeg_0.1-9               gridExtra_2.3           
 [99] compiler_4.2.2           biomaRt_2.54.0           tibble_3.1.8             KernSmooth_2.23-20       crayon_1.5.2             minqa_1.2.5              R.oo_1.25.0             
[106] htmltools_0.5.3          later_1.3.0              Formula_1.2-4            tidyr_1.2.1              DBI_1.1.3                dbplyr_2.2.1             rappdirs_0.3.3          
[113] MASS_7.3-58.1            boot_1.3-28              Matrix_1.5-3             cli_3.4.1                R.methodsS3_1.8.2        parallel_4.2.2           igraph_1.3.5            
[120] pkgconfig_2.0.3          GenomicAlignments_1.34.0 foreign_0.8-83           sp_1.5-1                 terra_1.6-17             plotly_4.10.1            spatstat.sparse_3.0-0   
[127] xml2_1.3.3               XVector_0.38.0           VariantAnnotation_1.44.0 stringr_1.4.1            digest_0.6.30            sctransform_0.3.5        RcppAnnoy_0.0.20        
[134] spatstat.data_3.0-0      Biostrings_2.66.0        leiden_0.4.3             fastmatch_1.1-3          htmlTable_2.4.1          uwot_0.1.14              curl_4.3.3              
[141] restfulr_0.0.15          shiny_1.7.3              Rsamtools_2.14.0         rjson_0.2.21             nloptr_2.0.3             lifecycle_1.0.3          nlme_3.1-160            
[148] jsonlite_1.8.3           viridisLite_0.4.1        BSgenome_1.66.1          fansi_1.0.3              pillar_1.8.1             lattice_0.20-45          KEGGREST_1.38.0         
[155] fastmap_1.1.0            httr_1.4.4               survival_3.4-0           glue_1.6.2               remotes_2.4.2            zip_2.2.2                png_0.1-7               
[162] bit_4.0.4                stringi_1.7.8            blob_1.2.3               latticeExtra_0.6-30      memoise_2.0.1            dplyr_1.0.10             irlba_2.3.5.1           
[169] future.apply_1.10.0     

I'm assuming compare_connections() isn't necessarily meant to compare two Cicero datasets (the vignette example really only mentions comparing a Cicero dataset to non-Cicero datasets), and if that's correct, is there a way to do so?

hpliner commented 1 year ago

Hi, to confirm that this is an issue with size, can you try running compare_connections with just a small subset of one of the datasets?

cgoneill commented 1 year ago

I ran code as follows:

> ko.conns.chr3 <- ko.conns[grepl("^chr3-", ko.conns$Peak1) & grepl("^chr3-", ko.conns$Peak2), ] # subsets the dataset from 23192372 connections to 1225206, all on chromosome 3
>
> head(compare_connections(wt.conns, ko.conns.chr3))
[1] FALSE FALSE FALSE FALSE FALSE FALSE

I was also able to run compare_connections() on subsets of both tables using only connections on chromosome 3, both of which had 1225206 connections.

hpliner commented 1 year ago

Hi, it sounds like this is an issue with the size. I think for the moment your best bet would be to just run by chromosome. Since Cicero does not generate connections between chromosomes, you should be able to run on each chromosome separately and concatenate the results. I'll keep this issue open and see if I can write a fix next time I have some engineering time, but that may be a while. Happy new year!
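
A minimal sketch of the per-chromosome workaround, not a tested fix. It assumes both tables have Peak1 and Peak2 columns with names that start with the chromosome (such as "chr3-..." in the subset above), and that every connection stays within one chromosome. The helper name compare_connections_by_chr and its compare_fun argument are hypothetical and not part of cicero; only compare_connections(conns1, conns2) comes from the package.

```r
# Hypothetical helper (not part of cicero): run the comparison one
# chromosome at a time and return the results in the row order of conns1.
compare_connections_by_chr <- function(conns1, conns2,
                                       compare_fun = cicero::compare_connections,
                                       ...) {
  # Grouping key: everything before the first "-", "_" or ":" in Peak1.
  # Assumption: the same key is derived for both tables, so pairs that could
  # match always end up in the same group.
  key1 <- sub("[-_:].*$", "", as.character(conns1$Peak1))
  key2 <- sub("[-_:].*$", "", as.character(conns2$Peak1))

  # Start with FALSE everywhere; rows with no counterpart group stay FALSE.
  found <- logical(nrow(conns1))

  for (chr in unique(key1)) {
    idx1 <- which(key1 == chr)
    sub2 <- conns2[key2 == chr, , drop = FALSE]
    if (nrow(sub2) == 0) next  # nothing on this chromosome to compare against

    # Extra arguments (for example maxgap) are passed through unchanged.
    found[idx1] <- compare_fun(conns1[idx1, , drop = FALSE], sub2, ...)
  }

  found
}
```

With the objects from this thread, the call would look like wt.in.ko <- compare_connections_by_chr(wt.conns, ko.conns), which should give one logical value per row of wt.conns. Each per-chromosome call still has to fit in memory, so peak usage depends on the largest chromosome.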

cgoneill commented 1 year ago

Thank you! Happy new year!