Vivianstats / scImpute

Accurate and robust imputation of scRNA-seq data
https://www.nature.com/articles/s41467-018-03405-7

Processing time #19

Closed · hr1912 closed this issue 5 years ago

hr1912 commented 5 years ago

Hi Vivian,

I am trying to impute dropouts from a CSV of UMI counts (around 40,000 genes and 6,000 cells).

The code is listed below.

scImpute::scimpute(count_path = "all_umi_raw.csv", infile = "csv", outfile = "csv", out_dir = "test_scimpute", labeled = FALSE, drop_thre = 0.5, Kcluster = 5, ncores = 10)

It is using over 60 gigabytes of RAM and running slowly. Is that normal? Is there a way to make it faster?
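As an illustrative aside, not part of the original report: one generic way to shrink the input before imputation is to drop genes that are zero in every cell. The filtered file name below is hypothetical.

# Hedged sketch: pre-filter all-zero genes so the matrix passed to scimpute() is smaller.
raw <- read.csv("all_umi_raw.csv", row.names = 1, check.names = FALSE)
keep <- rowSums(raw) > 0                        # genes observed in at least one cell
write.csv(raw[keep, ], "all_umi_filtered.csv")  # hypothetical output file
# then set count_path = "all_umi_filtered.csv" in the scimpute() call above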

Thanks!

Vivianstats commented 5 years ago

Hello,

Sorry for the late reply. The closest case I have run is a UMI dataset with ~20,000 genes and 4,500 cells, which finished within 1 hour using 30 cores.

Did you manage to obtain the results, and if yes, how long did it take?

hr1912 commented 5 years ago

Hi Vivian,

Thanks for your reply. I have not gotten results yet because it is still running 😜. I think it is stuck on calculating distances between cells.

I am pasting the log and r session info below for your information:

[1] "reading in raw count matrix ..." 
[1] "number of genes in raw count matrix 6104"
[1] "number of cells in raw count matrix 40534"
[1] "reading finished!"
[1] "imputation starts ..."
[1] "searching candidate neighbors ... "
[1] "inferring cell similarities ..."
[1] "dimension reduction ..."
[1] "calculating cell distances ..."
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS: /extraspace/hruan/softs/R-3.5.1/lib64/R/lib/libRblas.so
LAPACK: /extraspace/hruan/softs/R-3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] scImpute_0.0.8    doParallel_1.0.11 iterators_1.0.10  foreach_1.4.4    
[5] penalized_0.9-51  survival_2.42-3  

loaded via a namespace (and not attached):
[1] compiler_3.5.1   Matrix_1.2-14    rsvd_0.9         Rcpp_0.12.18    
[5] codetools_0.2-15 splines_3.5.1    grid_3.5.1       kernlab_0.9-26  
[9] lattice_0.20-35 
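An aside worth flagging: the log above reports 6,104 genes and 40,534 cells, the reverse of the ~40,000 genes and ~6,000 cells described at the top of the issue. scImpute expects the count matrix with genes in rows and cells in columns, so a quick sanity check of the CSV orientation might look like this hedged sketch; the transposed file name is hypothetical.

# Hedged sketch: verify the CSV is oriented as genes (rows) x cells (columns).
raw <- read.csv("all_umi_raw.csv", row.names = 1, check.names = FALSE)
dim(raw)  # expect roughly 40000 rows and 6000 columns for this dataset
# if the dimensions are flipped, transpose and rewrite before running scimpute()
write.csv(t(raw), "all_umi_genes_by_cells.csv")  # hypothetical output file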

Vivianstats commented 5 years ago

Given that you have over 40,000 cells, it is expected to take longer, but it is surprising that it is still at the stage of calculating cell similarities. I have updated the package to make this step faster. In your case, could you let me know whether you are using an independent server or computing nodes from a cluster?
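Back-of-the-envelope aside: the pairwise cell-distance step scales quadratically with the number of cells, which is the main reason 40,000+ cells is so much heavier than the ~4,500-cell dataset mentioned above. A minimal sketch, assuming a dense double-precision distance matrix:

# Hedged sketch: rough size of a dense cell-by-cell distance matrix.
n_cells <- 40534
n_cells^2 * 8 / 2^30    # ~12.2 GiB of doubles for one n x n matrix
(40534 / 4500)^2        # ~81x more cell pairs than a 4,500-cell dataset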

hr1912 commented 5 years ago

Hi Vivian,

We are using an independent server (RHEL 7) with 48 CPU cores and 512 GB of RAM.
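A small hedged check, not from the original exchange: confirming the cores R can actually see before choosing ncores.

# Hedged sketch: cores visible to the R session on this server.
parallel::detectCores()  # should report 48 here
# scimpute()'s ncores sets the cores used for its parallel computation; leaving a
# few cores free for the OS (e.g. ncores = 40) is a common, though optional, choice.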

Vivianstats commented 5 years ago

Thanks for the information. I would expect it to run faster on your platform, but let me do some experiments on my side to check.