brianhie / scanorama

Panoramic stitching of single cell data
http://scanorama.csail.mit.edu
MIT License

Running Scanorama in parallel with Reticulate #146

Open weshorton opened 1 year ago

weshorton commented 1 year ago

Hello,

I'm having trouble getting scanorama to run in a reasonable time on my computer through the R interface. I realize this is a little outside the scope of scanorama itself, but I wanted to see whether anyone else has had this issue and figured it out. I followed the instructions from this issue/discussion thread and have the following:

library(reticulate)
use_python("path/to/python")
scanorama <- import("scanorama")
assaylist <- readRDS("./assaylist.rds")
genelist <- readRDS("./genelist.rds")

integrated.corrected.data <- scanorama$correct(assaylist, genelist, return_dense = TRUE, return_dimred = TRUE)

This takes far longer than expected (I'm basing that on the 15-minute estimate here). I killed the process about 45 minutes in, and this was the output so far:

Found 22490 genes among all datasets
[[0.         0.70495716 0.09534884]
 [0.         0.         0.68222932]
 [0.         0.         0.        ]]
Processing datasets (0, 1)

I'm trying to integrate 3 scRNA-seq datasets with 16,340 cells, 25,981 cells, and 68,433 cells:

> lapply(assaylist, dim)
[[1]]
[1] 16340 32285

[[2]]
[1] 25981 22490

[[3]]
[1] 68433 32285

I tried to follow the instructions for checking that numpy is using multiple cores (found here), but my Python installation does not seem to match those instructions. I can't find the dist-packages directory to check whether numpy is linked to OpenBLAS.

which python
/home/lab/miniconda3/bin/python
python --version
Python 3.7.10
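One way to check the BLAS linkage without locating dist-packages might be to ask numpy directly. This is a minimal sketch, assuming numpy can be imported by the same interpreter that reticulate uses; the helper name `numpy_build_report` is hypothetical and not part of scanorama or numpy.

```python
import contextlib
import io


def numpy_build_report():
    """Return numpy's build configuration as text, or a note if numpy is missing.

    Hypothetical helper for illustration only.
    """
    try:
        import numpy as np
    except ImportError:
        return "numpy is not installed in this interpreter"
    buffer = io.StringIO()
    # show_config() prints the BLAS/LAPACK libraries numpy was built against;
    # capture that output so it can be inspected or searched.
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


print(numpy_build_report())
```

If the report mentions openblas or mkl, numpy is probably linked against a multithreaded BLAS; if neither appears, that may explain the single-core behavior.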

I created the test.py script and ran it, and Python appears to use only 100% CPU (that is, a single core), which would indicate that numpy is not running in parallel. I saw the same CPU usage while running the scanorama$correct call.
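A possible cause of a single-core ceiling is an environment variable that caps the BLAS thread pool. The sketch below only reports the current settings. It assumes the commonly used variables `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, and `MKL_NUM_THREADS` are the relevant ones on this machine, and the helper name `thread_settings` is hypothetical.

```python
import os

# Environment variables commonly read by OpenMP, OpenBLAS, and MKL to cap
# their thread counts. Assumption: one of these backends is in use here.
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def thread_settings(environ=None):
    """Map each thread-limit variable to its value, or None if it is unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in THREAD_VARS}


print("Logical cores:", os.cpu_count())
for name, value in thread_settings().items():
    print(f"{name} = {value if value is not None else '(unset)'}")
```

A value of 1 for any of these would be consistent with the 100% CPU reading. Running the same check from R through reticulate would show whether the R session passes different settings to Python than the shell does.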

I also tried using the future package with plan("multisession") (found here), but I still see only one process running.

I don't see how I can use foreach or one of the other parallel R calls, because there's nothing to apply over: there is just the one call to scanorama$correct.

Thanks!