dviraran / SingleR

SingleR: Single-cell RNA-seq cell types Recognition (legacy version)
GNU General Public License v3.0

Parallelize SingleR.ScoreData with "doFuture" #14

Closed shenyang1981 closed 5 years ago

shenyang1981 commented 5 years ago

Hello, thank you for developing SingleR. I like it very much. I am wondering whether it would be possible to parallelize one of the time-consuming steps, scoring the data, and to let users pass the "step" option to SingleR.ScoreData through the SingleR function. Here is my solution, which also fits the framework used in Seurat V3. Please feel free to let me know your comments. Thanks a lot.

dviraran commented 5 years ago

Hi,

Thanks! Great to hear that it helps others with their research.

I'm a bit confused here. In my experience, the time-consuming step is usually not the ScoreData function, which is pretty simple. The step parameter only exists to allow running samples with tens of thousands of cells without consuming too much memory. The real time-consuming step is the fine-tuning step, and it is already parallelized. I'm sure there is a lot more to do there to make it more efficient.

Did you see an improvement in run time with this fix? If so, I'll be happy to merge it.

Best, Dvir

shenyang1981 commented 5 years ago

Hi Dvir,

Thank you for replying.

As you pointed out, the most time-consuming step is the fine-tuning step. It can take several days to analyze a large dataset (~100,000 cells). Nevertheless, in my pipeline, since the purpose of using SingleR is to annotate major cell types automatically, the run time can be greatly reduced by setting the fine-tuning threshold to 0.01 without losing accuracy.

Calculating the Spearman correlation then became the rate-limiting step, so I decided to parallelize it. I used "doFuture" rather than pbmcapply for two reasons: 1) I ran into a problem where some sub-processes were lost while the server was fully occupied; 2) Seurat V3 uses future for parallelization, so the two packages can easily be aligned in an analysis.
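The chunked, parallel Spearman correlation described above can be sketched roughly as follows. This is a minimal illustration, not the code merged into SingleR: the helper name `chunked_spearman`, the toy matrices, and the choice of two workers are all assumptions made for the example. Only the use of doFuture with a `step`-sized chunking comes from the thread.

```r
library(future)
library(foreach)
library(doFuture)

registerDoFuture()                # route %dopar% through the future framework
plan(multisession, workers = 2)   # worker count is an arbitrary choice here

# Hypothetical helper (not SingleR's implementation): Spearman correlation of
# every cell (column of `sc`) against every reference sample (column of `ref`),
# computed in chunks of `step` cells so each task only touches a slice.
chunked_spearman <- function(sc, ref, step = 1000) {
  starts <- seq(1, ncol(sc), by = step)
  foreach(s = starts, .combine = rbind) %dopar% {
    idx <- s:min(s + step - 1, ncol(sc))
    stats::cor(sc[, idx, drop = FALSE], ref, method = "spearman")
  }
}

set.seed(1)
sc  <- matrix(rnorm(200 * 50), nrow = 200)  # toy data: 200 genes x 50 cells
ref <- matrix(rnorm(200 * 5),  nrow = 200)  # toy data: 200 genes x 5 references
scores <- chunked_spearman(sc, ref, step = 20)
dim(scores)                                 # one row per cell, one column per reference

plan(sequential)                            # release the workers
```

Because the backend is chosen with `plan()`, the same function runs sequentially or on several cores without changes, which is the property that makes it line up with Seurat V3's use of future.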

I have uploaded benchmark results for i) threshold 0.05, ii) threshold 0.01 without parallelization and iii) threshold 0.01 with parallelization (4 cores). Analyzing 8k cells from the PBMC dataset took 443s, 122s and 86s, respectively.
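For reference, the speedups implied by these three timings can be worked out directly. The numbers are the ones reported above; the variable names are only illustrative.

```r
# Run times in seconds as reported in this thread (8k cells, PBMC dataset).
runtime <- c(thres_0.05 = 443, thres_0.01 = 122, thres_0.01_4cores = 86)

# Gain from lowering the fine-tuning threshold from 0.05 to 0.01.
speedup_threshold <- runtime[["thres_0.05"]] / runtime[["thres_0.01"]]

# Additional gain from parallelizing the correlation step on 4 cores.
speedup_parallel <- runtime[["thres_0.01"]] / runtime[["thres_0.01_4cores"]]

# Combined gain relative to the first scenario.
speedup_total <- runtime[["thres_0.05"]] / runtime[["thres_0.01_4cores"]]

round(c(threshold = speedup_threshold,
        parallel  = speedup_parallel,
        total     = speedup_total), 2)
# threshold  parallel     total
#      3.63      1.42      5.15
```

Most of the gain comes from the lower threshold. The 4-core run adds a further factor of about 1.4, which suggests the correlation step is only part of the remaining run time.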

The script and results are available here (https://github.com/shenyang1981/SingleR/tree/master/test)

Please have a look and kindly let me know your comments and feedback. Thanks.

Best, Yang

dviraran commented 5 years ago

Awesome. I'll incorporate it.

shenyang1981 commented 5 years ago

Hi Dviraran,

I am very glad that you accepted my request. I noticed one potential issue for large datasets. Because pbmcapply also depends on "future", it overwrites the global setting of 'future.globals.maxSize' with the default value of "max.vector.size" if users don't provide "max.vector.size" themselves. For large datasets, I usually set "future.globals.maxSize" to a larger value (because we have a big node here ^_^). To make sure my setting is not overwritten by pbmcapply, I have to set both variables at the beginning. Here is the code that I put in my script:

options('future.globals.maxSize' = 1000*1024^3)
options('max.vector.size' = 1000*1024^3)

Maybe you could mention this in the README. Please let me know if you need more information.
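As a sketch of what such a README note might show, both options can be set together at the top of a script. The helper name `set_size_limits` is hypothetical and not part of SingleR, future, or pbmcapply. The value is read as 1000 * 1024^3, on the assumption that the multiplication signs were lost from the snippet above. Whether pbmcapply interprets "max.vector.size" in the same units as "future.globals.maxSize" is not established in this thread, so the example simply mirrors the same number for both.

```r
# Hypothetical helper: set both options to the same value in one call so that
# neither setting is silently replaced by the other package's default.
set_size_limits <- function(size) {
  old <- options(future.globals.maxSize = size, max.vector.size = size)
  invisible(old)   # previous values, so the caller can restore them later
}

old <- set_size_limits(1000 * 1024^3)   # assumed reading of the value above

# Check that both options now hold the requested value.
getOption("future.globals.maxSize")
getOption("max.vector.size")

# To undo the change at the end of the script: options(old)
```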

Best Regards, Yang

On Mon, Jan 14, 2019 at 9:27 PM dviraran notifications@github.com wrote:

Merged #14 https://github.com/dviraran/SingleR/pull/14 into master.
