cflerin closed this pull request 4 years ago.
After some additional testing, it looks like this implementation is not ideal in terms of memory usage with larger matrices: it causes the expression matrix to be copied into each new process instead of being shared with the parent process. A better implementation can be found at https://github.com/aertslab/pySCENIC/pull/140, which replaces this with a stand-alone script that imports the relevant Arboreto/pySCENIC functions. The test results are the same as described above.
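The memory difference comes down to how data reaches the workers: arguments passed to each pool task are pickled and copied into every process, whereas data that already exists in the parent before a fork can be read copy-on-write. A minimal sketch of the shared-memory pattern (the names here are illustrative, not code from this PR; the fork start method is Unix-only):

```python
import multiprocessing as mp
import numpy as np

# Illustrative stand-in for a large expression matrix, created in the
# parent process before any worker processes exist.
EXPRESSION = np.random.default_rng(0).random((1000, 200))

def column_mean(j):
    # Under the fork start method, workers read the parent's EXPRESSION
    # through copy-on-write memory; only the integer j and the float
    # result are pickled per task, not the matrix itself.
    return float(EXPRESSION[:, j].mean())

ctx = mp.get_context("fork")  # fork is what gives the cheap sharing (Unix only)
with ctx.Pool(processes=4) as pool:
    means = pool.map(column_mean, range(EXPRESSION.shape[1]))
```

Under the default spawn start method (macOS/Windows), each worker would instead re-import the module and build its own copy of the matrix, which is exactly the kind of duplication described above.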
The ability to run Arboreto across multiple nodes with Dask is extremely powerful, but the implementation has caused many problems for me (and, it seems, for others). In many cases the Dask client misbehaved: it would sometimes seem to go on computing for days, or quit halfway through a run with a cryptic error.
In practice, I have only ever used a single node to run GRNBoost2, and it's still quite fast, even for tens to hundreds of thousands of cells. I therefore thought this multiprocessing implementation might be useful. I've been using it extensively, and it's quite reliable. In many cases the compute time is actually slightly shorter with multiprocessing (perhaps due to some Dask overhead?).
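As a rough picture of how a process pool can stand in for the Dask client here, the sketch below maps target genes over a pool and computes links for each target independently. This is a toy illustration, not the PR's actual code: the per-target regressor is replaced by a simple absolute-correlation score, and all names are made up.

```python
import multiprocessing as mp
import numpy as np

# Toy data: cells x genes, with a fixed seed.
rng = np.random.default_rng(42)
MATRIX = rng.random((300, 12))
GENES = [f"g{i}" for i in range(MATRIX.shape[1])]

def links_for_target(target_idx):
    # Stand-in for fitting one GRNBoost2/GENIE3 regressor per target gene;
    # here each candidate regulator is scored by |Pearson correlation|.
    y = MATRIX[:, target_idx]
    out = []
    for j, regulator in enumerate(GENES):
        if j == target_idx:
            continue
        importance = abs(float(np.corrcoef(MATRIX[:, j], y)[0, 1]))
        out.append((regulator, GENES[target_idx], importance))
    return out

ctx = mp.get_context("fork")  # Unix-only; lets workers share MATRIX
# `processes` plays the role of a worker-count parameter.
with ctx.Pool(processes=4) as pool:
    per_target = pool.map(links_for_target, range(len(GENES)))

# Flatten per-target results into one (regulator, target, importance) list.
network = [link for links in per_target for link in links]
```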
Summary of changes:
- Setting the `client_or_address` parameter to `'multiprocessing'` in either the `grnboost2` or `genie3` function will run these algorithms using a multiprocessing pool. The number of workers is specified with the `multiprocessing_workers` parameter.
- Added `run_arboreto_mp` to do the work of setting up a multiprocessing pool and calculating links for each target gene separately.
- Replaced `as_matrix` with `to_numpy` (minor fix).

As a check, the multiprocessing implementation produces the same results as when using Dask, using a fixed seed: