aertslab / arboreto

A scalable python-based framework for gene regulatory network inference using tree-based ensemble regressors.
BSD 3-Clause "New" or "Revised" License
54 stars 25 forks source link

Ability to run Arboreto using a multiprocessing pool in place of Dask #21

Closed cflerin closed 4 years ago

cflerin commented 4 years ago

The ability to run Arboreto across multiple nodes in Dask is extremely powerful, but the implementation has caused lots of issues for me (and others, it seems). In a lot of cases, I had massive issues with the Dask client -- it would sometimes seem to go on computing for days, or just quit halfway through a run with a cryptic error.

In practice, I have only ever used a single node to run GRNBoost2, and it's still quite fast, even for 10s to 100s of thousands of cells. Therefore, I thought this multiprocessing implementation might be useful. I've been using it extensively, and it's quite reliable. In many cases, the compute time is actually slightly shorter when using multiprocessing (perhaps due to some Dask overhead?).

Summary of changes:

As a check, the multiprocessing implementation produces the same results as when using Dask, using a fixed seed:

# test data:
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/allTFs_hg38.txt
wget https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/expr_mat.loom

pip install --force-reinstall git+https://github.com/cflerin/arboreto@multiprocessing
# python:
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
import loompy as lp
import pandas as pd

lf = lp.connect('expr_mat.loom', mode='r', validate=False )
ex_matrix = pd.DataFrame( lf[:,:], index=lf.ra.Gene, columns=lf.ca.CellID ).T
lf.close()

tf_names = load_tf_names('allTFs_hg38.txt')

# Dask test:
network = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names,
                    seed=777,
                    verbose=True)

>>> network.head()
        TF  target  importance
27   RPS4X   RPL30   57.537719
665   SPI1    CSTA   55.521603
27   RPS4X  EEF1A1   54.931686
27   RPS4X   RPS14   53.646867
692  RPL35    RPL3   52.932191

# multiprocessing test:
networkMP = grnboost2(expression_data=ex_matrix,
                    tf_names=tf_names,
                    client_or_address='multiprocessing',
                    multiprocessing_workers=7,
                    seed=777,
                    verbose=True)

>>> networkMP.head()
        TF  target  importance
27   RPS4X   RPL30   57.537719
665   SPI1    CSTA   55.521603
27   RPS4X  EEF1A1   54.931686
27   RPS4X   RPS14   53.646867
692  RPL35    RPL3   52.932191
cflerin commented 4 years ago

After some additional testing, it looks like this implementation is not ideal in terms of memory usage with larger matrices (it causes the expression matrix to be copied to each new process instead of using shared memory with the parent process). A better implementation can be found at https://github.com/aertslab/pySCENIC/pull/140, and replaces this with a stand-alone script which imports the relevant Arboreto/pySCENIC functions. The test results are the same as described above.