aertslab / arboreto

A scalable python-based framework for gene regulatory network inference using tree-based ensemble regressors.
BSD 3-Clause "New" or "Revised" License

grnboost2 fails on large datasets #41

Open dmalzl opened 1 month ago

dmalzl commented 1 month ago

Hi,

I am currently trying to use grnboost2 to infer GRNs from a dataset of around 120k cells, but I am unable to get it to run due to a hard limit imposed by a dependency of the dask distributed package (see here). In brief, dask cannot serialise a dataset (data chunk) larger than 4GB; anything above this results in the following error:

```
distributed.protocol.core - CRITICAL - Failed to Serialize
ValueError: bytes object is too large
```

To circumvent this, the dask developers suggest moving data generation into a separate task so that the workers generate (or load) their data locally. A workaround would therefore be to accept paths to the data files and move the reading into the workers, so that only a couple of strings need to be serialised instead of the whole dataset. A rough sketch of what I mean follows below.
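Something along these lines (untested sketch of the pattern from the dask docs; `load_expression_chunk`, the parquet layout, and the file names are just placeholders, not anything arboreto currently exposes):

```python
import pandas as pd
from dask.distributed import Client

def load_expression_chunk(path):
    # Runs on the worker: only `path` (a short string) crosses the wire;
    # the actual expression data is read locally by the worker.
    return pd.read_parquet(path)

if __name__ == "__main__":
    client = Client()  # local cluster, just for demonstration

    # Each future holds a chunk materialised on a worker; the full matrix
    # is never serialised from the client process, so the 4GB limit only
    # applies to the individual chunks, not the whole dataset.
    futures = [
        client.submit(load_expression_chunk, p)
        for p in ["chunk_0.parquet", "chunk_1.parquet"]
    ]
```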

I know this may be a bit more to think about, especially figuring out the best strategy (e.g. a mechanism that creates the data chunks ahead of time, writes them to files, and then lets the workers read them back in; see the second sketch below), but it may be worthwhile in order to support larger datasets.
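For illustration, a minimal sketch of the ahead-of-time chunking I have in mind; `write_chunks`, the chunk count, and the choice of parquet are all assumptions on my part:

```python
import numpy as np
import pandas as pd

def write_chunks(expression_df, n_chunks, prefix="chunk"):
    # Split the (cells x genes) matrix row-wise, write each chunk to its
    # own parquet file, and keep only the paths. Only these short strings
    # would ever need to be serialised and shipped to the workers.
    paths = []
    for i, idx in enumerate(np.array_split(expression_df.index, n_chunks)):
        path = f"{prefix}_{i}.parquet"
        expression_df.loc[idx].to_parquet(path)
        paths.append(path)
    return paths
```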