aertslab / arboreto

A scalable python-based framework for gene regulatory network inference using tree-based ensemble regressors.
BSD 3-Clause "New" or "Revised" License

grnboost2 TypeError: Must supply at least one delayed object #42

Open anna4kaa opened 2 weeks ago

anna4kaa commented 2 weeks ago

Hi!

GRNBoost2 raises an error at the very last step, and the same happens with GENIE3. It seems to be a problem with Dask, but I could not figure out what is going on.

The code:

import os
import pandas as pd
from distributed import Client, LocalCluster
from arboreto.algo import grnboost2, genie3
from arboreto.utils import load_tf_names

in_file = '/Users/annasve/Desktop/data/transcriptomics/output/PyWGCNA/NBC_00001/log_tpm.csv'
tf_file = '/Users/annasve/Desktop/data/transcriptomics/output/arboreto/output/NBC_00001/tf_list.csv'

ex_matrix = pd.read_csv(in_file, index_col=0)
tf_names = load_tf_names(tf_file)

network = grnboost2(expression_data=ex_matrix, tf_names=tf_names, verbose=True)

The error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[15], line 9
      6 tf_names = load_tf_names(tf_file)
      8 # Run GRNBoost2 with explicitly provided gene_names and tf_names
----> 9 network = grnboost2(expression_data=ex_matrix, tf_names=tf_names, verbose = True)
     11 network.to_csv(out_file)

File ~/anaconda3/envs/arboreto/lib/python3.11/site-packages/arboreto/algo.py:39, in grnboost2(expression_data, gene_names, tf_names, client_or_address, early_stop_window_length, limit, seed, verbose)
     10 def grnboost2(expression_data,
     11               gene_names=None,
     12               tf_names='all',
   (...)
     16               seed=None,
     17               verbose=False):
     18     """
     19     Launch arboreto with [GRNBoost2] profile.
     20 
   (...)
     36     :return: a pandas DataFrame['TF', 'target', 'importance'] representing the inferred gene regulatory links.
     37     """
---> 39     return diy(expression_data=expression_data, regressor_type='GBM', regressor_kwargs=SGBM_KWARGS,
     40                gene_names=gene_names, tf_names=tf_names, client_or_address=client_or_address,
     41                early_stop_window_length=early_stop_window_length, limit=limit, seed=seed, verbose=verbose)

File ~/anaconda3/envs/arboreto/lib/python3.11/site-packages/arboreto/algo.py:120, in diy(expression_data, regressor_type, regressor_kwargs, gene_names, tf_names, client_or_address, early_stop_window_length, limit, seed, verbose)
    117 if verbose:
    118     print('creating dask graph')
--> 120 graph = create_graph(expression_matrix,
    121                      gene_names,
    122                      tf_names,
    123                      client=client,
    124                      regressor_type=regressor_type,
    125                      regressor_kwargs=regressor_kwargs,
    126                      early_stop_window_length=early_stop_window_length,
    127                      limit=limit,
    128                      seed=seed)
    130 if verbose:
    131     print('{} partitions'.format(graph.npartitions))

File ~/anaconda3/envs/arboreto/lib/python3.11/site-packages/arboreto/core.py:450, in create_graph(expression_matrix, gene_names, tf_names, regressor_type, regressor_kwargs, client, target_genes, limit, include_meta, early_stop_window_length, repartition_multiplier, seed)
    448 # gather the DataFrames into one distributed DataFrame
    449 all_links_df = from_delayed(delayed_link_dfs, meta=_GRN_SCHEMA)
--> 450 all_meta_df = from_delayed(delayed_meta_dfs, meta=_META_SCHEMA)
    452 # optionally limit the number of resulting regulatory links, descending by top importance
    453 if limit:

File ~/anaconda3/envs/arboreto/lib/python3.11/site-packages/dask_expr/io/_delayed.py:115, in from_delayed(dfs, meta, divisions, prefix, verify_meta)
    112     dfs = [dfs]
    114 if len(dfs) == 0:
--> 115     raise TypeError("Must supply at least one delayed object")
    117 if meta is None:
    118     meta = delayed(make_meta)(dfs[0]).compute()

TypeError: Must supply at least one delayed object

nsapoval commented 1 day ago

Hi,

I ran into the same issue recently while trying to run grnboost2 from a Python 3.12 conda environment with the default versions of dask and distributed.

I found a thread about the same bug in the pySCENIC GitHub issues: #561. It appears to be caused by changes, possibly recent, in the dask/distributed packages. The fix proposed in that thread is to install the following versions: dask-expr==0.5.3 distributed==2024.2.1. I tried this in a Python 3.12 environment, but it led to the same error as reported here and in the pySCENIC thread.
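To make the failure mode concrete: the traceback above ends in a length check inside dask_expr's from_delayed, which raises when it receives an empty list. The sketch below is a pure-Python stand-in that mirrors only that check, not the real dask code. The names from_delayed_stub and try_build are hypothetical, and the idea that arboreto passes an empty delayed_meta_dfs list is inferred from the traceback rather than confirmed.

```python
def from_delayed_stub(dfs, meta=None):
    """Hypothetical stand-in mirroring the guard shown in the traceback."""
    # The traceback shows `dfs = [dfs]` on line 112; the exact condition
    # guarding it is an assumption made for this sketch.
    if not isinstance(dfs, (list, tuple)):
        dfs = [dfs]
    # Same condition and message as lines 114-115 of the traceback.
    if len(dfs) == 0:
        raise TypeError("Must supply at least one delayed object")
    return list(dfs)


def try_build(dfs):
    """Return the error message if the guard fires, otherwise None."""
    try:
        from_delayed_stub(dfs)
    except TypeError as exc:
        return str(exc)
    return None


# Assumption: the list of delayed meta DataFrames is empty at this point.
delayed_meta_dfs = []
print(try_build(delayed_meta_dfs))
```

If this reading is right, the error depends on the list being empty rather than on the expression data itself, which would fit the same failure appearing for both GRNBoost2 and GENIE3.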

I then rebuilt the environment with Python 3.10.15 and dask-expr==0.5.3 distributed==2024.2.1. With this change, the code ran to completion. Hopefully this helps other users who encounter the same issue.

tl;dr: Python 3.10.15 + dask-expr==0.5.3 distributed==2024.2.1 works fine; newer versions of Python, dask, and distributed lead to the bug above.
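For anyone comparing their own environment against the combination above, here is a minimal sketch that uses only the standard library. The helper names (installed_version, compare_versions) and the REPORTED_WORKING dictionary are made up for illustration; the pinned versions are the ones reported as working in this thread, not an official compatibility list.

```python
from importlib.metadata import PackageNotFoundError, version

# Combination reported as working in this thread (with Python 3.10.15).
REPORTED_WORKING = {"dask-expr": "0.5.3", "distributed": "2024.2.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package name to (installed, expected, matches)."""
    report = {}
    for package, wanted in expected.items():
        found = lookup(package)
        report[package] = (found, wanted, found == wanted)
    return report


if __name__ == "__main__":
    import sys

    print("python", sys.version.split()[0])
    for package, (found, wanted, ok) in compare_versions(REPORTED_WORKING).items():
        status = "matches" if ok else "differs"
        print(f"{package}: installed={found} reported-working={wanted} ({status})")
```

Pasting the printed lines into a bug report makes it easier to see which environments hit the error.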