aertslab / arboreto

A scalable python-based framework for gene regulatory network inference using tree-based ensemble regressors.
BSD 3-Clause "New" or "Revised" License

error of 'distributed' when running GRNboost on server without internet connection #8

Open WeiCSong opened 6 years ago

WeiCSong commented 6 years ago

Hi arboreto author, I'm trying to run GRNboost on a supercomputer server that has no internet connection. My code:

    import pandas as pd
    from arboreto.utils import load_tf_names
    from arboreto.algo import grnboost2

    if __name__ == '__main__':
        in_file = '1.1_exprMatrix_filtered_t.txt'
        tf_file = '1.2_inputTFs.txt'
        out_file = 'net1_grn_output.tsv'

        # Load the expression matrix and the list of transcription factors.
        ex_matrix = pd.read_csv(in_file, sep='\t')
        tf_names = load_tf_names(tf_file)

        # Infer the network and write it out as a TSV file.
        network = grnboost2(expression_data=ex_matrix, tf_names=tf_names)
        network.to_csv(out_file, sep='\t', index=False, header=False)

pandas and arboreto were installed successfully before I submitted this task. I got the following error message:

/lustre/home/acct-bmelgn/bmelgn-3/.conda/envs/mypython3/lib/python3.7/site-packages/distributed/utils.py:134: RuntimeWarning: Couldn't detect a suitable IP address for reaching '8.8.8.8', defaulting to '127.0.0.1': [Errno 101] Network is unreachable

I followed the example at https://arboreto.readthedocs.io/en/latest/ which does not import 'distributed'. But the error message seems to say that 'distributed' is trying to connect to the internet. Can 'distributed' be avoided when I run GRNboost? Are there any suggestions for running arboreto on a server like this? Thanks for your help.

PS: At first I followed the example page (https://arboreto.readthedocs.io/en/latest/examples.html), which does import 'distributed'. Now I am using the code listed above (which also comes from your examples), and it seems to have nothing to do with 'distributed'.

tmoerman commented 6 years ago

Hi @goubegou, thanks for raising this issue. This might indeed be annoying for multiple users.

The current implementation uses distributed even when no explicit Client is specified: implicitly, a Client connected to a LocalCluster instance is used. A while ago, I filed an issue as a reminder for a future improvement to decouple arboreto from distributed and use the dask multiprocessing scheduler on a single node instead.
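To make the implicit behaviour visible, here is a minimal sketch that creates the LocalCluster and Client explicitly and hands the client to grnboost2 through its `client_or_address` argument. The function name `run_grnboost2_with_local_client` and the worker count are made up for illustration. Binding the cluster to `host="127.0.0.1"` is an assumption about what suits a node without outside connectivity, and whether it silences the warning depends on the distributed version.

```python
def run_grnboost2_with_local_client(ex_matrix, tf_names, n_workers=4):
    """Run grnboost2 on an explicitly created single-node cluster.

    Hypothetical helper, not part of arboreto. The imports are local so
    that this module can be loaded even where the packages are missing.
    """
    from distributed import Client, LocalCluster
    from arboreto.algo import grnboost2

    # Assumption: keep scheduler and workers on the loopback interface.
    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=1,
        host="127.0.0.1",
    )
    client = Client(cluster)
    try:
        # Pass the explicit client instead of relying on the default.
        return grnboost2(
            expression_data=ex_matrix,
            tf_names=tf_names,
            client_or_address=client,
        )
    finally:
        # Always release the workers, even if inference fails.
        client.close()
        cluster.close()
```

Call it from inside an `if __name__ == '__main__':` block, as in the script above, since the cluster starts worker processes.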

Does this error crash the program completely or does it continue despite the error? I'll look into it when I find the time.

PS: a similar issue has been filed on the distributed GitHub page.
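For context on the message itself: it is a RuntimeWarning, and its text says that the code falls back to '127.0.0.1'. The sketch below is a rough standard-library reconstruction of that kind of IP detection, not a copy of distributed's code. The function name `detect_local_ip` and its parameters are hypothetical. Connecting a UDP socket sends no packets; it only asks the operating system which local address it would route from, and that lookup fails on a node with no route to the outside.

```python
import socket
import warnings


def detect_local_ip(probe_host="8.8.8.8", port=80, fallback="127.0.0.1"):
    """Return the local IP the OS would use to reach probe_host.

    Hypothetical helper for illustration. If no route exists, warn and
    return the fallback address instead of raising.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No data is sent for UDP; this only selects a local address.
        sock.connect((probe_host, port))
        return sock.getsockname()[0]
    except OSError as exc:
        warnings.warn(
            f"Couldn't detect a suitable IP address for reaching "
            f"{probe_host!r}, defaulting to {fallback!r}: {exc}",
            RuntimeWarning,
        )
        return fallback
    finally:
        sock.close()


if __name__ == "__main__":
    print(detect_local_ip())
```

On a machine without outside connectivity this prints the fallback address after emitting the warning, and execution continues.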