Murali-group / Beeline

BEELINE: evaluation of algorithms for gene regulatory network inference
GNU General Public License v3.0

BLRunner stuck on "OSError: Timed out trying to connect to 'inproc://172.17.0.2/10/1'" while running arboreto #48

Closed Oakento closed 3 years ago

Oakento commented 3 years ago

Hi, I was trying to run grnbeeline/arboreto:base through BLRunner.py, which launches the following command.

docker run --rm -v /home/abc/projects/Beeline:/data/ --expose=41269 grnbeeline/arboreto:base /bin/sh -c "time -v -o data/outputs/Synthetic/dyn-LI/dyn-LI-100-1/GENIE3/time.txt python runArboreto.py --algo=GENIE3 --inFile=data/inputs/Synthetic/dyn-LI/dyn-LI-100-1/GENIE3/ExpressionData.csv --outFile=data/outputs/Synthetic/dyn-LI/dyn-LI-100-1/GENIE3/outFile.txt "

However, an error occurred and the program got stuck.

Task exception was never retrieved
future: <Task finished coro=<connect.<locals>._() done, defined at /opt/conda/lib/python3.7/site-packages/distributed/comm/core.py:288> exception=CommClosedError()>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 297, in _
    handshake = await asyncio.wait_for(comm.read(), 1)
  File "/opt/conda/lib/python3.7/asyncio/tasks.py", line 435, in wait_for
    await waiter
concurrent.futures._base.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 304, in _
    raise CommClosedError() from e
distributed.comm.core.CommClosedError
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f4a22da7250>>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py:320> exception=OSError("Timed out trying to connect to 'inproc://172.17.0.2/10/1' after 10 s: Timed out trying to connect to 'inproc://172.17.0.2/10/1' after 10 s: connect() didn't finish in time")>)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 322, in connect
    _raise(error)
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 275, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'inproc://172.17.0.2/10/1' after 10 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py", line 401, in _close
    await self._correct_state()
  File "/opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py", line 328, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 810, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 772, in live_comm
    **self.connection_args,
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 334, in connect
    _raise(error)
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 275, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'inproc://172.17.0.2/10/1' after 10 s: Timed out trying to connect to 'inproc://172.17.0.2/10/1' after 10 s: connect() didn't finish in time

The error is intermittent: across multiple attempts it shows up at different points in the run.

Additionally, the containers are running under docker's bridge network.

adyprat commented 3 years ago

Hi, the program should keep running despite that "tornado.application - ERROR" message inside the Docker container. According to the authors of Arboreto, you can ignore those errors (https://github.com/aertslab/arboreto/issues/10). As long as the container is still running, the algorithm should be running too. I'm assuming you are trying to run GENIE3 on a large-ish dataset (thousands of genes?), which will take a while to complete. If the container exits without producing any output, let me know. Best, Aditya
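
One way to act on the advice above is to check, once the container has exited, whether the result file was actually written. The following is only a minimal sketch: `has_output` is a hypothetical helper, not part of BEELINE or Arboreto, and it assumes the output path from the original command as seen from the host-side checkout.

```python
from pathlib import Path


def has_output(out_file: str) -> bool:
    """Return True if out_file exists and contains at least one byte.

    Hypothetical helper: it only checks that something was written,
    not that the file is a valid GENIE3 edge list.
    """
    path = Path(out_file)
    return path.is_file() and path.stat().st_size > 0


# Example usage (path taken from the docker command in this issue,
# relative to the directory mounted as /data/; adjust for your setup):
# has_output("outputs/Synthetic/dyn-LI/dyn-LI-100-1/GENIE3/outFile.txt")
```

If this returns False after the container has exited, the run produced nothing and the error was probably not just shutdown noise.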

smartpig-666 commented 1 year ago

I encountered the same problem. How did you solve it in the end?

distributed.comm.inproc - WARNING - Closing dangling queue in
Traceback (most recent call last):
  File "runArboreto.py", line 43, in
    main(sys.argv)
  File "runArboreto.py", line 32, in main
    network = genie3(inDF.to_numpy(), client_or_address = client, gene_names = inDF.columns)
  File "/opt/conda/lib/python3.7/site-packages/arboreto/algo.py", line 73, in genie3
    limit=limit, seed=seed, verbose=verbose)
  File "/opt/conda/lib/python3.7/site-packages/arboreto/algo.py", line 135, in diy
    .compute(graph, sync=True) \
  File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 2919, in compute
    result = self.gather(futures)
  File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 1993, in gather
    asynchronous=asynchronous,
  File "/opt/conda/lib/python3.7/site-packages/distributed/client.py", line 834, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/opt/conda/lib/python3.7/site-packages/distributed/utils.py", line 339, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.7/site-packages/distributed/utils.py", line 323, in f
    result[0] = yield future
  File "/opt/conda/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
concurrent.futures._base.CancelledError
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb07674df90>>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py:320> exception=OSError("Timed out trying to connect to 'inproc://172.17.0.2/9/1' after 10 s: Timed out trying to connect to 'inproc://172.17.0.2/9/1' after 10 s: connect() didn't finish in time")>)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 322, in connect
    _raise(error)
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 275, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'inproc://172.17.0.2/9/1' after 10 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py", line 401, in _close
    await self._correct_state()
  File "/opt/conda/lib/python3.7/site-packages/distributed/deploy/spec.py", line 328, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 810, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/opt/conda/lib/python3.7/site-packages/distributed/core.py", line 772, in live_comm
    **self.connection_args,
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 334, in connect
    _raise(error)
  File "/opt/conda/lib/python3.7/site-packages/distributed/comm/core.py", line 275, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'inproc://172.17.0.2/9/1' after 10 s: Timed out trying to connect to 'inproc://172.17.0.2/9/1' after 10 s: connect() didn't finish in time
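
Unlike the log in the original report, this one contains stack frames inside runArboreto.py itself, ending in a CancelledError raised from the genie3 call, which suggests the computation failed rather than only the cluster shutdown. As a rough heuristic for telling the two cases apart in a saved log, one could look for frames that point at the script. This is just a sketch under that assumption; `failed_inside_script` is a hypothetical name and nothing here comes from BEELINE or Arboreto.

```python
import re


def failed_inside_script(log_text: str, script: str = "runArboreto.py") -> bool:
    """Return True if any traceback frame in log_text points at `script`.

    Heuristic only: frames in the script suggest the run itself raised,
    whereas a log with only tornado/distributed frames is more likely
    the shutdown noise described earlier in this thread.
    """
    # Match 'File "<optional dirs/>runArboreto.py", line N'
    pattern = re.compile(r'File "(?:[^"]*/)?' + re.escape(script) + r'", line \d+')
    return bool(pattern.search(log_text))
```

Because it uses a plain search, this works on a log whose line breaks were lost when pasting as well as on a normally formatted one.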