CityBrainChallenge / KDDCup2021-CityBrainChallenge-starter-kit


Aborted (core dumped) when connecting to Ray cluster #54

Open · Enoch2090 opened this issue 3 years ago

Enoch2090 commented 3 years ago

When running the demo rllib_train.py (or any Ray/Tune script) in the Docker environment, it gives

/usr/local/lib/python3.7/dist-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  "update your install command.", FutureWarning)
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
The cfg files of this training    ['/starter-kit/cfg/simulator_round3_flow0.cfg']
2021-06-21 06:45:48,737 INFO worker.py:641 -- Connecting to existing Ray cluster at address: 172.20.0.XX:XX (our ip addr)
Aborted (core dumped)
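
For reference, the script dies right at the connection step. A minimal sketch of that step (the address below is a placeholder for our cluster head; rllib_train.py presumably does the equivalent through its own config):

import ray

# Hypothetical minimal reproduction of the connection step that aborts.
# The address is a placeholder for the cluster head printed in the log above.
ray.init(address="172.20.0.XX:XX")

# If the connection succeeds, this prints the aggregate cluster resources.
print(ray.cluster_resources())

Note that Aborted (core dumped) is a native abort rather than a Python exception, so wrapping ray.init in try/except would not catch it; the failure appears to be on the cluster side, not in the script.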

Running the command ray status further gives

======== Cluster status: 2021-06-18 10:19:03.171533 ========
Node status
------------------------------------------------------------
 1 node(s) with resources: {'memory': 50757456896.0, 'object_store_memory': 10000000000.0, 'CPU': 32.0, 'node:172.20.XX.XX': 1.0}
 1 node(s) with resources: {'object_store_memory': 10000000000.0, 'CPU': 72.0, 'node:172.20.XX.XX': 1.0, 'memory': 191201684480.0}
 1 node(s) with resources: {'node:172.20.XX.XX': 1.0, 'object_store_memory': 10000000000.0, 'CPU': 72.0, 'memory': 191201696768.0}
 1 node(s) with resources: {'memory': 191201438720.0, 'object_store_memory': 10000000000.0, 'CPU': 72.0, 'node:172.20.XX.XX': 1.0}
 1 node(s) with resources: {'object_store_memory': 10000000000.0, 'CPU': 32.0, 'node:172.17.XX.XX': 1.0, 'memory': 57459674112.0}

Resources
------------------------------------------------------------
Usage:
 0.0/280.0 CPU
 0.00/634.996 GiB memory
 0.00/46.566 GiB object_store_memory

Demands:
 (no resource demands)
The autoscaler failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/monitor.py", line 284, in run
    self._run()
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/monitor.py", line 175, in _run
    self.update_load_metrics()
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/monitor.py", line 140, in update_load_metrics
    request, timeout=4)
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.DEADLINE_EXCEEDED
        details = "Deadline Exceeded"
        debug_error_string = "{"created":"@1624011577.095924301","description":"Error received from peer ipv4:172.20.0.48:40499","file":"src/core/lib/surface/call.cc","file_line":1066,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
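
For what it's worth, the traceback shows the autoscaler's monitor timing out on a gRPC call to 172.20.0.48:40499 (StatusCode.DEADLINE_EXCEEDED). Here is a quick, hypothetical probe of that endpoint from inside the container (not part of the starter kit), just to see whether anything is listening there:

import socket

# Endpoint taken from the debug_error_string in the traceback above.
HOST, PORT = "172.20.0.48", 40499

try:
    # create_connection raises OSError on refusal or timeout.
    with socket.create_connection((HOST, PORT), timeout=4):
        print("port reachable; the process behind it may be hung")
except OSError as e:
    print(f"unreachable ({e}); that node's Ray processes are likely down")

If the connect itself fails, the Ray processes on that node are probably dead and only a cluster-side restart will help.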

Looks like the cluster needs a restart? Not sure what to do with this...

Kanstarry9T commented 3 years ago

Are you a member or leader of the team "two_slices_of_bread_with_cheese"? Please provide your team name so we can help solve this problem.

Enoch2090 commented 3 years ago

> Are you a member or leader of the team "two_slices_of_bread_with_cheese"? Please provide your team name so we can help solve this problem.

I am from team IntelligentLight, thanks!

Kanstarry9T commented 3 years ago

We have restarted your computing cluster. Thanks!

Enoch2090 commented 3 years ago

> We have restarted your computing cluster. Thanks!

Sorry, I think the cluster has crashed again... This time our server is also unreachable. Please help us reboot them. Thanks in advance!

Kanstarry9T commented 3 years ago

We have fixed the crashed cluster; you can connect to it now.