aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0
5.16k stars 316 forks

When the aim server is killed, run.close() on the client hangs #3067

Open yanxiaod123 opened 9 months ago

yanxiaod123 commented 9 months ago

🐛 Bug

At first, the aim server was running properly. While the client script was running, the server was killed. At the end, when the run was closed, the following error was raised and the client hung:

Exception in thread Thread-380:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/aim/ext/cleanup/__init__.py", line 87, in _cleanup
    finalizer()
  File "/usr/lib/python3.8/weakref.py", line 566, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/usr/local/lib/python3.8/dist-packages/aim/ext/transport/remote_resource.py", line 14, in _close
    self.rpc_client.release_resource(self.handler)
  File "/usr/local/lib/python3.8/dist-packages/aim/ext/transport/client.py", line 243, in release_resource
    response = self.remote.release_resource(request, metadata=self._request_metadata)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-12-18T19:27:56.563282663+08:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2023-12-18T19:27:56.563248824+08:00", grpc_status:14}]}"

import aim
import time

run = aim.Run(repo="aim://****", log_system_params=True)

for i in range(100):
    print("current id is ", i)
    run["test"] = "test_" + str(i)
    time.sleep(2)

print("start close")
run.close()
print("complete close")
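Until the hang itself is fixed, a client-side guard might keep the script above from blocking forever. The following is only a sketch, assuming the hang happens inside the run.close() call itself. The helper name call_with_timeout and the 10-second limit are hypothetical and not part of aim. It has not been verified against aim's own cleanup threads or exit handlers, which may still block at interpreter shutdown.

```python
import threading


def call_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread.

    Returns True if fn() finished (or raised) within timeout_s seconds,
    False if it was still running when the timeout expired.
    Hypothetical helper, not part of aim.
    """
    done = threading.Event()

    def _target():
        try:
            fn()
        except Exception as exc:
            # For example, a gRPC UNAVAILABLE error once the server is gone.
            print("call failed:", exc)
        finally:
            done.set()

    # A daemon thread does not keep the interpreter alive on exit.
    worker = threading.Thread(target=_target, daemon=True)
    worker.start()
    return done.wait(timeout_s)
```

In the reproduction script, the plain run.close() call would then become something like `if not call_with_timeout(run.close, 10): print("run.close() did not return within 10 s")`. The abandoned thread is not cancelled, so this only unblocks the caller. It does not release whatever run.close() is waiting on.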

Environment

Additional context