vishalghor opened 1 year ago
thanks @vishalghor. looking into it.
@gorarakelyan I have since restricted tracking specifically to rank 0 (which I wasn't doing earlier) to rule out issues caused by not setting it. After that change, I still get this message when using distributed training:
E1011 21:11:42.654337711 318 chttp2_transport.cc:1111] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
(the line above is repeated seven more times with successive timestamps)
E1011 21:11:42.654526841 318 client_channel.cc:647] chand=0x557868333fe0: Illegal keepalive throttling value 9223372036854775807
Remote Server is unavailable, please check network connection: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "{"created":"@1665522702.654717990","description":"Error received from peer ipv4:10.212.208.5:6006","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Socket closed","grpc_status":14}"
>, attempt: 1
Exception in thread Thread-2:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/user/.local/lib/python3.7/site-packages/aim/ext/transport/rpc_queue.py", line 48, in worker
if self._try_exec_task(task_f, *args):
File "/home/user/.local/lib/python3.7/site-packages/aim/ext/transport/rpc_queue.py", line 69, in _try_exec_task
raise e
File "/home/user/.local/lib/python3.7/site-packages/aim/ext/transport/rpc_queue.py", line 65, in _try_exec_task
task_f(*args)
File "/home/user/.local/lib/python3.7/site-packages/aim/ext/transport/client.py", line 150, in _run_write_instructions
response = self.remote.run_write_instructions(message_stream_generator())
File "/home/user/.local/lib/python3.7/site-packages/grpc/_channel.py", line 1131, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/home/user/.local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception iterating requests!"
debug_error_string = "None"
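For reference, the rank-0-only tracking described above can be sketched as follows. This is a minimal illustration, assuming PyTorch-style distributed launchers that export the `RANK` environment variable; the server address is the one appearing in the logs, and the helper names (`create_run`, `track`) are made up for this sketch rather than part of Aim's API.

```python
import os


def is_rank_zero() -> bool:
    # torchrun / torch.distributed launchers export RANK;
    # default to 0 so single-process runs still track.
    return int(os.environ.get("RANK", "0")) == 0


def create_run(server: str = "aim://10.212.208.5:6006"):
    """Create an Aim Run on rank 0 only; all other ranks get None.

    The server address is taken from the logs above; substitute your own.
    """
    if not is_rank_zero():
        return None
    from aim import Run  # imported here so non-tracking ranks never open a gRPC channel
    return Run(repo=server)


def track(run, value, name, step):
    # No-op on ranks that did not create a Run, so training code
    # can call this unconditionally from every rank.
    if run is not None:
        run.track(value, name=name, step=step)
```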
@vishalghor do you track logs from the worker nodes as well during distributed tracking?
@gorarakelyan no, I am tracking only from node0/rank0 during distributed training. Also, I have seen the error on single-GPU training as well, and I believe the issue could be similar to https://github.com/aimhubio/aim/issues/1297. Maybe I haven't mentioned it earlier, but I'm using the Aim remote server for tracking experiments (https://aimstack.readthedocs.io/en/latest/using/remote_tracking.html).
@vishalghor thanks. @mihran113 is currently investigating the issue. We will share more details and progress here soon.
Hey @vishalghor. It is apparently the same issue that you've mentioned. We've struggled a lot to reproduce it on our side, with no luck so far. Could you please try the suggestion in https://github.com/aimhubio/aim/issues/1297#issuecomment-1098978362 and see if it helps, so it won't block you further? In the meantime I'll keep trying to reproduce it and find a fix.
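As background on the `GOAWAY ... too_many_pings` lines: a gRPC server sends `ENHANCE_YOUR_CALM` when the client pings more often than the server permits. Independent of whatever the linked suggestion proposes, one generic client-side mitigation is to relax the keepalive channel arguments. The option names below are standard gRPC channel arguments, but whether Aim exposes a hook to pass them through is an assumption; this is a sketch, not Aim's actual internals.

```python
# Hypothetical client-side keepalive settings that reduce ping pressure
# on the server. Option names are standard gRPC channel arguments.
KEEPALIVE_OPTIONS = [
    ("grpc.keepalive_time_ms", 60_000),          # ping at most once per minute
    ("grpc.keepalive_timeout_ms", 20_000),       # wait 20 s for a ping ack
    ("grpc.http2.max_pings_without_data", 0),    # don't cap pings on idle streams
    ("grpc.keepalive_permit_without_calls", 1),  # allow keepalive between RPCs
]


def make_channel(target: str):
    # Imported lazily so the sketch loads even without grpcio installed.
    import grpc
    return grpc.insecure_channel(target, options=KEEPALIVE_OPTIONS)
```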
@mihran113 I tried those suggestions, but they didn't resolve the issue. Thank you for looking into this.
Hi @vishalghor! The latest version of Aim uses a different protocol to send data to the remote server (based on HTTP and websockets). Could you please try it out? This issue should certainly be gone, as gRPC is not used anymore; we just want to make sure the new version works fine in your setup.
🐛 Bug
I am trying to use the Aim remote server to track experiments. I can use the Aim remote server without any issues when training on a single GPU, but I get an RPC error when using distributed training.
To reproduce
Run distributed training while tracking to a remote Aim server.
Expected behavior
Training metrics for the distributed run are logged to the remote Aim server, the same as in single-GPU training.
Environment
Additional context
N/A