aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0

InactiveRpcError during distributed training #2259

Open vishalghor opened 1 year ago

vishalghor commented 1 year ago

🐛 Bug

I am trying to use the Aim remote server to track experiments. It works without any issues when training on a single GPU, but I get an RPC error when using distributed training.

E1006 20:00:32.140177015    1201 chttp2_transport.cc:1111]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1006 20:00:32.145466358    1201 client_channel.cc:647]      chand=0x55b608ae9f80: Illegal keepalive throttling value 9223372036854775807
Remote Server is unavailable, please check network connection: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1665086432.145674191","description":"Error received from peer ipv4:10.212.208.5:6006","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Socket closed","grpc_status":14}"
>, attempt: 1
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/prefect/.local/lib/python3.7/site-packages/aim/ext/transport/rpc_queue.py", line 48, in worker
    if self._try_exec_task(task_f, *args):
  File "/home/prefect/.local/lib/python3.7/site-packages/aim/ext/transport/rpc_queue.py", line 69, in _try_exec_task
    raise e
  File "/home/prefect/.local/lib/python3.7/site-packages/aim/ext/transport/rpc_queue.py", line 65, in _try_exec_task
    task_f(*args)
  File "/home/prefect/.local/lib/python3.7/site-packages/aim/ext/transport/client.py", line 150, in _run_write_instructions
    response = self.remote.run_write_instructions(message_stream_generator())
  File "/home/prefect/.local/lib/python3.7/site-packages/grpc/_channel.py", line 1131, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/prefect/.local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Exception iterating requests!"
    debug_error_string = "None"
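
For context on the `too_many_pings` GOAWAY in the log above: in gRPC, a server sends this when a client's keepalive pings arrive more often than the server's ping policy allows. The sketch below lists the standard gRPC channel arguments involved and a simplified compatibility check. The values are illustrative, `keepalive_is_compatible` is a hypothetical helper, and nothing in this thread shows which settings Aim's transport actually uses or confirms that tuning them resolves this issue.

```python
# Standard gRPC core channel-argument keys; the values are illustrative assumptions.
CLIENT_OPTIONS = [
    ("grpc.keepalive_time_ms", 60_000),          # send a keepalive ping every 60 s
    ("grpc.keepalive_timeout_ms", 20_000),       # wait 20 s for the ping ack
    ("grpc.keepalive_permit_without_calls", 0),  # no pings while no RPC is active
    ("grpc.http2.max_pings_without_data", 2),    # cap pings sent without data frames
]

SERVER_OPTIONS = [
    # Minimum interval the server tolerates between pings when no data is sent.
    ("grpc.http2.min_ping_interval_without_data_ms", 30_000),
    # Number of bad pings tolerated before the server sends GOAWAY.
    ("grpc.http2.max_ping_strikes", 2),
]


def keepalive_is_compatible(client_options, server_options):
    """Simplified check: the client must not ping faster than the server permits."""
    client = dict(client_options)
    server = dict(server_options)
    return (
        client["grpc.keepalive_time_ms"]
        >= server["grpc.http2.min_ping_interval_without_data_ms"]
    )


# With the grpc package installed, the lists would be passed as
#   grpc.insecure_channel(target, options=CLIENT_OPTIONS)
#   grpc.server(executor, options=SERVER_OPTIONS)
```

If the client interval is shorter than the server minimum, each early ping counts as a strike, and exceeding the strike limit closes the connection with the GOAWAY seen in the log.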

To reproduce

Run distributed training while tracking to a remote Aim server.

Expected behavior

Training metrics from the distributed run are logged to the remote Aim server, the same as in single-GPU training.

Environment

Additional context

N/A

gorarakelyan commented 1 year ago

thanks @vishalghor. looking into it.

vishalghor commented 1 year ago

@gorarakelyan I have now explicitly restricted tracking to rank 0 (which I wasn't doing earlier) to rule out any issues caused by not setting it. After that change, I currently get this message when using distributed training.

E1011 21:11:42.654337711     318 chttp2_transport.cc:1111]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1011 21:11:42.654394046     318 chttp2_transport.cc:1111]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1011 21:11:42.654412666     318 chttp2_transport.cc:1111]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1011 21:11:42.654437563     318 chttp2_transport.cc:1111]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1011 21:11:42.654451745     318 chttp2_transport.cc:1111]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1011 21:11:42.654464466     318 chttp2_transport.cc:1111]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1011 21:11:42.654477735     318 chttp2_transport.cc:1111]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1011 21:11:42.654491383     318 chttp2_transport.cc:1111]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1011 21:11:42.654526841     318 client_channel.cc:647]      chand=0x557868333fe0: Illegal keepalive throttling value 9223372036854775807
Remote Server is unavailable, please check network connection: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "{"created":"@1665522702.654717990","description":"Error received from peer ipv4:10.212.208.5:6006","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Socket closed","grpc_status":14}"
>, attempt: 1
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.local/lib/python3.7/site-packages/aim/ext/transport/rpc_queue.py", line 48, in worker
    if self._try_exec_task(task_f, *args):
  File "/home/user/.local/lib/python3.7/site-packages/aim/ext/transport/rpc_queue.py", line 69, in _try_exec_task
    raise e
  File "/home/user/.local/lib/python3.7/site-packages/aim/ext/transport/rpc_queue.py", line 65, in _try_exec_task
    task_f(*args)
  File "/home/user/.local/lib/python3.7/site-packages/aim/ext/transport/client.py", line 150, in _run_write_instructions
    response = self.remote.run_write_instructions(message_stream_generator())
  File "/home/user/.local/lib/python3.7/site-packages/grpc/_channel.py", line 1131, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/user/.local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating requests!"
        debug_error_string = "None"

gorarakelyan commented 1 year ago

@vishalghor Do you track logs from the worker nodes as well during distributed training?

vishalghor commented 1 year ago

@gorarakelyan No, I am tracking only from node 0 / rank 0 during distributed training. Also, I have seen the error on single-GPU training as well, and I believe the issue could be similar to https://github.com/aimhubio/aim/issues/1297. I may not have mentioned it earlier, but I'm using the Aim remote server for tracking experiments (https://aimstack.readthedocs.io/en/latest/using/remote_tracking.html).
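
A minimal sketch of tracking from rank 0 only, for readers who want to reproduce this setup. It assumes the launcher exports a `RANK` environment variable (as `torchrun` does); `is_rank_zero`, `make_run`, and `track` are hypothetical helper names, and the actual training script from this report is not shown in the thread.

```python
import os


def is_rank_zero(env=None):
    """Return True when this process is global rank 0.

    Assumption: the launcher sets RANK; a missing RANK is treated as a
    single-process run.
    """
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def make_run(repo, env=None):
    """Create an Aim Run on rank 0 only; every other rank gets None."""
    if not is_rank_zero(env):
        return None
    # Imported lazily so non-zero ranks never open a remote connection.
    from aim import Run
    return Run(repo=repo)


def track(run, value, name, step):
    """Forward a metric to Aim if this process owns a Run, otherwise do nothing."""
    if run is not None:
        run.track(value, name=name, step=step)


# Usage inside a training loop (the address is a placeholder):
#   run = make_run("aim://<host>:<port>")
#   track(run, loss_value, name="loss", step=global_step)
```

Routing every call through `track` keeps the training loop identical on all ranks while only one process talks to the remote server.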

gorarakelyan commented 1 year ago

@vishalghor thanks. @mihran113 is currently investigating the issue. We will share more details and progress here soon.

mihran113 commented 1 year ago

Hey @vishalghor. It is apparently the same issue that you've mentioned. We've struggled a lot to reproduce it on our side, with no luck so far. Could you please try the suggestion in https://github.com/aimhubio/aim/issues/1297#issuecomment-1098978362 and see if it helps, so that this doesn't block you further? In the meantime I'll try to reproduce it and find a fix.

vishalghor commented 1 year ago

@mihran113 I did try these suggestions, but they didn't resolve the issue. Thank you for looking into this.

alberttorosyan commented 2 months ago

Hi @vishalghor! The latest version of Aim uses a different protocol (based on HTTP and WebSockets) to send data to the remote server. Could you please try it out? This issue will certainly be gone, as gRPC is not used anymore; I just want to make sure the new version works fine in your setup.