TimZaman / dotaclient

distributed RL spaghetti al arabiata

Optimizer drops out bc of rmq #3

Closed TimZaman closed 5 years ago

TimZaman commented 5 years ago

An optimizer drops out (here, worker 9 of 12 optimizers) with the error below. RMQ itself is fine. [image]

The other optimizers then drop out:

Traceback (most recent call last):
  File "optimizer.py", line 353, in <module>
    pretrained_model=args.pretrained_model,
  File "optimizer.py", line 335, in main
    dota_optimizer.run()
  File "optimizer.py", line 171, in run
    self.step(experiences=experiences)
  File "optimizer.py", line 205, in step
    loss = self.finish_episode(rewards=all_discounted_rewards, log_probs=all_logprobs)
  File "optimizer.py", line 118, in finish_episode
    loss.backward()
  File "/root/.local/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/root/dotaclient/distributed.py", line 37, in allreduce_params
    dist.all_reduce(has_grad_count, op=dist.ReduceOp.SUM) # [0. to world_size]
  File "/root/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 838, in all_reduce
    work.wait()
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:543] Connection closed by peer [10.20.94.2]:15019

TimZaman commented 5 years ago

Fixed, seems stable now.