facebookresearch / rlmeta

RLMeta is a light-weight flexible framework for Distributed Reinforcement Learning Research.
MIT License

m_server::push time out and m_server::act time out #15

Closed lmlaaron closed 2 years ago

lmlaaron commented 2 years ago
m_server_name: "m_server"
m_server_addr: "127.0.0.1:4411"

r_server_name: "r_server"
r_server_addr: "127.0.0.1:4412"

c_server_name: "c_server"
c_server_addr: "127.0.0.1:4413"

train_device: "cuda:0"
infer_device: "cuda:0"

timeout: 180

env: "PongNoFrameskip-v4"
max_episode_steps: 2700

num_train_rollouts: 1 
num_train_workers: 1

num_eval_rollouts: 1
num_eval_workers: 1

replay_buffer_size: 1024 
prefetch: 2

batch_size: 32
lr: 3e-4
push_every_n_steps: 50

num_epochs: 1000
steps_per_epoch: 3000

num_eval_episodes: 20

train_seed: 123
eval_seed: 456
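For reference, a config like the one above is consumed by a Hydra entry point (examples/atari/ppo/atari_ppo.py in the traceback below), and individual fields can be overridden on the command line, as the overrides env=PongNoFrameskip-v4 num_epochs=20 in the error output show. Below is a minimal, simplified sketch of such an entry point using only standard Hydra/OmegaConf APIs; apart from the config field names, everything here is illustrative and not the actual rlmeta code:

```python
# Simplified sketch of a Hydra entry point consuming conf_ppo.yaml.
# Illustrative only; the real examples/atari/ppo/atari_ppo.py also builds
# the model/replay/controller servers, loops, and the PPO agent.
import logging

import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="conf", config_name="conf_ppo")
def main(cfg: DictConfig) -> None:
    # Mirrors the first log line below: dump the resolved config as a dict.
    logging.info(OmegaConf.to_container(cfg, resolve=True))

    # Fields such as cfg.timeout, cfg.num_epochs, and cfg.steps_per_epoch are
    # then used to set up the remote servers and the training loop, e.g.
    # stats = agent.train(cfg.steps_per_epoch) as in the traceback below.
    for epoch in range(cfg.num_epochs):
        ...


if __name__ == "__main__":
    main()
```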

Here is what I got:

[2022-01-18 18:34:54,797][root][INFO] - {'m_server_name': 'm_server', 'm_server_addr': '127.0.0.1:4411', 'r_server_name': 'r_server', 'r_server_addr': '127.0.0.1:4412', 'c_server_name': 'c_server', 'c_server_addr': '127.0.0.1:4413', 'train_device': 'cuda:0', 'infer_device': 'cuda:0', 'env': 'PongNoFrameskip-v4', 'max_episode_steps': 2700, 'num_train_rollouts': 1, 'num_train_workers': 1, 'num_eval_rollouts': 1, 'num_eval_workers': 1, 'replay_buffer_size': 1024, 'prefetch': 2, 'batch_size': 8, 'lr': 0.0003, 'push_every_n_steps': 100, 'num_epochs': 20, 'steps_per_epoch': 300, 'num_eval_episodes': 20, 'train_seed': 123, 'eval_seed': 456}
[2022-01-18 18:35:08,193][root][INFO] - Warming up replay buffer: [    0 / 1024 ]
[2022-01-18 18:35:09,194][root][INFO] - Warming up replay buffer: [    0 / 1024 ]
[2022-01-18 18:35:10,196][root][INFO] - Warming up replay buffer: [    0 / 1024 ]
[2022-01-18 18:35:11,198][root][INFO] - Warming up replay buffer: [    0 / 1024 ]
[2022-01-18 18:35:12,220][root][INFO] - Warming up replay buffer: [    0 / 1024 ]
[2022-01-18 18:35:13,222][root][INFO] - Warming up replay buffer: [  894 / 1024 ]
[2022-01-18 18:35:14,228][root][INFO] - Warming up replay buffer: [  894 / 1024 ]
[2022-01-18 18:35:15,229][root][INFO] - Warming up replay buffer: [  894 / 1024 ]
[2022-01-18 18:35:16,231][root][INFO] - Warming up replay buffer: [ 1024 / 1024 ]
Exception in callback handle_task_exception(<Task finishe...) timed out')>) at /media/research/ml2558/rlmeta/rlmeta/utils/asycio_utils.py:11
handle: <Handle handle_task_exception(<Task finishe...) timed out')>) at /media/research/ml2558/rlmeta/rlmeta/utils/asycio_utils.py:11>
Traceback (most recent call last):
  File "/home/ml2558/miniconda3/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/media/research/ml2558/rlmeta/rlmeta/utils/asycio_utils.py", line 17, in handle_task_exception
    raise e
  File "/media/research/ml2558/rlmeta/rlmeta/utils/asycio_utils.py", line 13, in handle_task_exception
    task.result()
  File "/media/research/ml2558/rlmeta/rlmeta/core/loop.py", line 161, in _run_loop
    stats = await self._run_episode(env, agent, index)
  File "/media/research/ml2558/rlmeta/rlmeta/core/loop.py", line 182, in _run_episode
    action = await agent.async_act(timestep)
  File "/media/research/ml2558/rlmeta/rlmeta/agents/ppo/ppo_agent.py", line 78, in async_act
    action, logpi, v = await self.model.async_act(
RuntimeError: Call (m_server::act) timed out
Error executing job with overrides: ['env=PongNoFrameskip-v4', 'num_epochs=20']
Traceback (most recent call last):
  File "/media/research/ml2558/rlmeta/examples/atari/ppo/atari_ppo.py", line 96, in main
    stats = agent.train(cfg.steps_per_epoch)
  File "/media/research/ml2558/rlmeta/rlmeta/agents/ppo/ppo_agent.py", line 139, in train
    self.model.push()
  File "/media/research/ml2558/rlmeta/rlmeta/core/model.py", line 69, in push
    self.client.sync(self.server_name, "push", state_dict)
RuntimeError: Call (m_server::<unknown>) timed out

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
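For context on where the error surfaces: the traceback runs through rlmeta/utils/asycio_utils.py, whose handle_task_exception callback re-raises exceptions from background tasks (here the rollout loop that calls agent.async_act), so the timed-out remote call is reported instead of being silently swallowed by the event loop. A minimal sketch of that done-callback pattern in plain asyncio follows; the function bodies are illustrative, not the actual rlmeta implementation:

```python
# Minimal sketch of the "re-raise exceptions from background tasks" pattern
# suggested by the traceback above; not the actual rlmeta code.
import asyncio


def handle_task_exception(task: asyncio.Task) -> None:
    # Calling task.result() on a finished task re-raises any exception it
    # stored (e.g. RuntimeError: Call (m_server::act) timed out).
    try:
        task.result()
    except asyncio.CancelledError:
        pass  # Cancellation is expected during shutdown.
    except Exception as e:
        raise e


def create_task(loop: asyncio.AbstractEventLoop, coro) -> asyncio.Task:
    # Schedule a coroutine and attach the callback so that a failure inside
    # the rollout loop is reported instead of disappearing.
    task = loop.create_task(coro)
    task.add_done_callback(handle_task_exception)
    return task
```

The helper only surfaces the failure; the timeout itself comes from the m_server remote calls (act and push) discussed below.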

I tried modifying the timeout, but I still get the same error. Any hints on how to resolve this?

xiaomengy commented 2 years ago

Hi, thanks for the feedback. To be honest, our work has previously run mostly on our internal clusters, which have about 80 CPU cores per node, so we never saw such issues there. The issue seems to come mainly from the moolib and tensorpipe backends. We are working with the moolib team to resolve it. Let's keep this issue open, and we will post updates on our progress here.

lmlaaron commented 2 years ago

I also tried it on an AWS g3.8xlarge instance (32 CPUs, 2 GPUs, 240 GB RAM, 8 GB RAM per GPU) and observed the same error.

xiaomengy commented 2 years ago

The root cause of this issue appears to be in moolib. We have created https://github.com/facebookresearch/moolib/issues/6 to track it, and we will fix it ASAP.

xiaomengy commented 2 years ago

https://github.com/facebookresearch/moolib/pull/7 should fix the issue. Could you pull the latest main, rebuild everything, and then try running the example on your local machine? I tried that on my personal desktop and it appears to work.

lmlaaron commented 2 years ago

It works on my desktop after pulling the latest main branch, with the following configuration: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, GTX 1080 with 8 GB of memory, Ubuntu 16.04, CUDA 10.2. Though I had to change infer_device in examples/atari/ppo/conf/conf_ppo.yaml to cuda:0, since I only have one CUDA device on my desktop.
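Since the example is driven by Hydra (the log above already shows overrides like env=PongNoFrameskip-v4 num_epochs=20), the same change could presumably also be made without editing the YAML by passing infer_device="cuda:0" as a command-line override of the existing config field.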