PG642 / multi-sample-factory

High throughput reinforcement learning on clusters
MIT License

Learner crashes with num_envs_per_worker > 2 #6

Closed: GraV1337y closed this issue 3 years ago

GraV1337y commented 3 years ago

Problem with scaling the number of environments.

[2021-06-15 18:16:38,642][01756] Unknown exception in rollout worker
Traceback (most recent call last):
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/site-packages/mlagents_envs/rpc_communicator.py", line 69, in create_server
    self.server.start()
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/site-packages/grpc/_server.py", line 980, in start
    _start(self._state)
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/site-packages/grpc/_server.py", line 936, in _start
    thread.start()
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/threading.py", line 874, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/work/smyawege/17538170/multi-sample-factory/multi_sample_factory/algorithms/appo/actor_worker.py", line 874, in _run
    self._init()
  File "/work/smyawege/17538170/multi-sample-factory/multi_sample_factory/algorithms/appo/actor_worker.py", line 737, in _init
    env_runner.init()
  File "/work/smyawege/17538170/multi-sample-factory/multi_sample_factory/algorithms/appo/actor_worker.py", line 357, in init
    env = make_env_func(self.cfg, env_config=env_config)
  File "/work/smyawege/17538170/multi-sample-factory/multi_sample_factory/algorithms/appo/appo_utils.py", line 38, in make_env_func
    env = create_env(cfg.env, cfg=cfg, env_config=env_config)
  File "/work/smyawege/17538170/multi-sample-factory/multi_sample_factory/envs/create_env.py", line 22, in create_env
    env = env_registry_entry.make_env_func(full_env_name, cfg=cfg, env_config=env_config)
  File "/work/smyawege/17538170/multi-sample-factory/multi_sample_factory_examples/train_rocket_league_env.py", line 43, in make_rocket_league_env_func
    unity_env = UnityEnvironment(file_name=full_env_name, seed=1, side_channels=[], worker_id=env_config.env_id)
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/site-packages/mlagents_envs/environment.py", line 187, in __init__
    self._communicator = self._get_communicator(worker_id, base_port, timeout_wait)
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/site-packages/mlagents_envs/environment.py", line 252, in _get_communicator
    return RpcCommunicator(worker_id, base_port, timeout_wait)
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/site-packages/mlagents_envs/rpc_communicator.py", line 51, in __init__
    self.create_server()
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/site-packages/mlagents_envs/rpc_communicator.py", line 72, in create_server
    raise UnityWorkerInUseException(self.worker_id)
mlagents_envs.exception.UnityWorkerInUseException: Couldn't start socket communication because worker number 51 is still in use. You may need to manually close a previously opened environment or use a different worker number.
[2021-06-15 18:16:38,652][01772] Unknown exception in rollout worker
Traceback (most recent call last):
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/site-packages/mlagents_envs/rpc_communicator.py", line 69, in create_server
    self.server.start()
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/site-packages/grpc/_server.py", line 980, in start
    _start(self._state)
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/site-packages/grpc/_server.py", line 936, in _start
    thread.start()
  File "/work/smyawege/anaconda3/envs/pip-multi-sample-factory/lib/python3.9/threading.py", line 874, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

During handling of the above exception, another exception occurred:

GraV1337y commented 3 years ago

It seems that the environments are created sequentially faster than the Unity build can keep up with. We will investigate whether a "lock file" can fix this issue.
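A rough sketch of what that lock-file idea could look like (the helper below is hypothetical and not part of multi-sample-factory): each rollout worker takes an exclusive lock on a shared file while it creates its UnityEnvironment, so only one Unity build starts up and binds its port at a time.

import fcntl
import time

LOCK_PATH = "/tmp/msf_unity_env_creation.lock"  # assumed path shared by all workers on the node

def create_env_with_lock(make_env_fn, startup_delay=1.0):
    # make_env_fn would be e.g. a lambda wrapping the UnityEnvironment(...) call
    # from make_rocket_league_env_func, with the desired worker_id already bound.
    with open(LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # block until no other worker is starting an env
        try:
            env = make_env_fn()
            time.sleep(startup_delay)  # give the build a moment to finish binding its gRPC port
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return env

Whether the extra startup_delay is needed at all would have to be tested against the build.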

KonstantinRamthun commented 3 years ago

The real problem was the limit on the maximum number of processes per user, which is set by LiDo. This limit can be raised, e.g. with the command ulimit -u 32768, before executing the train script. 32768 threads are sufficient for 320 environments of the saving_training_single environment. Further research is required to determine an optimal value for the number of threads and environments.
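As a rough sketch (this is not something the repo does), the same limit could also be raised from inside the training script via Python's resource module, since RLIMIT_NPROC counts threads as well as processes on Linux:

import resource

TARGET_NPROC = 32768  # value that was sufficient for 320 environments

# The soft limit can only be raised up to the hard limit without extra privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
if soft != resource.RLIM_INFINITY and soft < TARGET_NPROC:
    new_soft = TARGET_NPROC if hard == resource.RLIM_INFINITY else min(TARGET_NPROC, hard)
    resource.setrlimit(resource.RLIMIT_NPROC, (new_soft, hard))

This has the same effect as running ulimit -u 32768 in the same shell before starting the script.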

Later in training, we get this error:

/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/learner.py:834: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
  torch.nn.utils.clip_grad_norm_(self.actor_critic.parameters(), self.cfg.max_grad_norm)
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/learner.py", line 1129, in _train_loop
    self._process_training_data(data, timing, wait_stats)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/learner.py", line 1079, in _process_training_data
    train_stats = self._train(buffer, batch_size, experience_size, timing)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/learner.py", line 725, in _train
    result = self.actor_critic.forward_tail(core_outputs, with_action_distribution=True)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/model.py", line 92, in forward_tail
    action_distribution_params, action_distribution = self.action_parameterization(core_output)
  File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/model_utils.py", line 424, in forward
    action_distribution = get_action_distribution(self.action_space, raw_logits=action_distribution_params)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/utils/action_distributions.py", line 57, in get_action_distribution
    return ContinuousActionDistribution(params=raw_logits)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/utils/action_distributions.py", line 254, in __init__
    normal_dist = Normal(self.means, self.stddevs)
  File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/distributions/normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/distributions/distribution.py", line 53, in __init__
[2021-06-25 11:37:41,181][26466] Unknown exception on policy worker
Traceback (most recent call last):
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/policy_worker.py", line 245, in _run
    self._handle_policy_steps(timing)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/policy_worker.py", line 108, in _handle_policy_steps
    policy_outputs = self.actor_critic(observations, rnn_states)
  File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/model.py", line 112, in forward
    result = self.forward_tail(x, with_action_distribution=with_action_distribution)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/model.py", line 92, in forward_tail
    action_distribution_params, action_distribution = self.action_parameterization(core_output)
  File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/model_utils.py", line 424, in forward
    action_distribution = get_action_distribution(self.action_space, raw_logits=action_distribution_params)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/utils/action_distributions.py", line 57, in get_action_distribution
    return ContinuousActionDistribution(params=raw_logits)
  File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/utils/action_distributions.py", line 254, in __init__
    normal_dist = Normal(self.means, self.stddevs)
  File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/distributions/normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/distributions/distribution.py", line 53, in __init__
    raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter loc has invalid values
    raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter loc has invalid values
GraV1337y commented 3 years ago

ulimit -u 131072 is the maximum we can set; it is enough for 480 envs.