Closed. GraV1337y closed this issue 3 years ago.
Seems like the sequential creation of envs is too fast for the build. We want to find out whether using a lock file can fix this issue.
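A minimal sketch of the lock-file idea, using the stdlib `fcntl` module. `LOCK_PATH`, `create_env_serialized`, and `make_env` are placeholder names for illustration, not names from the project:

```python
import fcntl

# Hypothetical lock-file approach: serialize env creation across worker
# processes so no two envs are constructed at the same time.
LOCK_PATH = "/tmp/env_creation.lock"  # placeholder path

def create_env_serialized(make_env):
    """Create an env while holding an exclusive file lock."""
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until no other process holds it
        try:
            return make_env()  # only one process builds an env at a time
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

`flock` locks are advisory, so every worker would have to go through this helper for the serialization to hold.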
The real problem was the limit on the maximum number of processes per user, which is set by LiDo. One can raise this limit, e.g. with the command
ulimit -u 32768
before executing the train script. 32768 threads are sufficient for 320 environments of the saving_training_single environment. Further research is required to determine optimal values for the number of threads and environments.
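For reference, the same limit can be inspected and raised from inside the train script via the stdlib `resource` module, which is the Python equivalent of `ulimit -u`. A sketch (the value 32768 comes from the runs above; the soft limit can only be raised up to the hard limit, which only root can change):

```python
import resource

# Inspect the per-user process limit (`ulimit -u`); on Linux, threads
# count against this limit too, which is why many envs can exhaust it.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"soft={soft} hard={hard}")

# Raise the soft limit toward 32768 (enough for 320 envs in our runs),
# capped at the hard limit.
target = 32768 if hard == resource.RLIM_INFINITY else min(32768, hard)
resource.setrlimit(resource.RLIMIT_NPROC, (target, hard))
```

Doing this in the script itself avoids depending on every user remembering the `ulimit` call before launching.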
Later in training, we get this error (the warning about a non-finite gradient norm suggests NaNs in the gradients, which then reach the action distribution's mean):
/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/learner.py:834: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
torch.nn.utils.clip_grad_norm_(self.actor_critic.parameters(), self.cfg.max_grad_norm)
Exception in thread Thread-4:
Traceback (most recent call last):
File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/threading.py", line 954, in _bootstrap_inner
self.run()
File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/threading.py", line 892, in run
self._target(*self._args, **self._kwargs)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/learner.py", line 1129, in _train_loop
self._process_training_data(data, timing, wait_stats)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/learner.py", line 1079, in _process_training_data
train_stats = self._train(buffer, batch_size, experience_size, timing)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/learner.py", line 725, in _train
result = self.actor_critic.forward_tail(core_outputs, with_action_distribution=True)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/model.py", line 92, in forward_tail
action_distribution_params, action_distribution = self.action_parameterization(core_output)
File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/model_utils.py", line 424, in forward
action_distribution = get_action_distribution(self.action_space, raw_logits=action_distribution_params)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/utils/action_distributions.py", line 57, in get_action_distribution
return ContinuousActionDistribution(params=raw_logits)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/utils/action_distributions.py", line 254, in __init__
normal_dist = Normal(self.means, self.stddevs)
File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/distributions/normal.py", line 50, in __init__
super(Normal, self).__init__(batch_shape, validate_args=validate_args)
File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/distributions/distribution.py", line 53, in __init__
raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter loc has invalid values
[2021-06-25 11:37:41,181][26466] Unknown exception on policy worker
Traceback (most recent call last):
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/policy_worker.py", line 245, in _run
self._handle_policy_steps(timing)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/policy_worker.py", line 108, in _handle_policy_steps
policy_outputs = self.actor_critic(observations, rnn_states)
File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/model.py", line 112, in forward
result = self.forward_tail(x, with_action_distribution=with_action_distribution)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/model.py", line 92, in forward_tail
action_distribution_params, action_distribution = self.action_parameterization(core_output)
File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/appo/model_utils.py", line 424, in forward
action_distribution = get_action_distribution(self.action_space, raw_logits=action_distribution_params)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/utils/action_distributions.py", line 57, in get_action_distribution
return ContinuousActionDistribution(params=raw_logits)
File "/work/smyawege/job/multi-sample-factory/multi_sample_factory/algorithms/utils/action_distributions.py", line 254, in __init__
normal_dist = Normal(self.means, self.stddevs)
File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/distributions/normal.py", line 50, in __init__
super(Normal, self).__init__(batch_shape, validate_args=validate_args)
File "/work/smyawege/anaconda3/envs/multi-sample-factory/lib/python3.9/site-packages/torch/distributions/distribution.py", line 53, in __init__
raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter loc has invalid values
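Both tracebacks point at the same root cause: non-finite gradients eventually produce a NaN mean (`loc`) in the policy head. A minimal sketch in stock PyTorch (none of the project's classes) of how to fail fast at the gradient-clipping step, and of why `Normal` rejects the parameters:

```python
import torch

# Failing fast at the clipping step: with error_if_nonfinite=True,
# clip_grad_norm_ raises on a non-finite total norm instead of emitting
# the FutureWarning seen in the log above.
model = torch.nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0,
                               error_if_nonfinite=True)

# Why the ValueError fires: Normal validates its parameters, and a NaN
# mean fails the `loc` constraint, exactly as in the tracebacks.
bad_loc = torch.tensor([0.0, float("nan")])
try:
    torch.distributions.Normal(bad_loc, torch.ones(2), validate_args=True)
except ValueError:
    print("NaN loc rejected")
```

Raising at the clipping step would surface the first non-finite gradient in the learner instead of letting NaNs propagate until the policy worker crashes.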
ulimit -u 131072 is the maximum with 480 envs. There is a problem with scaling the number of envs further.