PKU-MARL / DexterousHands

This is a library that provides dual dexterous hand manipulation tasks through Isaac Gym.
https://pku-marl.github.io/DexterousHands/
Apache License 2.0

Possible bug? (ValueError: The parameter loc has invalid values) #4

Open · sheffier opened this issue 2 years ago

sheffier commented 2 years ago

Hello,

First, let me thank you for open-sourcing this great framework. However, I am unable to run the training without getting the following error:

Traceback (most recent call last):
  File "train.py", line 95, in <module>
    train()
  File "train.py", line 47, in train
    sarl.run(num_learning_iterations=iterations, log_interval=cfg_train["learn"]["save_interval"])
  File "/media/data/users/erez/repos/bi-dexhands/main/bi-dexhands/algorithms/rl/ppo/ppo.py", line 142, in run
    actions, actions_log_prob, values, mu, sigma = self.actor_critic.act(current_obs, current_states)
  File "/media/data/users/erez/repos/bi-dexhands/main/bi-dexhands/algorithms/rl/ppo/module.py", line 77, in act
    distribution = MultivariateNormal(actions_mean, scale_tril=covariance)
  File "/home/ubuntu/miniconda3/envs/bidexhands/lib/python3.7/site-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/home/ubuntu/miniconda3/envs/bidexhands/lib/python3.7/site-packages/torch/distributions/distribution.py", line 53, in __init__
    raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter loc has invalid values

I've been trying to train using task=ShadowHandPushBlock, and no matter what I tried (various versions of Python/PyTorch/CUDA/NVIDIA drivers), this error always pops up in a nondeterministic manner (i.e., not at the same iteration).
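
For reference, the error itself is easy to reproduce outside the framework whenever the distribution mean contains a NaN. A minimal sketch in plain PyTorch (no Isaac Gym involved):

import torch
from torch.distributions import MultivariateNormal

# PyTorch validates distribution parameters by default since 1.8, so a
# NaN in the mean ("loc") raises the same ValueError as in the traceback.
mean = torch.tensor([float("nan"), 0.0])
MultivariateNormal(mean, scale_tril=torch.eye(2))
# ValueError: The parameter loc has invalid values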

The environment I was using, which resulted in the attached traceback:

  • Python 3.7
  • Pytorch 1.8.1
  • Torchvision 0.9.1
  • CudaToolkit 11.1.1

cypypccpy commented 2 years ago

Hello @sheffier,

Thanks for your report and appreciation! Yes, it's an Isaac Gym bug; we encountered it ourselves while using the framework. We believe it is caused by instability in the Isaac Gym physics engine when handling collisions, which produces NaN values in the observations and in turn breaks the neural network training. Details can be found on the Isaac Gym devtalk forum.
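
To confirm that the NaNs really do enter through the observations, you can check the tensor right before the policy call. A small diagnostic sketch (current_obs follows the naming in the traceback above; the check itself is illustrative, not part of the framework):

import torch

def report_bad_obs(current_obs: torch.Tensor) -> None:
    # Flag the parallel envs whose observation vector contains NaN/Inf
    bad = ~torch.isfinite(current_obs)
    if bad.any():
        env_ids = bad.any(dim=-1).nonzero(as_tuple=False).flatten()
        print(f"non-finite observations in envs: {env_ids.tolist()}")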

Unfortunately, since this is an Isaac Gym issue, I don't think there is a way to fully fix this bug right now. However, two workarounds can be used in the meantime: 1) when NaN values appear in the observations and rewards fed to the neural network, filter them out so that training can continue (see the sketch below); 2) following the official NVIDIA recommendation, adjust the simulation parameters to reduce the probability of triggering the bug.
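
For option 1), the simplest form is to replace non-finite entries before they reach the network. A minimal sketch, assuming the observations and rewards are torch tensors (torch.nan_to_num is available since PyTorch 1.8; the zero fill value is illustrative, not a tuned choice):

import torch

def sanitize(obs: torch.Tensor, rew: torch.Tensor):
    # Replace NaN/Inf entries with zeros so PPO can keep training
    obs = torch.nan_to_num(obs, nan=0.0, posinf=0.0, neginf=0.0)
    rew = torch.nan_to_num(rew, nan=0.0, posinf=0.0, neginf=0.0)
    return obs, rew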

For option 2), I found that the adjustment below reduces the bug quite well; you can try it and see if it helps. Open the environment config file and look at the sim section (for example https://github.com/PKU-MARL/DexterousHands/blob/main/bi-dexhands/cfg/shadow_hand_push_block.yaml):

sim:  
  substeps: 2 
  physx: 
    num_threads: 4 
    solver_type: 1  # 0: pgs, 1: tgs 
    num_position_iterations: 8 
    num_velocity_iterations: 0 
    contact_offset: 0.002 
    rest_offset: 0.0 
    bounce_threshold_velocity: 0.2 
    max_depenetration_velocity: 1000.0 
    default_buffer_size_multiplier: 5.0 
  flex: 
    num_outer_iterations: 5 
    num_inner_iterations: 20 
    warm_start: 0.8 
    relaxation: 0.75 

I found that lowering the num_position_iterations parameter helped with this, so I adjusted it as follows (the only change from the default is num_position_iterations: 8 → 4):

sim:
  substeps: 2
  physx:
    num_threads: 4
    solver_type: 1  # 0: pgs, 1: tgs
    num_position_iterations: 4
    num_velocity_iterations: 0
    contact_offset: 0.002
    rest_offset: 0.0
    bounce_threshold_velocity: 0.2
    max_depenetration_velocity: 1000.0
    default_buffer_size_multiplier: 5.0
  flex:
    num_outer_iterations: 5
    num_inner_iterations: 20
    warm_start: 0.8
    relaxation: 0.75

Hope this helps.

sheffier commented 2 years ago

Thanks for the detailed answer! Much appreciated.