Improbable-AI / walk-these-ways

Sim-to-real RL training and deployment tools for the Unitree Go1 robot.
https://gmargo11.github.io/walk-these-ways/

ValueError invalid values #49

Closed willxxy closed 2 weeks ago

willxxy commented 10 months ago

I get the following error when I run the train.py script.

  File "scripts/train.py", line 256, in <module>
    train_go1(headless=False)
  File "scripts/train.py", line 216, in train_go1
    runner.learn(num_learning_iterations=100000, init_at_random_ep_len=True, eval_freq=100)
  File "/data/william/walk-these-ways/go1_gym_learn/ppo_cse/__init__.py", line 204, in learn
    mean_value_loss, mean_surrogate_loss, mean_adaptation_module_loss, mean_decoder_loss, mean_decoder_loss_student, mean_adaptation_module_test_loss, mean_decoder_test_loss, mean_decoder_test_loss_student = self.alg.update()
  File "/data/william/walk-these-ways/go1_gym_learn/ppo_cse/ppo.py", line 110, in update
    self.actor_critic.act(obs_history_batch, masks=masks_batch)
  File "/data/william/walk-these-ways/go1_gym_learn/ppo_cse/actor_critic.py", line 119, in act
    self.update_distribution(observation_history)
  File "/data/william/walk-these-ways/go1_gym_learn/ppo_cse/actor_critic.py", line 116, in update_distribution
    self.distribution = Normal(mean, mean * 0. + self.std)
  File "/home/william/anaconda3/envs/rob/lib/python3.8/site-packages/torch/distributions/normal.py", line 50, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/william/anaconda3/envs/rob/lib/python3.8/site-packages/torch/distributions/distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (24000, 12)) of distribution Normal(loc: torch.Size([24000, 12]), scale: torch.Size([24000, 12])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<AddmmBackward0>)

gmargo11 commented 6 months ago

Hi @willxxy,

Sorry to have left this issue open for so long. We've recently noticed a similar issue with some newer versions of PyTorch. If you still encounter this error (or if any future user comes across this post), can you please confirm your PyTorch version and try running the script in an environment with torch==1.10.0+cu113?
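
For anyone who wants a quick way to confirm the version, here is a minimal sketch using only the Python standard library. The helper names `installed_version` and `describe_torch` are hypothetical and not part of this repo, and the sketch assumes the version string recorded in the package metadata is the one that matters (it may not tell you whether your CUDA driver actually matches).

```python
from importlib.metadata import PackageNotFoundError, version

# Version suggested in this thread; treat it as a starting point, not a guarantee.
SUGGESTED_TORCH = "1.10.0+cu113"


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def describe_torch(found, suggested=SUGGESTED_TORCH):
    """Summarize how the installed torch version compares with the suggested one."""
    if found is None:
        return "torch is not installed in this environment"
    if found == suggested:
        return f"torch {found} matches the suggested version"
    return f"torch {found} differs from the suggested {suggested}"


if __name__ == "__main__":
    # Reads package metadata only, so it works without importing torch or touching the GPU.
    print(describe_torch(installed_version("torch")))
```

Running this inside the conda environment used for training and pasting the printed line here should be enough to tell whether the environment differs from the suggested one.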