chengxuxin / extreme-parkour

Train your parkour robot in less than 20 hours.
https://extreme-parkour.github.io

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 106.00 MiB (GPU 0; 11.77 GiB total capacity; 6.69 GiB already allocated; 86.62 MiB free; 6.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF #23

Open XanderWangXY opened 5 months ago

XanderWangXY commented 5 months ago

Hello. I ran into the following error while running the distillation step. How can I solve it? The command was:
python train.py --exptid yyy-yy-WHATEVER --device cuda:0 --resume --resumeid xxx-xx --delay --use_camera


Creating env...
100%|████████████████████████████████████████| 192/192 [00:00<00:00, 303.19it/s]
/home/adc/anaconda3/envs/parkour/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/adc/anaconda3/envs/parkour/lib/python3.8/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
  warnings.warn(
Using MLP and Priviliged Env encoder ActorCritic structure
ActorCritic.__init__ got unexpected arguments, which will be ignored: ['continue_from_last_std', 'priv_encoder_dims', 'rnn_hidden_size', 'rnn_num_layers', 'rnn_type', 'tanh_encoder_output']
/home/adc/extreme-parkour/legged_gym/logs/parkour_new/xxx-xx -1


Loading model from /home/adc/extreme-parkour/legged_gym/logs/parkour_new/xxx-xx-WHATEVER/model_4800.pt...
/home/adc/extreme-parkour/rsl_rl/rsl_rl/runners/on_policy_runner.py:506: UserWarning: 'depth_encoder_state_dict' key does not exist, not loading depth encoder...
  warnings.warn("'depth_encoder_state_dict' key does not exist, not loading depth encoder...")
No saved depth actor, Copying actor critic actor to depth actor...


Traceback (most recent call last):
  File "train.py", line 70, in <module>
    train(args)
  File "train.py", line 65, in train
    ppo_runner.learn(num_learning_iterations=train_cfg.runner.max_iterations, init_at_random_ep_len=True)
  File "/home/adc/extreme-parkour/rsl_rl/rsl_rl/runners/on_policy_runner.py", line 255, in learn_vision
    depth_latent_and_yaw = self.alg.depth_encoder(infos["depth"].clone(), obs_prop_depth)  # clone is crucial to avoid in-place operation
  File "/home/adc/anaconda3/envs/parkour/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc/extreme-parkour/rsl_rl/rsl_rl/modules/depth_backbone.py", line 32, in forward
    depth_image = self.base_backbone(depth_image)
  File "/home/adc/anaconda3/envs/parkour/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc/extreme-parkour/rsl_rl/rsl_rl/modules/depth_backbone.py", line 98, in forward
    images_compressed = self.image_compression(images.unsqueeze(1))
  File "/home/adc/anaconda3/envs/parkour/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc/anaconda3/envs/parkour/lib/python3.8/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/adc/anaconda3/envs/parkour/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/adc/anaconda3/envs/parkour/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/adc/anaconda3/envs/parkour/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 106.00 MiB (GPU 0; 11.77 GiB total capacity; 6.69 GiB already allocated; 86.62 MiB free; 6.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
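
The numbers in the error message can be broken down to see where the GPU memory went. The sketch below parses the quoted message with the standard library only and computes two derived values: memory reserved by PyTorch but not currently allocated (the fragmentation candidate), and memory that is neither free nor reserved by PyTorch. The helper names `parse_oom_message` and `summarize` are hypothetical, and the reading of the second value as "held outside PyTorch's caching allocator" (for example by the simulator, the CUDA context, or other processes) is an assumption, not something the message states.

```python
import re

# Error text copied from the traceback above.
MESSAGE = (
    "Tried to allocate 106.00 MiB (GPU 0; 11.77 GiB total capacity; "
    "6.69 GiB already allocated; 86.62 MiB free; "
    "6.74 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Return the sizes quoted in a PyTorch OOM message, converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


def summarize(sizes):
    """Derive two quantities (in MiB) that the message does not print."""
    return {
        # Reserved by PyTorch but not in use: large values hint at fragmentation.
        "cached_unused": sizes["reserved"] - sizes["allocated"],
        # Neither free nor reserved by PyTorch (assumed: held outside its allocator).
        "outside_pytorch": sizes["total"] - sizes["reserved"] - sizes["free"],
    }


if __name__ == "__main__":
    sizes = parse_oom_message(MESSAGE)
    for name, mib in summarize(sizes).items():
        print(f"{name}: {mib:.1f} MiB")
```

For this message, reserved and allocated memory differ by only about 51 MiB, so the `max_split_size_mb` hint probably does not apply here. Roughly 4.9 GiB of the 11.77 GiB is accounted for outside PyTorch's reserved pool, which suggests the card is simply full and that lowering the workload's memory footprint is the more likely fix.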

XanderWangXY commented 5 months ago

I solved this problem by changing line 393 of /extreme-parkour/legged_gym/legged_gym/envs/base/legged_robot_config.py.

1242713693 commented 2 months ago

I followed the documented procedure, but the system reported an error. I have tried many fixes without success and would appreciate your help. Thank you very much.