ZiwenZhuang / parkour

[CoRL 2023] Robot Parkour Learning
https://robot-parkour.github.io
MIT License
516 stars 92 forks

GPU memory #31

Open EscorpioZzz opened 5 months ago

EscorpioZzz commented 5 months ago

Hi, thank you for your great work on parkour. I ran into a problem with my GPU memory: the program reported that an illegal memory access was encountered.

ZiwenZhuang commented 5 months ago

Hi, because isaacgym overrides the CUDA backend of PyTorch, it probably skips some validity checks when you operate on tensors. Could you locate which line of code triggers the illegal memory access error?
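One common way to narrow down the failing line, sketched here under assumptions: CUDA kernels are launched asynchronously, so the Python traceback for an illegal memory access often points at a later, unrelated line. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable before CUDA is initialized makes launches synchronous, so the traceback tends to land closer to the real operation. The helper name `enable_sync_cuda_errors` is hypothetical and not part of this repository.

```python
import os


def enable_sync_cuda_errors(env=os.environ):
    """Request synchronous CUDA kernel launches for easier debugging.

    Call this before CUDA is initialized, i.e. before the first import of
    isaacgym / torch in the script being debugged. `env` is injectable so
    the helper can be exercised without touching the real environment.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Typical use at the very top of a script (assumption, not the repo's code):
#   enable_sync_cuda_errors()
#   import isaacgym  # Isaac Gym expects to be imported before torch
#   import torch
```

The same effect can be had from the shell by prefixing the usual training command with `CUDA_LAUNCH_BLOCKING=1`. Training runs slower in this mode, so it is meant for debugging only.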

EscorpioZzz commented 5 months ago

Maybe legged_robot_field.py, line 330.

EscorpioZzz commented 5 months ago

Besides that, it has also shown CUDA out of memory before. Does that mean my GPU cannot run this project?

ZiwenZhuang commented 5 months ago

Hmmm, line 330 in legged_robot_field.py converts obs_scale into a tensor if needed. This cannot be the code that triggers the error.

For the out-of-memory error, maybe there are too many short episodes (shorter than num_steps_per_env). Try inspecting the training procedure by removing the --headless flag and decreasing num_envs in class env and num_rows, num_cols in class terrain. One possible cause of "too many short episodes" is that the robot is initialized in a termination condition.

I run all my experiments on an Nvidia 3090 GPU with 23GB memory.

Maybe decrease num_steps_per_env to 8 or 16, and try decreasing num_envs when you start training.
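The suggestions above amount to overriding a few config fields. A minimal sketch, assuming the nested config-class style the thread refers to (`class env`, `class terrain`): the class names `BaseCfg`, `BaseCfgPPO`, `SmallGpuCfg`, and `SmallGpuCfgPPO`, the placement of `num_steps_per_env` under `runner`, and every numeric value are illustrative assumptions, not the repository's defaults.

```python
class BaseCfg:
    """Hypothetical stand-in for the project's environment config."""

    class env:
        num_envs = 4096  # placeholder value, not the repo's default

    class terrain:
        num_rows = 10  # placeholder values
        num_cols = 20


class BaseCfgPPO:
    """Hypothetical stand-in for the project's training config."""

    class runner:
        num_steps_per_env = 24  # placeholder value


class SmallGpuCfg(BaseCfg):
    """Override only the fields that drive memory use."""

    class env(BaseCfg.env):
        num_envs = 256  # fewer parallel robots

    class terrain(BaseCfg.terrain):
        num_rows = 4  # smaller terrain grid
        num_cols = 4


class SmallGpuCfgPPO(BaseCfgPPO):
    class runner(BaseCfgPPO.runner):
        num_steps_per_env = 16  # shorter rollouts, as suggested (8 or 16)
```

Subclassing the nested classes keeps every other setting inherited, so only the memory-related fields differ from the base config.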

EscorpioZzz commented 5 months ago

Oh! Thanks for your recommendations! I think I can run this project now. By the way, my GPU is an Nvidia 3060 with 6GB memory. Can it run this project well? I am finishing my undergraduate thesis, so this is important to me. Thanks again for your great work!

ZiwenZhuang commented 5 months ago

The lack of GPU memory could lower your sample efficiency, thus increasing the time of your experiment iterations.

I recommend renting at least one 3090 GPU or applying for access to an older but functioning GPU cluster at your school, e.g. P40 or A40 GPUs.
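One way to read the point about memory and experiment time, as a rough sketch: with fewer parallel environments and shorter rollouts, each training iteration gathers fewer transitions, so reaching a fixed sample budget takes more iterations. The function names, the sample budget, and the two settings compared below are illustrative assumptions, not measurements from this project.

```python
import math


def samples_per_iteration(num_envs, num_steps_per_env):
    """Transitions collected in one rollout phase."""
    return num_envs * num_steps_per_env


def iterations_for_budget(sample_budget, num_envs, num_steps_per_env):
    """Iterations needed to collect at least `sample_budget` transitions."""
    per_iter = samples_per_iteration(num_envs, num_steps_per_env)
    return math.ceil(sample_budget / per_iter)


# Hypothetical large-memory setting vs. small-memory setting.
large = iterations_for_budget(1_000_000, num_envs=4096, num_steps_per_env=24)
small = iterations_for_budget(1_000_000, num_envs=256, num_steps_per_env=16)
print(large, small)  # 11 245
```

This counts iterations only. Wall-clock time per iteration also differs between GPUs, so the numbers are a guide to the trend rather than a prediction.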

EscorpioZzz commented 5 months ago

I followed your recommendation and the project now runs well. But I have hit another issue: when I played it, it reported Segmentation fault (core dumped), even though I did not change anything in your code. Thanks for your great work! I am doing similar work, and it really helps me a lot!

ZiwenZhuang commented 5 months ago

It seems you are using a 3060 laptop? What is your OS? It can't be Windows with WSL or Docker. I tried but gave up. lol

It is probably because the graphics require something like Vulkan, or it simply ran out of memory. Make sure your OS has proper display support (both hardware and software).

EscorpioZzz commented 5 months ago

Yes but I rented a 3090 GPU. My OS is Linux, Ubuntu 20.04.

ZiwenZhuang commented 5 months ago

Is it a cloud service virtual machine? play.py requires a display. You can try the official example script, e.g. isaacgym_preview4/python/examples/joint_monkey.py
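A quick pre-flight check along these lines, as a sketch: on Linux, an X11 session normally exports the `DISPLAY` environment variable, so its absence is a strong hint that a viewer window cannot open. The helper name `has_display` is hypothetical, and a set `DISPLAY` is a necessary sign rather than proof that rendering will work.

```python
import os


def has_display(env=None):
    """Return True if the environment advertises an X11 display."""
    env = os.environ if env is None else env
    return bool(env.get("DISPLAY"))


if not has_display():
    print("No DISPLAY found: expect viewer-based scripts to fail; use --headless.")
```

On a rented server without a GUI this prints the warning, which matches the situation where training with --headless works but playing does not.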

EscorpioZzz commented 5 months ago

I did play it at the very beginning. At that time I decreased num_envs in class env and num_rows, num_cols in class terrain, and played it on my own GPU. It ran, but my robots didn't learn anything. Then I followed another issue and solved that problem. However, now I cannot play it anymore.

EscorpioZzz commented 5 months ago

> Is it a cloud service virtual machine? play.py requires a display. You can try the official example script, e.g. isaacgym_preview4/python/examples/joint_monkey.py

It is on AutoDL, but not a virtual machine.

ZiwenZhuang commented 5 months ago

So, which machine do you run play.py on? Can you successfully run the joint_monkey.py example from isaacgym?

EscorpioZzz commented 5 months ago

Oh, I cannot run the example. I run play.py on my own laptop.

EscorpioZzz commented 5 months ago

Could it be because my GPU and my graphics card are mismatched?

EscorpioZzz commented 5 months ago

Sorry for disturbing you so many times. I have now found the problem: the server I rented has no GUI installed. So I can run isaacgym with --headless but cannot play it. How can I solve this? Thanks so much!

ZiwenZhuang commented 5 months ago

You may use Nvidia docker. Please refer to the docker folder in isaacgym.

EscorpioZzz commented 3 months ago

I'm sorry to disturb you again, but I have another question. I cannot find "A1FieldDistillCfgPPO.runner.pretrain_dataset.data_dir" in a1_field_distill_config.py. So how should I run data collection?

ZiwenZhuang commented 3 months ago

Oh, for data_dir, it is for training from a pre-collected trajectory. Maybe check out how scan_dir is used in the runner and the dataset.

To trigger the data collection process, please check scripts/collect.py, which is an example of how to start the data collection. You can either collect the trajectory entirely from the teacher policy or collect the trajectory with action labels in the DAgger manner.
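The difference between the two collection modes can be illustrated with a toy sketch. This is not the repository's scripts/collect.py: the policies, the 1-D dynamics, and the `collect` function are all hypothetical stand-ins. In teacher-only collection the teacher's actions drive the robot and serve as labels. In the DAgger manner the student's actions drive the robot, while the teacher only supplies the action label for each visited state.

```python
def teacher_policy(obs):
    """Toy teacher: push the state toward zero."""
    return -1.0 if obs > 0 else 1.0


def student_policy(obs):
    """Toy untrained student: always pushes in the positive direction."""
    return 1.0


def step(obs, action):
    """Toy 1-D dynamics standing in for the simulator."""
    return obs + 0.5 * action


def collect(num_steps, use_dagger, start_obs=2.0):
    """Return a list of (obs, teacher_action_label) pairs.

    use_dagger=False: the teacher's action is executed and recorded.
    use_dagger=True:  the student's action is executed; the teacher labels.
    """
    data, obs = [], start_obs
    for _ in range(num_steps):
        label = teacher_policy(obs)
        data.append((obs, label))
        executed = student_policy(obs) if use_dagger else label
        obs = step(obs, executed)
    return data


print(collect(3, use_dagger=False))  # [(2.0, -1.0), (1.5, -1.0), (1.0, -1.0)]
print(collect(3, use_dagger=True))   # [(2.0, -1.0), (2.5, -1.0), (3.0, -1.0)]
```

The labels are identical in form, but the DAgger run records the states the student actually reaches, which is what lets the student learn to recover from its own mistakes.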