Ericonaldo / visual_wholebody

Train a loco-manipulation dog with RL
https://wholebody-b1.github.io/

error in high level student policy #8

Open whn981841576 opened 3 days ago

whn981841576 commented 3 days ago

I set the teacher checkpoint in the args, but there is an issue:

Traceback (most recent call last):
  File "train_multi_bc_deter.py", line 404, in <module>
    trainer.train()
  File "/home/scq/pycharm_project/visual_wholebody-main/high-level/learning/dagger_trainer.py", line 70, in train
    self.single_agent_train()
  File "/home/scq/pycharm_project/visual_wholebody-main/high-level/learning/dagger_trainer.py", line 140, in single_agent_train
    self.agents.record_transition(student_obs=student_obs,
  File "/home/scq/pycharm_project/visual_wholebody-main/high-level/learning/dagger_rnn.py", line 238, in record_transition
    self.memory.add_samples(student_obs=student_obs, teacher_obs=teacher_obs, actions=actions, teacher_actions=teacher_actions, rewards=rewards,
  File "/home/scq/pycharm_project/visual_wholebody-main/third_party/skrl/skrl/memories/torch/base.py", line 266, in add_samples
    self.tensors[name][self.memory_index].copy_(tensor)
TypeError: copy_(): argument 'other' (position 1) must be Tensor, not int

hatimwen commented 3 days ago

use PyTorch with version 2.X like 2.1.2
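
A quick way to confirm which PyTorch version an environment actually has is sketched below. The helper name `is_torch_2x` is hypothetical and not part of the repository. It only inspects a version string of the kind `torch.__version__` returns, and it assumes that "2.X" means any release with major version 2.

```python
def is_torch_2x(version: str) -> bool:
    """Return True if a PyTorch version string has major version 2.

    Hypothetical helper for illustration; not part of visual_wholebody.
    """
    # Drop a local build suffix such as "+cu118", then read the major component.
    release = version.split("+", 1)[0]
    major = release.split(".", 1)[0]
    return major.isdigit() and int(major) == 2


# Usage (requires PyTorch to be installed in the active environment):
#   import torch
#   print(torch.__version__, is_torch_2x(torch.__version__))
```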

whn981841576 commented 2 days ago

> use PyTorch with version 2.X like 2.1.2

Thank you for your suggestion. After switching to PyTorch 2.1.2 that problem is solved, but there is a new issue:

  File "train_multi_bc_deter.py", line 404, in <module>
    trainer.train()
  File "/home/scq/pycharm_project/visual_wholebody-main/high-level/learning/dagger_trainer.py", line 70, in train
    self.single_agent_train()
  File "/home/scq/pycharm_project/visual_wholebody-main/high-level/learning/dagger_trainer.py", line 153, in single_agent_train
    self.agents.post_interaction(timestep=timestep, timesteps=self.timesteps)
  File "/home/scq/pycharm_project/visual_wholebody-main/high-level/learning/dagger_rnn.py", line 265, in post_interaction
    self._update(timestep, timesteps)
  File "/home/scq/pycharm_project/visual_wholebody-main/high-level/learning/dagger_rnn.py", line 338, in _update
    (dagger_loss + entropy_loss).backward()
  File "/home/scq/anaconda3/envs/ava/lib/python3.8/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/scq/anaconda3/envs/ava/lib/python3.8/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 792.00 MiB. GPU 0 has a total capacty of 23.68 GiB of which 866.69 MiB is free. Including non-PyTorch memory, this process has 22.47 GiB memory in use. Of the allocated memory 10.16 GiB is allocated by PyTorch, and 1.92 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

What puzzles me is that training the teacher policy did not run out of CUDA memory. My graphics card is a 3090, and I have already reduced the number of environments from 10240 to 4096.
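
The numbers in the out-of-memory message can be broken down with a little arithmetic. The sketch below only uses figures copied from the error text. The variable names are made up for illustration, and the reading that the remainder belongs to non-PyTorch parts of the same process (for example the simulator or the CUDA context) is an assumption, not something the thread confirms.

```python
# Figures copied from the OutOfMemoryError message, all in GiB.
total_capacity = 23.68
process_in_use = 22.47              # "this process has 22.47 GiB memory in use"
torch_allocated = 10.16             # "allocated by PyTorch"
torch_reserved_unallocated = 1.92   # "reserved by PyTorch but unallocated"

# Memory accounted for by PyTorch's caching allocator.
torch_total = torch_allocated + torch_reserved_unallocated

# Remainder held by this process outside PyTorch's allocator
# (assumed here to be the simulator, CUDA context, and similar).
non_torch = process_in_use - torch_total

print(f"PyTorch allocator: {torch_total:.2f} GiB")
print(f"Non-PyTorch memory in this process: {non_torch:.2f} GiB")
```

Under that reading, roughly 12 GiB is PyTorch and roughly 10 GiB is held elsewhere in the same process. Since the reserved-but-unallocated share is under 2 GiB, the `max_split_size_mb` hint in the message seems unlikely to recover much on its own.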

whn981841576 commented 2 days ago

> use PyTorch with version 2.X like 2.1.2

When I run nvidia-smi, it shows that I only have half of my memory available.

hatimwen commented 2 days ago

My device is also a 3090 and it works, so I suggest you check whether other processes are still alive and holding GPU memory.

Btw, what about the performance of your trained teacher policy?
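
One way to check for leftover processes on the GPU, as suggested above, is to query `nvidia-smi` for its compute-app list. This is a minimal sketch: the function names `parse_compute_apps` and `list_gpu_processes` are hypothetical, and it assumes `nvidia-smi` is on the `PATH` and supports the `--query-compute-apps` CSV output.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text: str):
    """Parse `pid, process_name, used_memory` CSV rows into tuples.

    Rows that do not have three fields or whose numeric fields are not
    plain integers (e.g. "[N/A]") are skipped.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue
        pid, name, used_mib = parts
        if not (pid.isdigit() and used_mib.isdigit()):
            continue
        rows.append((int(pid), name, int(used_mib)))
    return rows


def list_gpu_processes():
    """Return (pid, name, used MiB) for processes using the GPU, or []."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools available on this machine
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_compute_apps(result.stdout)


# Usage:
#   for pid, name, used_mib in list_gpu_processes():
#       print(pid, name, f"{used_mib} MiB")
```

Any PID listed here that is not the current training run, such as a stale earlier run, would be a candidate to stop before retrying.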