andyzeng / visual-pushing-grasping

Train robotic agents to learn to plan pushing and grasping actions for manipulation with deep reinforcement learning.
http://vpg.cs.princeton.edu/
BSD 2-Clause "Simplified" License

support-pytorch-v0.4 branch: CUDA out of memory with 1080Ti 11GB memory #12

Closed Kelvinson closed 5 years ago

Kelvinson commented 5 years ago

Hi, thanks for your great work. I want to run the code with the following setup:

- branch: support-pytorch-v0.4
- Python: Anaconda Python 3.6.5
- PyTorch: 1.0.0
- V-REP: v3.6
- GPU: 1080Ti, 11 GB memory
- OS: Ubuntu 16.04

I have 11 GB of GPU memory but still run into the GPU out-of-memory problem.

(screenshot: run_out_of_memory)

Could you please tell me what I am missing? Thanks!

marvision-ai commented 5 years ago

I am having the same issue! Have you resolved this?

Kelvinson commented 5 years ago

Nope. I hope someone can patch it to be compatible with the latest version of PyTorch.

marvision-ai commented 5 years ago

It seems that when I run it, it fails at the following line in `trainer.py`:

```python
output_prob, state_feat = self.model.forward(input_color_data, input_depth_data, is_volatile, specific_rotation)
```

It calls this function and fails:

```python
def forward(self, input_color_data, input_depth_data, is_volatile=False, specific_rotation=-1):
```

I am not sure why I do not have enough memory for this.

md3011 commented 5 years ago

I am facing the same issue. Were you able to resolve this?

marvision-ai commented 5 years ago

No I was not able to.

st2yang commented 5 years ago

Changing `torch.no_grad()` to `with torch.no_grad():` and putting all the following contents in its block might fix the issue. (The program can run, but I have not fully tested it.)

https://github.com/andyzeng/visual-pushing-grasping/blob/support-pytorch-v0.4/models.py#L218
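A minimal sketch of what that change accomplishes, using a toy network rather than the actual `models.py` code (the names `net`, `x`, and `num_rotations` below are illustrative): wrapping the inference-mode forward passes in `with torch.no_grad():` keeps PyTorch from building an autograd graph for each of the 16 rotations.

```python
import torch
import torch.nn as nn

# Toy stand-in for the VPG network; illustrative only.
net = nn.Linear(8, 8)
x = torch.randn(1, 8)
num_rotations = 16  # inference mode evaluates 16 rotated inputs

# Without the fix: every output keeps an autograd graph alive,
# so memory grows with the number of rotations.
outputs_with_grad = [net(x) for _ in range(num_rotations)]

# With the fix: no graph is built, so memory stays flat.
with torch.no_grad():
    outputs_no_grad = [net(x) for _ in range(num_rotations)]

print(outputs_with_grad[0].requires_grad)  # True
print(outputs_no_grad[0].requires_grad)    # False
```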

marvision-ai commented 5 years ago

@st2yang thanks for replying! Sorry, but I do not understand the part about changing `torch.no_grad()` to `with torch.no_grad():`.

I also do not understand what you mean by putting it all in a block. Do you mind elaborating? Thank you :smile:

st2yang commented 5 years ago

In PyTorch 0.4.0 and later, the `volatile` flag is deprecated and `no_grad()` is used to disable gradient tracking. But it needs to be used as a context manager, like this:

```python
x = torch.zeros(1, requires_grad=True)
with torch.no_grad():
    y = x * 2
print(y.requires_grad)  # False
```

If you run it the other way:

```python
x = torch.zeros(1, requires_grad=True)
torch.no_grad()  # bare call: the context manager is created but never entered
y = x * 2
print(y.requires_grad)  # True
```

you will find that `y.requires_grad` is still `True`.
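Why the bare call does nothing can be shown without torch at all. The sketch below uses a hypothetical `no_grad`-style context manager built with the standard library: constructing a context manager object does not run its enter/exit hooks unless you use `with`.

```python
from contextlib import contextmanager

# Toy stand-in for torch's grad-enabled flag; illustrative only.
state = {'grad_enabled': True}

@contextmanager
def no_grad():
    prev = state['grad_enabled']
    state['grad_enabled'] = False      # entered: disable "grad"
    try:
        yield
    finally:
        state['grad_enabled'] = prev   # exited: restore previous setting

no_grad()  # bare call: the manager is created but __enter__ never runs
print(state['grad_enabled'])  # True -- nothing changed

with no_grad():
    print(state['grad_enabled'])  # False -- disabled inside the block
print(state['grad_enabled'])  # True -- restored on exit
```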

The reason for this issue in Andy's code is that it does not disable gradient computation in "inference" mode (when `is_volatile=True`), and there are 16 rotations in "inference", so the gradient bookkeeping blows up memory. That is my understanding.

Kelvinson commented 5 years ago

@st2yang thanks for the pointer, I guess that's probably the reason. I can do a thorough migration to the latest version and send a PR. BTW, do you think that change will require completely retraining the model, since it might change the underlying computation graph?

st2yang commented 5 years ago

That would be great, if you are willing to migrate the project. I would recommend trying to migrate to PyTorch 0.4.1 first and fully testing it. Migrating to 0.4.1 should be much simpler, there is a pretty detailed official migration guide, and 0.4.1 is still widely used.

I think the provided weight file can still be used, at least for 0.4.1. The computation graph wouldn't change, and the weight file is saved in CPU mode, so it probably wouldn't cause a dependency/library version conflict.
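For reference, loading a CPU-saved checkpoint into a fresh model is version-portable via `map_location`. A minimal sketch, using an in-memory buffer and a toy `Linear` layer in place of the actual VPG weight file:

```python
import io
import torch
import torch.nn as nn

# Save a CPU-mode state dict (the buffer stands in for the checkpoint file).
net = nn.Linear(4, 4)
buf = io.BytesIO()
torch.save(net.state_dict(), buf)

# Load it into a freshly constructed model, forcing tensors onto the CPU
# regardless of which device they were saved from.
buf.seek(0)
state = torch.load(buf, map_location='cpu')
net2 = nn.Linear(4, 4)
net2.load_state_dict(state)

print(torch.equal(net.weight, net2.weight))  # True
```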

gsbandy24 commented 3 years ago

Does this occur for you when you attempt to train the VPG policy or the reactive method?

I don't have this error when training VPG, but when I try to train the baseline "reactive" method, I get this CUDA memory error. I just want to see whether my issue is the same as yours or slightly different.