ARISE-Initiative / robosuite-benchmark

Benchmarking Repository for robosuite + SAC

RuntimeError while running training script. #1

Closed: harishbommakanti closed this issue 3 years ago

harishbommakanti commented 3 years ago

Hi,

I went through the steps as indicated in the README of the main page and wanted to run an experiment using python scripts/train.py. The same error also pops up when I use the --variant flag and try other specific configurations. Attached is the log printed to stdout. A large JSON is printed first, which I think is fine since that should be the configuration of the experiment being run, but at the end I get the following message:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [256, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I haven't made any changes to the directory structure or code except moving utils into scripts to make the training script work. I'm also very new to PyTorch, so I'm not sure if the error message hints at something wrong with my configuration or with the logic of the code. Any help would be appreciated! For reference, I'm running this on an Ubuntu 20.04 machine.
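Following the hint in the error message, I can enable anomaly detection to get a stack trace for the failing op. The exact placement is my guess; something like this near the top of scripts/train.py, before training starts:

import torch

# Debugging only (it slows training): report the forward op whose backward fails.
# Assumption: this is called once, before the training loop begins.
torch.autograd.set_detect_anomaly(True)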

rb_bench_errormessage.txt

cremebrule commented 3 years ago

Hmmmm, it sounds like a PyTorch-related issue. Which version are you currently running? Can you verify that it's 1.4.0?
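A quick way to check (just a generic one-liner, nothing repo-specific):

import torch

print(torch.__version__)  # the benchmark setup expects 1.4.0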

harishbommakanti commented 3 years ago

Oh, it looks like that was the problem. I had version 1.6.0 installed, but it worked once I switched to 1.4.0. Thanks!

nitishsrivastava commented 3 years ago

I ran into the same error. pytorch>=1.5 has stricter detection of in-place modifications made by optimizer steps (https://github.com/pytorch/pytorch/issues/39141#issuecomment-636881953), and this stricter check catches a small bug in the rlkit commit pinned by git reset --hard f136e140a57078c4f0f665051df74dffb1351f33, in file rlkit/torch/sac/sac.py:

139         self.qf2_optimizer.zero_grad()
140         qf2_loss.backward()
141         self.qf2_optimizer.step()  # <-- updates critic weights
142
143         self.policy_optimizer.zero_grad()
144         policy_loss.backward()  # needs original critic weights to backprop, causes RuntimeError in pytorch>=1.5
145         self.policy_optimizer.step()

Switching to pytorch 1.4 makes the error go away, but it does not fix the underlying bug: the policy gradient is then silently computed with critic weights that have already been updated in place. I solved the problem by reordering the updates relative to the loss computation, roughly as sketched below.
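This is not my exact diff, just a sketch of the idea. It assumes qf1_loss, qf2_loss, and policy_loss have already been computed as before, and that the Q targets are detached (as in the pinned rlkit version), so the critic losses do not backprop through the policy:

# Backprop the policy loss (which flows through qf1/qf2) before the critic
# optimizers modify the critic weights in place.
self.policy_optimizer.zero_grad()
policy_loss.backward()
self.policy_optimizer.step()

self.qf1_optimizer.zero_grad()
qf1_loss.backward()
self.qf1_optimizer.step()

self.qf2_optimizer.zero_grad()
qf2_loss.backward()
self.qf2_optimizer.step()

With this ordering, policy_loss.backward() still sees the critic weights at the versions recorded during the forward pass, so the version check in pytorch>=1.5 no longer trips.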

rojas70 commented 3 years ago

@nitishsrivastava did you open a pull request for the fix?