MIT-REALM / neural_clbf

Toolkit for learning controllers based on robust control Lyapunov barrier functions
BSD 3-Clause "New" or "Revised" License

Fixing Inplace Error that Made Weird Autograd Errors #10

Closed kwesiRutledge closed 1 year ago

kwesiRutledge commented 1 year ago

Finally fixed the weird errors that autograd was throwing about in-place operations. After changing these inplace operations in the NeuralCLBFController() class, the example train_inverted_pendulum.py was able to run with (mostly) no errors.
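The changes themselves are small: augmented assignments on tensors that autograd may still need are replaced with out-of-place equivalents. A minimal sketch of the pattern (the function and variable names below are made up for illustration; this is not the actual NeuralCLBFController code):

```python
import torch

def weighted_residual(residual: torch.Tensor, weight: float) -> torch.Tensor:
    # Before (in-place): `residual *= weight` mutates a tensor that may have
    # been saved by autograd for the backward pass, bumping its version counter.
    # residual *= weight

    # After (out-of-place): allocate a new tensor and leave the saved one alone.
    residual = residual * weight
    return residual
```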

Feel free to reject these if you don't think that they are useful @dawsonc !

dawsonc commented 1 year ago

Hi Kwesi! Thanks for the PR! Sorry it took me so long to realize it was here!

Which autograd errors does this fix? Could you please make an issue on this repo with the errors you're getting and the steps to reproduce? Context for my confusion: train_inverted_pendulum.py runs with no errors on my laptop 🤔 . AFAIK the +=, *=, etc. operators aren't in-place operations; they're sugar for x += y => x = x + y. Maybe there's something subtle going on that makes this run differently between machines.

Also, this PR looks like it includes all of the code for the new pusher-slider example. That's great, but it should probably go in a different PR (e.g. 1 for autograd errors, 1 to add the pusher slider).

kwesiRutledge commented 1 year ago

Lots of interesting things here. Thanks for your thorough response:

> Which autograd errors does this fix? Could you please make an issue on this repo with the errors you're getting and the steps to reproduce?

The specific autograd error is the following:

```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [64]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Epoch 0:   0%|          | 0/157 [00:02<?, ?it/s]
```

I've included the extra progress-bar line to show that this error occurs before the first epoch even begins.

I can create an issue later today (or tomorrow, depending on when this plane lands)! This error appeared for me when running any of the training scripts in the train directory. I originally thought it might be caused by the Pusher-Slider code, but I tried the existing scripts as well and got the same error.

> Context for my confusion: train_inverted_pendulum.py runs with no errors on my laptop 🤔 . AFAIK the +=, *=, etc. operators aren't in-place operations; they're sugar for x += y => x = x + y. Maybe there's something subtle going on that makes this run differently between machines.

It's possible the problem is machine-dependent. This forum post leads me to believe that += and *= are in-place operations, though: discuss.pytorch. Regardless of whether or not they are, I'm very curious why PyTorch doesn't complain on your machine. I'm running the newest versions of the libraries; are any of yours older?
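For what it's worth, here's a standalone snippet (not from this repo) that should reproduce the same error and show that += on a tensor writes into the same storage rather than rebinding the name:

```python
import torch

x = torch.ones(64, requires_grad=True)
y = torch.relu(x)              # ReLU saves its output `y` for the backward pass
before = y.data_ptr()

y += 1.0                       # Tensor.__iadd__: writes into the SAME storage
assert y.data_ptr() == before  # no new tensor was allocated, so this was in-place

# backward() now fails with the error above, because the tensor saved by
# ReluBackward0 has been modified (version 1, expected version 0):
y.sum().backward()
```

Replacing the augmented assignment with `y = y + 1.0` allocates a fresh tensor, and the same script runs without complaint.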

> Also, this PR looks like it includes all of the code for the new pusher-slider example. That's great, but it should probably go in a different PR (e.g. 1 for autograd errors, 1 to add the pusher slider).

This is super surprising! GitHub marked the Pusher-Slider changes as having been committed 2 weeks ago, but I committed them yesterday! Now I know that GitHub includes ALL changes on a branch in a pull request (not just the ones that existed when the pull request was created). I'll make a note to fix this.

A possible compromise: if you don't see the issue on your machine with updated libraries, feel free to close/archive this pull request, which will keep the discussion visible to any M1 Mac users who run into the same issues. :)

dawsonc commented 1 year ago

I'm open to keeping this PR open and tracking down this inplace bug. The first step would be to make an issue that includes steps to reproduce and the full stack trace that the error prints out. Even if it works on my machine, the fact that it doesn't work on your machine is troubling and we should try to dig into it.
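One generic debugging aid for that issue (not specific to this repo): PyTorch's anomaly detection augments the error with the forward-pass traceback of the operation that produced the modified tensor, which usually points straight at the offending line.

```python
import torch

# Turn this on at the top of the training script while debugging.  When
# backward() hits the in-place error, PyTorch also prints the forward-pass
# traceback of the op that created the offending tensor.  It slows training
# noticeably, so remove it once the bug is found.
torch.autograd.set_detect_anomaly(True)
```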

kwesiRutledge commented 1 year ago

This pull request was referenced in a new issue, following Charles' guidelines. This PR will be closed and a new one will be created containing only the proposed "inplace" changes.