Closed: abcamiletto closed this issue 10 months ago.
Hi @abcamiletto, finally got some time to look into this. This seems to be a problem with `vmap`, but I'm actually a bit confused about what you are trying to do in this example. Why the need to modify the parameter's `.data` value directly? The typical use case is that one keeps this constant during torch's forward pass, and then torch modifies the parameters' values in the backward pass.
I do agree this example does not really make sense.
But imagine a more complex component where the variable we want to optimize is an `nn.Parameter` of an `nn.Module`. This can happen in many ways, either with custom modules or with off-the-shelf ones (e.g. `nn.Conv2d`).
The first solution one (or at least I) would think of is plugging this parameter in and out of the component at the beginning and end of the objective function (hence the `.data` assignment) and using the component as one always would.
This of course cannot work, due to Theseus's reliance on `torch.vmap`. The workaround is very simple in most cases (use the functional `F.conv2d` instead of the `nn.Module`), but it is not straightforward to find given the errors that are raised on the first attempt.
That was my experience, at least, and I had to dig through the Theseus internals a bit to understand what was happening.
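To make the pattern concrete, here is a minimal sketch (hypothetical names, written for a single sample rather than a Theseus batch) contrasting the `.data` assignment that trips up `vmap` with the functional workaround:

```python
import torch
import torch.nn.functional as F

conv = torch.nn.Conv2d(1, 1, kernel_size=3, bias=False)

def err_via_module(kernel, image):
    # Pattern from the issue: write the optimization variable into the
    # module's nn.Parameter, then call the module as usual.
    # The in-place .data assignment is what breaks under torch.vmap.
    conv.weight.data = kernel.view(1, 1, 3, 3)
    return conv(image).flatten(start_dim=1)

def err_functional(kernel, image):
    # Workaround: use the functional form, so the kernel stays an ordinary
    # tensor that vmap can trace through.
    weight = kernel.view(1, 1, 3, 3)
    return F.conv2d(image, weight).flatten(start_dim=1)

kernel, image = torch.randn(9), torch.randn(1, 1, 8, 8)
assert torch.allclose(err_via_module(kernel, image), err_functional(kernel, image))
```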
Thanks for your response @abcamiletto. To be sure I understand, is your claim that using an `nn.Parameter` as an optimization variable results in an error when using `vmap`?
BTW, using `vmap` is not mandatory. You can set `autograd_mode="dense"` in the `AutoDiffCostFunction` to avoid using `vmap` (but it will probably be much slower).
That being said, I'm still curious about your use case.
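A minimal sketch of where that flag goes, assuming the usual `AutoDiffCostFunction` constructor with a toy cost function (names are made up for illustration):

```python
import torch
import theseus as th

def err_fn(optim_vars, aux_vars):
    x, = optim_vars
    y, = aux_vars
    return x.tensor - y.tensor  # toy residual of dimension 2

x = th.Vector(2, name="x")                    # optimization variable
y = th.Variable(torch.zeros(1, 2), name="y")  # auxiliary variable

# autograd_mode="dense" (or "loop_batch") sidesteps vmap at the cost of speed.
cost = th.AutoDiffCostFunction([x], err_fn, 2, aux_vars=[y], autograd_mode="dense")
```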
> Thanks for your response @abcamiletto. To be sure I understand, is your claim that using an `nn.Parameter` as an optimization variable results in an error when using `vmap`?
Yes, in fact the example that I posted crashes with some obscure error.
> BTW, using `vmap` is not mandatory. You can set `autograd_mode="dense"` in the `AutoDiffCostFunction` to avoid using `vmap` (but it will probably be much slower). That being said, I'm still curious about your use case.
That is interesting. Does `autograd_mode="dense"` just run a simple for loop? Or where is the speed loss coming from? @luisenp
I am developing a robotics library to do some kinematic and dynamic computations differentiably. For the sake of a simpler API, the joints are `nn.Module`s and keep their values as `nn.Parameter`s.
This allows me to hide the complexity of joint parametrization from the outside world (Euler angles? quaternions? 6D rotation? it doesn't matter), but it also broke down when I first tried to use Theseus with it. I had to redesign it so that the joint value could also be provided from the "outside".
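A rough sketch of that kind of redesign (purely illustrative, not the actual library code): the joint keeps its `nn.Parameter` for everyday use, but its forward pass can also take the value from the outside, e.g. from a Theseus variable's tensor.

```python
import torch
import torch.nn as nn
from typing import Optional

class RevoluteJoint(nn.Module):
    """Hypothetical joint whose value can also be injected from outside."""

    def __init__(self) -> None:
        super().__init__()
        self.q = nn.Parameter(torch.zeros(1))  # internal joint angle

    def forward(self, q: Optional[torch.Tensor] = None) -> torch.Tensor:
        q = self.q if q is None else q  # an external value bypasses the parameter
        c, s = torch.cos(q), torch.sin(q)
        # 2D rotation as a stand-in for the real joint transform
        return torch.stack([torch.cat([c, -s]), torch.cat([s, c])])
```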
As a last note: I really am a big fan of this library. I think it has great potential and plenty of applications. Thank you so much for developing it.
Thanks for your kind words, glad you like the library!
Setting `autograd_mode="dense"` relies on the original `torch.autograd.functional.jacobian` implementation, which unnecessarily computes gradients over the batch dimension that we then have to filter out (while `vmap` directly computes per-sample gradients); this is why we call it "dense". We also have a version that loops over the batch, which you can use with `autograd_mode="loop_batch"`. Both are much slower than using `"vmap"`.
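A small illustration of the difference (outside of Theseus, just to show the shapes involved): the dense Jacobian of a batched function contains cross-sample blocks that are all zero, whereas `vmap` over `jacrev` yields per-sample Jacobians directly.

```python
import torch

def err_fn(x):
    return x ** 2  # elementwise residual

batch = torch.randn(4, 3)

# "dense"-style: Jacobian of the whole batched function, shape (4, 3, 4, 3);
# the off-diagonal (cross-sample) blocks are zero and must be filtered out.
dense_jac = torch.autograd.functional.jacobian(err_fn, batch)

# vmap-style: per-sample Jacobians directly, shape (4, 3, 3).
per_sample_jac = torch.vmap(torch.func.jacrev(err_fn))(batch)

print(dense_jac.shape, per_sample_jac.shape)
```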
One part of the `nn.Parameter` use case I'm confused about is that it looks like you are optimizing the same set of tensors at both optimization levels (outer level with torch, inner level with Theseus), which means the two optimizers are competing with each other, and this seems a bit strange. This is not a case we have ever considered. In all of our examples, the `nn.Parameter`s optimized by torch (e.g., via Adam) are used either as auxiliary variables or as secondary quantities used to compute cost functions and cost weights. For this use case, we have tested extensively with very complicated `nn.Module` objects as part of the cost function computation without major issues.
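To illustrate the setup being described (a sketch under the assumption of the standard `TheseusLayer` API; variable names are made up): the torch-side `nn.Parameter` only feeds an auxiliary variable, while Theseus optimizes its own variable in the inner loop.

```python
import torch
import theseus as th

x = th.Vector(1, name="x")                    # inner variable, optimized by Theseus
a = th.Variable(torch.zeros(1, 1), name="a")  # auxiliary variable, fed by torch

def err_fn(optim_vars, aux_vars):
    (x,), (a,) = optim_vars, aux_vars
    return x.tensor - a.tensor

objective = th.Objective()
objective.add(th.AutoDiffCostFunction([x], err_fn, 1, aux_vars=[a]))
layer = th.TheseusLayer(th.GaussNewton(objective, max_iterations=10))

theta = torch.nn.Parameter(torch.ones(1, 1))  # outer parameter, optimized by Adam
outer_opt = torch.optim.Adam([theta], lr=1e-2)

outer_opt.zero_grad()
values, info = layer.forward({"a": theta, "x": torch.zeros(1, 1)})
loss = values["x"].square().sum()             # outer loss on the inner solution
loss.backward()
outer_opt.step()
```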
I see; now I get the point. Thanks for the answer.
In my application, I am not optimizing over the same set of tensors at the two levels. It just so happens that, due to how the library was designed, the quantity I want to optimize is stored as an `nn.Parameter`. This made things a bit impractical (and raised very confusing errors).
🐛 Bug
A weird segmentation fault can arise when optimizing a function that goes through an `nn.Parameter`.
The problem may lie in Theseus itself or in `torch.vmap`; it is quite hard to debug.
I found a workaround but figured I would post the reproducible bug anyway.
Thanks in advance for looking into this and for the fantastic library!