Closed Kermalis closed 3 months ago
Huh!
That's an exception coming from deep inside libtorch, and those are often "nonsensical," since they report an error in the context of the function that is called in the implementation, not at the highest-level API. I'll take a look at MultivariateNormal, but it would help if your code could be pared down to a small test with just the relevant tensors. Since this is a shape-related exception, what's inside the tensors probably doesn't matter, so random values are fine.
I'll try to do that if I don't find anything obvious, but if you have time to narrow it down and post the smaller code, that would speed things up.
In the meantime, you may be able to work around it by using another continuous distribution. It'll produce different results, no doubt, but not invalidate the model completely, I don't think.
I've written one, but I've got a different exception (System.Runtime.InteropServices.ExternalException:“shape '[3, 3]' is invalid for input of size 3”):
using TorchSharp.Modules;
using TorchSharp;
var actionMean = torch.tensor((double[])[0.2025, -0.0714, 0.1417]);
var covMat = torch.tensor(new double[,]
{
{ 0.36, 0, 0 },
{ 0, 0.36, 0 },
{ 0, 0, 0.36 },
});
var dist = new MultivariateNormal(actionMean, covariance_matrix: covMat);
torch.Tensor action = dist.sample();
torch.Tensor actionLogProb = dist.log_prob(action);
I would also try this in PyTorch later.
Yes, it works in PyTorch:
import torch
actionMean = torch.tensor([-0.019071, -0.026232, -0.074312])
covMat = torch.tensor([[0.36, 0, 0], [0, 0.36, 0], [0, 0, 0.36]])
dist = torch.distributions.MultivariateNormal(actionMean, covariance_matrix=covMat)
action = dist.sample()
actionLogProb = dist.log_prob(action)
print(actionLogProb) # tensor(-1.5809)
@yueyinqiu, your covMat
is missing a 1-sized dimension, I believe.
It's line 221:
( In Pytorch it's:
)
(TakeAllBut
works like bx.shape[:-outer_batch_dims]
, so their is "an extra -
" in TorchSharp.)
@yueyinqiu, your
covMat
is missing a 1-sized dimension, I believe.
Ah... yes, and it could also cause the problem.
That caused the different exception, but there's a bug somewhere in BatchMahalanobis
If anyone would like to help me stare at the code for a while, you are more than welcome to! :-)
@NiklasGustafsson
I have found the problem but I don't have much time to fix and test for it currently... https://github.com/dotnet/TorchSharp/issues/1334#issuecomment-2181206978
Line #221:
var bx_new_shape = TakeAllBut(bx.shape, outer_batch_dims).ToList();
should be:
var bx_new_shape = bx.shape.Take(outer_batch_dims).ToList();
It's probably worth making a new release for this.
@Kermalis, I don't know if you have the time and energy to copy the MultivariateNormal distribution to your own project (place it in a different namespace) and apply the fix there to verify it the context of PPO. If not, I'll make a release based on the fix working with the small repro @yueyinqiu posted.
Hello, I'm really grateful you guys took the time to look into this and got to the bottom of it so quickly :)
I'm fine with either solution honestly, but a new nuget release would be of course better. It looks like it might be worth it since I think there are quite a few pending changes on github compared to nuget
Hello, I'm really grateful you guys took the time to look into this and got to the bottom of it so quickly :)
I'm fine with either solution honestly, but a new nuget release would be of course better. It looks like it might be worth it since I think there are quite a few pending changes on github compared to nuget
Yes, a new release will happen. I would prefer to figure out whether this fix is enough, or whether there are further problems downstream from it. Since you have the ported PPO code, it would be valuable if you could test it in that context. If you don't have the time, I understand and will release without that deeper validation.
I should be able to try it out soon, I'll let you know how it goes
I just verified that the fix works on my end and I'm now able to train :) Thanks a bunch. I'll be looking forward to the next nuget releases
Okay. I'll merge and then do a release this afternoon.
Hello, I have ported over this code from pytorch: https://github.com/nikhilbarhate99/PPO-PyTorch/blob/master/PPO.py
It's just a PPO implementation. However, this line is where I'm seeing a difference in behavior: https://github.com/nikhilbarhate99/PPO-PyTorch/blob/master/PPO.py#L97
I have tested this pytorch code locally with numpy and it works flawlessly. However, this equivalent line in C# is where I get an exception.
System.Runtime.InteropServices.ExternalException: 'shape '[1, 3, 1, 1, 3]' is invalid for input of size 3'
The C# equivalent looks like: (Ignore the lack of disposes for now)
I have checked each tensor individually and verified that the sizes are identical in C# and python. Below I have printed the
actionMean
,_actionVar
,covMat
andaction
tensors:Python:
C#:
Simply put, they are [3], [3], [1x3x3], and [1x3] respectively. I'm not sure why the error happens in TorchSharp and not pytorch unless it's just a bug. I don't understand the nonsensical dimensions the exception is talking about. I went over this all day and even talked with some LLMs and they're convinced the code is ported properly (and I am too at this point since I have verified the tensor sizes myself). Unfortunately I can only talk to LLMs since I don't know anyone in this field and I have no official education. Any help is appreciated here, I'd like to get this up and running in C#