Open cagatayyildiz opened 4 years ago
This has been kind of troubling me too, and I'm glad you pointed it out. If I make the modification, the system seems to collapse to (almost) the mode, something like this
I see a similar modification made in the paper's Python code (Line 19 here).
My intuitive understanding here was that effectively the dynamics are "log likelihood maximization" + "regularization force in the opposite direction". What do you think?
NOTE: It is pretty easy to spin up a Google Colab from the README; it is self-contained. It might be worth playing around with it to figure out which of the signs I messed up to accommodate the outside one.
Comparing the kernel function and gradients to the paper's code, here are some sanity checks.
```python
import numpy as np
import torch
from torch import autograd

X_init = (3 * torch.randn(n, *gauss.event_shape)).to(device)
X = X_init.clone().requires_grad_(True)

# kernel matrix with the second argument detached, as in my repository
kxy_pt = K(X, X.detach())
dxkxy_pt = autograd.grad(kxy_pt.sum(), X)[0]

# hand-derived kernel matrix and gradient from the paper's reference code
kxy, dxkxy = svgd_kernel(X_init.cpu().numpy())

assert np.allclose(kxy, kxy_pt.detach().cpu().numpy(), rtol=1e-06, atol=1e-06)
# the autograd gradient matches only after flipping its sign
assert np.allclose(dxkxy, -dxkxy_pt.detach().cpu().numpy(), rtol=1e-06, atol=1e-06)
print(np.max(dxkxy + dxkxy_pt.detach().cpu().numpy()))
```
`X_init` is some initialization (in this case a standard normal of size 10 x 2), and `K` is an instance of my kernel method. Notice that autograd returns a gradient that is precisely the same but with the signs inverted. I also ran multiple trials of the above code to make sure the values align.
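For readers without the two repositories at hand, here is a self-contained sketch of the pair of functions the check above assumes: a PyTorch RBF kernel `K` and a NumPy `svgd_kernel` in the style of the paper's reference code. The bandwidth is fixed at `h = 1` here for simplicity (the real implementations use a median heuristic), so this reproduces the sign flip but not the exact original code.

```python
import numpy as np
import torch
from torch import autograd

def K(X, Y, h=1.0):
    # PyTorch RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2)), written with a
    # broadcasting difference so the graph is differentiable everywhere
    sq_dists = ((X.unsqueeze(1) - Y.unsqueeze(0)) ** 2).sum(-1)
    return torch.exp(-sq_dists / (2 * h ** 2))

def svgd_kernel(theta, h=1.0):
    # NumPy kernel in the style of the paper's code: returns the kernel matrix
    # and the hand-derived repulsive term sum_j grad_{x_j} k(x_j, x_i)
    diffs = theta[:, None, :] - theta[None, :, :]
    sq_dists = (diffs ** 2).sum(-1)
    kxy = np.exp(-sq_dists / (2 * h ** 2))
    dxkxy = (-kxy @ theta + theta * kxy.sum(axis=1, keepdims=True)) / h ** 2
    return kxy, dxkxy

# reproduce the sanity check with the fixed bandwidth
n, d = 10, 2
X_init = 3 * torch.randn(n, d, dtype=torch.float64)
X = X_init.clone().requires_grad_(True)
kxy_pt = K(X, X.detach())
dxkxy_pt = autograd.grad(kxy_pt.sum(), X)[0]
kxy, dxkxy = svgd_kernel(X_init.numpy())

assert np.allclose(kxy, kxy_pt.detach().numpy(), atol=1e-6)
assert np.allclose(dxkxy, -dxkxy_pt.numpy(), atol=1e-6)  # same values, sign flipped
```

The second assertion is exactly the discrepancy discussed in this thread: the hand-derived `dxkxy` equals the autograd gradient with its sign inverted.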
Here's a set of maximum difference values, which are well within the margin of error.
1.1973141393617492e-07
1.0310483256059655e-07
5.165759132674808e-07
2.4548847255001505e-07
8.211055531615052e-07
1.6124018087371184e-07
2.888325497663047e-07
2.0313685900053002e-07
1.0944325147532741e-07
1.4054520663941972e-07
In my understanding, the minus sign in the author's code is due to the derivation that the author hardcoded. In your case, autograd does it, so there should be no need for a minus sign (but I'm not sure). What happens when you compare `dxkxy` against `dxkxy_pt`? Maybe the values are already too small?
Check https://github.com/janhuenermann/svgd/blob/master/SVGD.py or https://github.com/dtak/ocbnn-public/blob/master/bnn/svgd.py#L52; no extra minus sign is needed there.
> the minus sign in author's code is due to the derivation that is hardcoded by the author
Ah apologies, I misspoke.
> What happens when you compare dxkxy against dxkxy_pt? Maybe the values are already too small?
I listed the maximum values of `dxkxy + dxkxy_pt` from different runs above; they are very close to zero. Is there anything else you'd like me to try, given that the kernel evaluations are precisely the same?
> Check https://github.com/janhuenermann/svgd/blob/master/SVGD.py or https://github.com/dtak/ocbnn-public/blob/master/bnn/svgd.py#L52, no need for extra minus,
I agree; there shouldn't be a need to manipulate the sign. I'm not sure why `autograd` is giving me gradients with the signs effectively inverted. Any ideas on how to debug this?
UPDATE: I don't particularly see any difference between the TF2 code you referenced and my implementation, except for the gradient sign. It is weird that it works on all the Mixture of Gaussians examples I have.
Let me know if you were able to investigate this further.
For those who are still interested in why there should be a minus sign in the code (otherwise the result is far from satisfactory): I have thoroughly investigated the algorithm provided in the paper and found that the last term is not the "actual gradient".
For example, let's consider a set of initial particles with only 2 points. According to the algorithm, $\phi(x)$ should be

$$\phi(x) = \frac{1}{2}\sum_{j=1}^{2}\left[k(x_j, x)\,\nabla_{x_j}\log p(x_j) + \nabla_{x_j}k(x_j, x)\right]$$
Or, in matrix form (stacking $\phi(x_1)$ and $\phi(x_2)$):

$$\begin{pmatrix}\phi(x_1)\\ \phi(x_2)\end{pmatrix} = \frac{1}{2}\left[\begin{pmatrix}k(x_1,x_1) & k(x_2,x_1)\\ k(x_1,x_2) & k(x_2,x_2)\end{pmatrix}\begin{pmatrix}\nabla_{x_1}\log p(x_1)\\ \nabla_{x_2}\log p(x_2)\end{pmatrix} + \begin{pmatrix}\nabla_{x_1}k(x_1,x_1) + \nabla_{x_2}k(x_2,x_1)\\ \nabla_{x_1}k(x_1,x_2) + \nabla_{x_2}k(x_2,x_2)\end{pmatrix}\right]$$
However, our RBF kernel matrix should be

$$K = \begin{pmatrix}k(x_1,x_1) & k(x_1,x_2)\\ k(x_2,x_1) & k(x_2,x_2)\end{pmatrix}$$

which is the transpose of the matrix that appears in $\phi(x)$. Although the RBF kernel is symmetric and $K = K^T$, the last derivative term must be supplied manually with a minus sign, due to a simple fact (more detailed explanation later): since the RBF kernel depends only on the difference $x_i - x_j$,

$$\nabla_{x_j}k(x_i, x_j) = -\nabla_{x_i}k(x_i, x_j)$$
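This fact is easy to verify numerically. The sketch below uses the analytic RBF gradient with an illustrative bandwidth of 1 (the helper names are hypothetical, not from any repository in this thread):

```python
import numpy as np

def k_rbf(x, y, h=1.0):
    # RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * h ** 2))

def grad_x_k(x, y, h=1.0):
    # analytic gradient of k with respect to its FIRST argument
    return -k_rbf(x, y, h) * (x - y) / h ** 2

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)

# gradient w.r.t. the second argument = minus the gradient w.r.t. the first
# (grad_x_k(y, x) is the gradient of k(x, y) in its second slot, by symmetry)
assert np.allclose(grad_x_k(y, x), -grad_x_k(x, y))
```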
Other than that, during coding I found that there should also be a factor of 1/2 before the derivative if we use the autograd mechanism in PyTorch or TensorFlow. Before the explanation, let's have a brief review of the autograd mechanism. Here, the output is the kernel matrix $K$, and the input is $x$, which is a vector. The following is my code:
```python
kernel = kernel_rbf(inputs)
kernel_grad = torch.autograd.grad(kernel, inputs, torch.ones_like(kernel))[0]
```
The most confusing part (at least for me) is how `torch.ones_like(kernel)` functions. As far as I know, the derivative of a matrix with respect to a vector is not well-defined. PyTorch implements it as a weighted sum: with weights $a, b, c, d$ on the four kernel entries, the result is

$$\nabla_x\left[a\,k(x_1,x_1) + b\,k(x_1,x_2) + c\,k(x_2,x_1) + d\,k(x_2,x_2)\right]$$
Currently we have $a=b=c=d=1$ (or, in the author's code, `sum()`, which gives the same result). Then, since $k(x_1,x_1)$ and $k(x_2,x_2)$ are constant and $k(x_1,x_2) = k(x_2,x_1)$, we have

$$\nabla_{x_1}\sum_{i,j}k(x_i,x_j) = 2\,\nabla_{x_1}k(x_1,x_2), \qquad \nabla_{x_2}\sum_{i,j}k(x_i,x_j) = 2\,\nabla_{x_2}k(x_1,x_2)$$
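The `ones_like` behaviour itself can be checked in two lines, using an arbitrary differentiable matrix-valued function of a vector (the outer product here is just an illustrative stand-in for the kernel matrix):

```python
import torch

x = torch.randn(3, dtype=torch.float64, requires_grad=True)
M = torch.outer(x, x)  # a matrix-valued function of the vector x

# grad_outputs of all ones is the same weighted sum as differentiating .sum()
g_ones = torch.autograd.grad(M, x, torch.ones_like(M), retain_graph=True)[0]
g_sum = torch.autograd.grad(M.sum(), x)[0]
assert torch.allclose(g_ones, g_sum)
```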
Let's now go back to the matrix form of $\phi(x)$. The derivative term for particle $x_1$ is

$$\nabla_{x_1}k(x_1,x_1) + \nabla_{x_2}k(x_2,x_1) = -\nabla_{x_1}k(x_2,x_1) = -\frac{1}{2}\,\nabla_{x_1}\sum_{i,j}k(x_i,x_j)$$

and symmetrically for $x_2$.
Finally! We have found where the minus sign comes from, and also that there should be a factor of 1/2. However, why is there no 1/2 in the author's code? That is because the author manually calls detach() on the inputs while calculating the kernel matrix:
```python
K_XX = self.K(X, X.detach())
```
The method detach() separates a tensor from the computational graph, so that argument contributes no gradient. As a result, autograd happens to produce a gradient that is half the "true gradient". But in my opinion this is more like a coincidence, and the following coding style may be more transparent:
```python
K_XX = self.K(X, X)
grad_K = -0.5 * autograd.grad(K_XX.sum(), X)[0]
```
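The detach-versus-1/2 claim can be verified directly: for any symmetric kernel, differentiating through both arguments gives exactly twice the gradient of the half-detached version. A quick check with an RBF kernel at a fixed bandwidth of 1:

```python
import torch

def rbf(X, Y, h=1.0):
    # RBF kernel matrix via a broadcasting squared-distance computation
    sq = ((X.unsqueeze(1) - Y.unsqueeze(0)) ** 2).sum(-1)
    return torch.exp(-sq / (2 * h ** 2))

X = torch.randn(10, 2, dtype=torch.float64, requires_grad=True)

g_detached = torch.autograd.grad(rbf(X, X.detach()).sum(), X)[0]
g_full = torch.autograd.grad(rbf(X, X).sum(), X)[0]

# differentiating through both arguments doubles the gradient, so
# -g_detached == -0.5 * g_full: the two coding styles agree
assert torch.allclose(g_full, 2 * g_detached)
```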
That's all! It took me quite some time to figure out why, but it was also very rewarding. I hope this answer helps. I have also created a repository, https://github.com/shwangtangjun/SVGD-PyTorch, which implements SVGD in a more pythonic style.
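To tie the thread together, here is a minimal, self-contained SVGD sketch using the `-0.5 * autograd` convention discussed above, run on a standard-normal target. The bandwidth, step size, particle count, and iteration count are all illustrative choices, not taken from any of the referenced repositories:

```python
import torch

def rbf(X, Y, h=1.0):
    # RBF kernel matrix with a fixed, illustrative bandwidth
    sq = ((X.unsqueeze(1) - Y.unsqueeze(0)) ** 2).sum(-1)
    return torch.exp(-sq / (2 * h ** 2))

def svgd_step(X, log_prob, lr=0.1):
    n = X.shape[0]
    X = X.detach().requires_grad_(True)
    # driving term: grad log p at each particle
    score = torch.autograd.grad(log_prob(X).sum(), X)[0]
    K_XX = rbf(X, X)
    # repulsive term with the minus sign and 1/2 factor derived above
    grad_K = -0.5 * torch.autograd.grad(K_XX.sum(), X)[0]
    phi = (K_XX.detach() @ score + grad_K) / n
    return (X + lr * phi).detach()

# standard normal target, up to an additive constant
log_prob = lambda X: -0.5 * (X ** 2).sum(-1)

torch.manual_seed(0)
X = torch.randn(50, 2, dtype=torch.float64) + 3.0  # start far from the target
for _ in range(500):
    X = svgd_step(X, log_prob)
print(X.mean(0))  # particles should have drifted toward the origin
```

With the sign flipped on `grad_K`, the repulsive term turns attractive and the particles collapse toward the mode, which matches the behaviour reported at the top of this thread.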
Hey @shwangtangjun! I just noticed that you've put a lot of effort into figuring this out. Thanks for this!
Hi,
I was reading your code, the svgd/Stein.ipynb notebook in particular. I've noticed that in the `SVGD.phi()` function, you have `grad_K = -autograd.grad(K_XX.sum(), X)[0]`. Is this correct? In my understanding, `phi()` is supposed to compute the equation right above, and there is no minus sign there. Could you take a look? Thanks, Cagatay.