activatedgeek / svgd

PyTorch implementation of Stein Variational Gradient Descent
https://sanyamkapoor.com/kb/the-stein-gradient

Bug in SVGD.phi()? #1

Open cagatayyildiz opened 4 years ago

cagatayyildiz commented 4 years ago

Hi,

I was reading your code, the svgd/Stein.ipynb notebook in particular, and noticed that in the SVGD.phi() function you have grad_K = -autograd.grad(K_XX.sum(), X)[0]. Is this correct? In my understanding, phi() is supposed to compute the equation right above it, and there is no minus sign there. Could you take a look?
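For reference, the update that phi() should implement, i.e. the standard SVGD transform from the paper (as I understand it), is

    \hat{\phi}^*(x) = \frac{1}{n} \sum_{j=1}^{n} \left[ k(x_j, x) \, \nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x) \right]

and the second (repulsive) term carries no explicit minus sign.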

Thanks, Cagatay.

activatedgeek commented 4 years ago

This has been troubling me as well, and I'm glad you pointed it out. If I make that modification, the particles seem to collapse to roughly the mode, something like this:

[image: visualization of the particles collapsing near the mode]

I see a similar modification made in the paper's Python code (Line 19 here).

My intuitive understanding here was that effectively the dynamics are "log likelihood maximization" + "regularization force in the opposite direction". What do you think?

NOTE: It is pretty easy to spin up a Google Colab from the README; it is self-contained. It might be worth playing around and figuring out which of the signs I messed up to accommodate the outer one.

activatedgeek commented 4 years ago

Comparing the kernel function and gradients to the paper's code, here are some sanity checks.

  # Imports for this snippet (n, gauss, device, and the kernel instance K come
  # from the notebook; svgd_kernel is the analytic kernel from the paper's code).
  import numpy as np
  import torch
  from torch import autograd

  X_init = (3 * torch.randn(n, *gauss.event_shape)).to(device)

  X = X_init.clone().requires_grad_(True)
  kxy_pt = K(X, X.detach())                     # kernel matrix via my PyTorch kernel
  dxkxy_pt = autograd.grad(kxy_pt.sum(), X)[0]  # autograd gradient w.r.t. X

  kxy, dxkxy = svgd_kernel(X_init.cpu().numpy())  # paper's analytic kernel + gradient

  # Kernel values agree; the gradients agree only after flipping the sign.
  assert np.allclose(kxy, kxy_pt.detach().cpu().numpy(), rtol=1e-06, atol=1e-06)
  assert np.allclose(dxkxy, -dxkxy_pt.detach().cpu().numpy(), rtol=1e-06, atol=1e-06)
  print(np.max(dxkxy + dxkxy_pt.detach().cpu().numpy()))

X_init is some initialization (in this case a standard normal of size 10 x 2), and K is an instance of my kernel method. Notice that autograd returns a gradient which is precisely the same but with the signs inverted. I also ran multiple trials of the above code to make sure the values align.

Here's a set of maximum difference values, which are well within the margin of error:

1.1973141393617492e-07
1.0310483256059655e-07
5.165759132674808e-07
2.4548847255001505e-07
8.211055531615052e-07
1.6124018087371184e-07
2.888325497663047e-07
2.0313685900053002e-07
1.0944325147532741e-07
1.4054520663941972e-07
cagatayyildiz commented 4 years ago

In my understanding, the minus sign in the author's code is due to the derivation that is hardcoded by the author. In your case, autograd does it, so there should be no need for a minus sign (but I'm not sure). What happens when you compare dxkxy against dxkxy_pt? Maybe the values are already too small?

cagatayyildiz commented 4 years ago

Check https://github.com/janhuenermann/svgd/blob/master/SVGD.py or https://github.com/dtak/ocbnn-public/blob/master/bnn/svgd.py#L52; there is no need for an extra minus there.

activatedgeek commented 4 years ago

> the minus sign in the author's code is due to the derivation that is hardcoded by the author

Ah apologies, I misspoke.

> What happens when you compare dxkxy against dxkxy_pt? Maybe the values are already too small?

I posted the maximum of the sum dxkxy + dxkxy_pt from different runs above; the values are very close to zero. Is there anything else you'd like me to try, given that the kernel evaluations are precisely the same?

> Check https://github.com/janhuenermann/svgd/blob/master/SVGD.py or https://github.com/dtak/ocbnn-public/blob/master/bnn/svgd.py#L52; there is no need for an extra minus there.

I agree. There shouldn't be a need to manipulate the sign. I'm not sure why autograd is giving me gradients with the signs effectively inverted. Any ideas on how to debug this?

activatedgeek commented 4 years ago

UPDATE: I don't see any particular difference between the TF2 code you referenced and my implementation except for the gradient sign. It is weird that it works on all the mixture-of-Gaussians examples I have.

Let me know if you were able to investigate this further.

shwangtangjun commented 4 years ago

For those who are still interested in why there should be a minus sign in the code (otherwise the result is far from satisfying), I have thoroughly investigated the algorithm provided in the paper and found that the last term is not the "actual gradient".

For example, let's consider a set of initial particles with only 2 points. According to the algorithm, \phi(x) should be

[image: phi]
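Written out for two particles, the update from the algorithm reads

    \phi(x_i) = \frac{1}{2} \sum_{j=1}^{2} \left[ k(x_j, x_i) \, \nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x_i) \right], \qquad i = 1, 2.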

Or, in matrix form

[image: phi_matrix, the update written in matrix form]

However, our RBF kernel should be

[image: kernel, the RBF kernel matrix]

which is the transpose of the matrix that appears in \phi(x). Although the RBF kernel is symmetric, so the kernel matrix equals its transpose, the last derivative term should still be given a minus sign by hand, due to a simple fact (more detailed explanation later):

[image: minus, the sign identity]
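For the RBF kernel k(x_1, x_2) = exp(-||x_1 - x_2||^2 / (2h)), which depends only on the difference x_1 - x_2, this is presumably the identity

    \nabla_{x_1} k(x_1, x_2) = -\nabla_{x_2} k(x_1, x_2),

i.e. the derivative with respect to one argument is minus the derivative with respect to the other.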

Other than that, while coding I found that there should also be a factor of 1/2 before the derivative if we use the autograd mechanism in PyTorch or TensorFlow. Before the explanation, let's have a brief review of the autograd mechanism. Here the output is k(x), which is a matrix, and the input is x, which is a vector. The following is my code:

    kernel = kernel_rbf(inputs)
    kernel_grad = torch.autograd.grad(kernel, inputs, torch.ones_like(kernel))[0]

The most confusing part (at least for me) is how torch.ones_like(kernel) functions. As far as I know, the derivative of a matrix with respect to a vector is not well-defined. PyTorch implements it as a weighted sum:

[image: autograd, the weighted sum]
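Concretely, torch.autograd.grad(kernel, inputs, weights) returns the gradient of the weighted scalar sum over the entries of the kernel matrix with respect to the inputs; with a 2 x 2 kernel and weights a, b, c, d this is

    \nabla_{(x_1, x_2)} \big( a \, k(x_1, x_1) + b \, k(x_1, x_2) + c \, k(x_2, x_1) + d \, k(x_2, x_2) \big).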

Currently we have a = b = c = d = 1 (or, in the author's code, .sum(), which yields the same result). Then, since

[image: kernel_equal]

We have

[image: kernel_grad, the gradient returned by autograd]

Let's now go back to the matrix form of \phi(x). The derivative term is

[image: derivative, the derivative term]

Finally! We have found where the minus sign comes from. We also find that there should be a factor of 1/2. However, why isn't there a 1/2 in the author's code? That is because the author manually calls detach() on the inputs while calculating the kernel matrix:

    K_XX = self.K(X, X.detach())

The detach() method separates a tensor from the computational graph, so it contributes no gradient. As a result, autograd happens to provide a gradient that is half the "true gradient". But in my opinion this is more of a coincidence, and perhaps the following coding style is more transparent:

    K_XX = self.K(X, X)
    grad_K = -0.5 * autograd.grad(K_XX.sum(), X)[0]
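As a quick numerical check of this claim, here is a sketch; kernel_rbf below is a stand-in fixed-bandwidth RBF kernel, not the kernel class used in this repo:

    # With the second argument detached, autograd returns minus the analytic
    # repulsive term; with both arguments attached it returns minus twice that
    # term, so -0.5 * grad recovers it as well.
    import torch
    from torch import autograd

    torch.manual_seed(0)
    n, d, h = 10, 2, 1.0

    def kernel_rbf(A, B):
        # K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * h))
        sq_dist = (A.unsqueeze(1) - B.unsqueeze(0)).pow(2).sum(-1)
        return torch.exp(-sq_dist / (2 * h))

    X = torch.randn(n, d, requires_grad=True)

    # Analytic term: sum_j grad_{x_j} k(x_j, x_i) = sum_j (x_i - x_j) * k(x_i, x_j) / h
    with torch.no_grad():
        K = kernel_rbf(X, X)
        analytic = (K.unsqueeze(-1) * (X.unsqueeze(1) - X.unsqueeze(0))).sum(dim=1) / h

    # Case 1: second argument detached (as in this repo) -> minus the analytic term.
    grad_detached = autograd.grad(kernel_rbf(X, X.detach()).sum(), X)[0]
    assert torch.allclose(-grad_detached, analytic, atol=1e-5)

    # Case 2: both arguments attached -> minus twice the analytic term.
    grad_full = autograd.grad(kernel_rbf(X, X).sum(), X)[0]
    assert torch.allclose(-0.5 * grad_full, analytic, atol=1e-5)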

That's all! It took me quite some time to figure out why, but it was also very rewarding. I hope this answer helps. I have also created a repository, https://github.com/shwangtangjun/SVGD-PyTorch, which implements SVGD in a more Pythonic style.

activatedgeek commented 3 years ago

Hey @shwangtangjun! I just noticed that you've put a lot of effort into figuring this out. Thanks for this!