EiffL opened this issue 2 years ago

Now that we have a prototype of a working C^k coupling layer (from #10), we can try to see whether we can build a full flow that can be trained with the score. This issue is to document our tests and results.
So if I try to fit the score with `_loss = (scorehat - score)^2`, `_logbound=4` (meaning `max(a)` ≈ e^4 ≈ 54), and `sin` as the activation function, I get this:
I think it's because of the behavior of both the forward and inverse log-det-Jacobians (ldj):
So I tried with `_logbound=2` (`max(a)` ≈ e^2 ≈ 7) and `_min_density_lowerbound=4e-1` (in order to get smoother forward/inverse ldj), but it still doesn't work.
So I computed grad(ildj) and grad(grad(ildj)) (ildj = inverse ldj), and grad(grad(ildj)) degenerates, so the optimization does not seem possible.
For these plots I use:
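A minimal JAX sketch of how these quantities can be computed, with `tfb.Sigmoid` standing in for the actual ramp bijector:

```python
import jax
import jax.numpy as jnp
from tensorflow_probability.substrates import jax as tfp

tfb = tfp.bijectors

# Stand-in bijector; the actual experiment uses the ramp-based bijector.
bij = tfb.Sigmoid()

# Inverse log-det-Jacobian as a scalar function of a scalar input.
ildj = lambda y: bij.inverse_log_det_jacobian(y, event_ndims=0)

grad_ildj = jax.grad(ildj)        # d(ildj)/dy
grad2_ildj = jax.grad(grad_ildj)  # d^2(ildj)/dy^2, the quantity that degenerates

ys = jnp.linspace(1e-3, 1.0 - 1e-3, 200)
vals = jax.vmap(grad2_ildj)(ys)   # values to plot against ys
```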
Hmmm, interesting... Which bijector is this? The cubic ramp?
yup
x^5
x^7
x^9
not great....
I'm trying out their simple `NonCompactAffineSigmoid` to see what it does. It doesn't use a ramp at all, just a rescaling of a sigmoid function.
You mean this:
?
Hmm, yeah, I think so.
It seems to work decently, but our Newton method for finding the root fails from time to time and blows up during training... I'm trying to see what can be done to improve that.
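For context, here's a minimal illustrative sketch of this kind of bijector and its Newton inversion (not the merged implementation; the exact parameterization of `NonCompactAffineSigmoid` may differ):

```python
import jax
import jax.numpy as jnp

def forward(x, a=2.0, b=0.0, alpha=0.1):
    # A rescaled sigmoid mixed with a fraction of the identity, so the
    # output range covers all of R (hence "non-compact").
    return alpha * x + (1.0 - alpha) * jax.nn.sigmoid(a * x + b)

def inverse_newton(y, n_steps=20):
    # Newton iterations to invert `forward`: the identity mixing removes
    # the closed-form inverse. This root-find is the step that can fail
    # and blow up training when an iterate leaves the well-behaved region.
    f = lambda x: forward(x) - y
    df = jax.grad(f)
    x = y  # crude initial guess
    for _ in range(n_steps):
        x = x - f(x) / df(x)
    return x

x0 = jnp.array(0.3)
x_rec = inverse_newton(forward(x0))  # should recover ~0.3
```

Damping the Newton updates, or falling back to bisection on a bracketing interval when an iterate escapes it, are the usual ways to make this kind of root-find more robust.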
Ok, I've merged my code for this bijector above ^^^^^
Did you get a nicely working configuration?
Yup, I used 3 coupling layers (CL) and an exponential decay of the learning rate every 1000 updates: notebook
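For reference, a minimal optax sketch of that kind of schedule (the initial rate and decay factor here are placeholders, not the notebook's values):

```python
import optax

# Exponential decay of the learning rate applied every 1000 updates.
schedule = optax.exponential_decay(
    init_value=1e-3,        # assumed initial learning rate
    transition_steps=1000,  # decay interval from the description
    decay_rate=0.9,         # assumed decay factor
    staircase=True,         # decay in discrete jumps rather than continuously
)
optimizer = optax.adam(learning_rate=schedule)
```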
Here are the results for 10,000 updates (I saved the contour plot every 1000 updates to see the evolution):
loss = nll
loss = score
loss = score + nll
loss = 2*score + nll
loss = score + 2*nll
Nice :-D OK, so it looks like this model can be fitted to the target distribution quite well under an NLL. The score kind of works... but it has a hard time getting the relative amplitude of the two modes right.
In your combination experiments, I guess you need to downweight the score loss a lot more: the score loss is probably larger than the NLL by orders of magnitude, which means it most likely dominates the combined loss during training.
Because the NLL alone gives you a perfect result in this test, you won't be able to see an improvement from adding the score loss. So I think you can try the following:
I had 2 mins so I did a quick try with just 64 data points:
Training with NLL alone:
Training with NLL + score/1000:
\o/
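For reference, a minimal sketch of the `nll + score/1000` objective, with a stand-in Gaussian log-density where the notebook uses the flow's `log_prob`:

```python
import jax
import jax.numpy as jnp

def log_prob(params, x):
    # Stand-in Gaussian log-density; the real experiment uses the flow.
    mu, log_sigma = params
    return (-0.5 * ((x - mu) / jnp.exp(log_sigma)) ** 2
            - log_sigma - 0.5 * jnp.log(2.0 * jnp.pi))

def true_score(x):
    # Stand-in target score (standard normal); the real experiment uses
    # the known score of the target distribution.
    return -x

def combined_loss(params, xs):
    nll = -jnp.mean(jax.vmap(lambda x: log_prob(params, x))(xs))
    # Model score: gradient of the model log-density w.r.t. the input.
    model_score = jax.vmap(jax.grad(lambda x: log_prob(params, x)))(xs)
    score_loss = jnp.mean((model_score - jax.vmap(true_score)(xs)) ** 2)
    return nll + score_loss / 1000.0

params = (jnp.array(0.0), jnp.array(0.0))
xs = jnp.linspace(-2.0, 2.0, 64)
print(combined_loss(params, xs))
```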