EiffL opened this issue 2 years ago

Now that we have a prototype of a working C^k coupling layer (from #10), we can try to see whether we can build a full flow that can be trained with the score. This issue is to document our tests and results.
So if I try to fit the score with `_loss = (scorehat - score)^2`, `_logbound=4` (meaning `max(a)` ≈ e^4 ≈ 54), and `sin` as the activation function, I get this:
I think it's because of the behavior of both the forward and inverse log-det-Jacobians (ldj):
So I tried with `_logbound=2` (`max(a)` ≈ e^2 ≈ 7) and `_min_density_lowerbound=4e-1` (in order to get smoother forward/inverse ldj), but it still doesn't work.
So I computed grad(ildj) and grad(grad(ildj)) (ildj = inverse ldj), and grad(grad(ildj)) degenerates, so the optimization does not seem possible.
For these plots I use:
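A minimal JAX sketch of how these quantities can be computed, with `tfb.Sigmoid` standing in for the actual ramp bijector:

```python
import jax
import jax.numpy as jnp
from tensorflow_probability.substrates import jax as tfp

tfb = tfp.bijectors

# Stand-in bijector; the actual experiment uses the ramp-based bijector.
bij = tfb.Sigmoid()

# Inverse log-det-Jacobian as a scalar function of a scalar input.
ildj = lambda y: bij.inverse_log_det_jacobian(y, event_ndims=0)

grad_ildj = jax.grad(ildj)        # d(ildj)/dy
grad2_ildj = jax.grad(grad_ildj)  # d^2(ildj)/dy^2, the quantity that degenerates

ys = jnp.linspace(1e-3, 1.0 - 1e-3, 200)
vals = jax.vmap(grad2_ildj)(ys)   # values to plot against ys
```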
Hmmm, interesting... Which bijector is this? The cubic ramp?
yup
x^5
x^7
x^9
not great....
I'm trying out their simple `NonCompactAffineSigmoid` to see what it does. It doesn't use a ramp at all, just a rescaling of a sigmoid function.
You mean this:
?
Hmm, yeah, I think so.
It seems to work decently, but our Newton method for finding the root fails from time to time and blows up during training... I'm trying to see what can be done to improve that.
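For context, here's a minimal illustrative sketch of this kind of bijector and its Newton inversion (not the merged implementation; the exact parameterization of `NonCompactAffineSigmoid` may differ):

```python
import jax
import jax.numpy as jnp

def forward(x, a=2.0, b=0.0, alpha=0.1):
    # A rescaled sigmoid mixed with a fraction of the identity, so the
    # output range covers all of R (hence "non-compact").
    return alpha * x + (1.0 - alpha) * jax.nn.sigmoid(a * x + b)

def inverse_newton(y, n_steps=20):
    # Newton iterations to invert `forward`: the identity mixing removes
    # the closed-form inverse. This root-find is the step that can fail
    # and blow up training when an iterate leaves the well-behaved region.
    f = lambda x: forward(x) - y
    df = jax.grad(f)
    x = y  # crude initial guess
    for _ in range(n_steps):
        x = x - f(x) / df(x)
    return x

x0 = jnp.array(0.3)
x_rec = inverse_newton(forward(x0))  # should recover ~0.3
```

Damping the Newton updates, or falling back to bisection on a bracketing interval when an iterate escapes it, are the usual ways to make this kind of root-find more robust.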
Ok, I've merged my code for this bijector above ^^^^^
Did you get a nicely working configuration?
Yup, I used 3 coupling layers (CL) and an exponential decay of the learning rate every 1000 updates: notebook
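For reference, a minimal optax sketch of that kind of schedule (the initial rate and decay factor here are placeholders, not the notebook's values):

```python
import optax

# Exponential decay of the learning rate applied every 1000 updates.
schedule = optax.exponential_decay(
    init_value=1e-3,        # assumed initial learning rate
    transition_steps=1000,  # decay interval from the description
    decay_rate=0.9,         # assumed decay factor
    staircase=True,         # decay in discrete jumps rather than continuously
)
optimizer = optax.adam(learning_rate=schedule)
```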
Here are the results for 10,000 updates (I saved the contour plot every 1000 updates to see the evolution):
loss = nll
loss = score
loss = score + nll
loss = 2*score + nll
loss = score + 2*nll
Nice :-D OK, so it looks like this model can be fitted to the target distribution quite well under an NLL. The score kind of works... but it has a hard time getting the relative amplitude of the two modes right.
In your combination experiments, I guess you need to downweight the score loss a lot more: the score loss is probably larger than the NLL by orders of magnitude, which means it most likely dominates the combined loss during training.
Because the NLL alone gives you a perfect result in this test, you won't be able to see an improvement from adding the score loss. So I think you can try the following:
I had 2 mins so I did a quick try with just 64 data points:
Training with NLL alone:
Training with NLL + score/1000:
\o/
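For reference, a minimal sketch of the `nll + score/1000` objective, with a stand-in Gaussian log-density where the notebook uses the flow's `log_prob`:

```python
import jax
import jax.numpy as jnp

def log_prob(params, x):
    # Stand-in Gaussian log-density; the real experiment uses the flow.
    mu, log_sigma = params
    return (-0.5 * ((x - mu) / jnp.exp(log_sigma)) ** 2
            - log_sigma - 0.5 * jnp.log(2.0 * jnp.pi))

def true_score(x):
    # Stand-in target score (standard normal); the real experiment uses
    # the known score of the target distribution.
    return -x

def combined_loss(params, xs):
    nll = -jnp.mean(jax.vmap(lambda x: log_prob(params, x))(xs))
    # Model score: gradient of the model log-density w.r.t. the input.
    model_score = jax.vmap(jax.grad(lambda x: log_prob(params, x)))(xs)
    score_loss = jnp.mean((model_score - jax.vmap(true_score)(xs)) ** 2)
    return nll + score_loss / 1000.0

params = (jnp.array(0.0), jnp.array(0.0))
xs = jnp.linspace(-2.0, 2.0, 64)
print(combined_loss(params, xs))
```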