Justinezgh opened this issue 2 years ago
I started from 100, 200, 500, or 1000 simulations with the prior as the proposal distribution, then approximated the posterior sequentially with and without the score, but I still can't see a difference.
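To make sure we're comparing the same things, here is a minimal sketch of what I mean by "with score": the usual NLE likelihood term plus a small score-matching penalty using the simulator's gradients. All names here (`log_prob_fn`, `sim_score`, etc.) are hypothetical stand-ins, not our actual code:

```python
import jax
import jax.numpy as jnp

def nle_loss(params, log_prob_fn, theta, x):
    # Standard NLE term: fit q_params(x | theta) by maximum likelihood.
    return -jnp.mean(jax.vmap(log_prob_fn, (None, 0, 0))(params, x, theta))

def score_penalty(params, log_prob_fn, theta, x, sim_score):
    # Match the model's score d log q(x|theta)/d theta to the simulator's.
    model_score = jax.vmap(jax.grad(log_prob_fn, argnums=2), (None, 0, 0))(
        params, x, theta)
    return jnp.mean(jnp.sum((model_score - sim_score) ** 2, axis=-1))

def total_loss(params, log_prob_fn, theta, x, sim_score, lam=1e-6):
    # lam is the 1e-6 / 1e-7 weight mentioned throughout this thread.
    return (nle_loss(params, log_prob_fn, theta, x)
            + lam * score_penalty(params, log_prob_fn, theta, x, sim_score))
```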
Ok, I think we'll need to actually take a look at the posteriors to understand what's going on, and focus on a single round of additional training. So let's say you start with 100 sims. Can we see what your posterior looks like at that point?
And then, once you add 50 or 100 more simulations, can we look at the posterior with and without gradients?
A few comments:
Note that from your plots it looks like things have pretty much already converged to the final C2ST value after 300 sims, so you may just not have enough granularity in the number of simulations to see a change of behavior.
For building intuition and developing a proof of concept, I don't think running scripts in batch mode is the best way to go; it's usually very useful to hack on a notebook first to get a feel for how things evolve. That's why I'd like to see one Colab notebook that concentrates on just one round of sequential training, which we can easily hack on to try to understand why the gradients are not helping more.
Also, the C2ST metric is not going to tell us the full story, it's only a metric, so looking at the actual posterior visually will be more informative in the development phase ;-)
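For reference, this is roughly what the C2ST number reports: the cross-validated accuracy of a classifier trying to tell samples from the learned posterior apart from reference samples, where 0.5 means indistinguishable. A sketch using a hypothetical sklearn classifier, not necessarily the implementation you're using:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def c2st(samples_q, samples_p, seed=0):
    # Label approximate-posterior samples 0 and reference samples 1.
    X = np.concatenate([samples_q, samples_p])
    y = np.concatenate([np.zeros(len(samples_q)), np.ones(len(samples_p))])
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                        random_state=seed)
    # 5-fold CV accuracy: 0.5 = perfect match, 1.0 = fully distinguishable.
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
```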
Because we know that if we start from a proposal that is not too far from the true posterior, the gradients should help. This much we know.
So, the thing to understand is whether there are conditions under which the proposal is "bad" in a way that makes the gradients unhelpful. For instance, maybe it creates issues if the proposal doesn't fully cover the posterior; in that case we could, for instance, smooth out the proposal.
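One simple way to smooth the proposal would be a defensive mixture with the prior, so the proposal can never completely miss a region the posterior covers. A sketch with hypothetical `sample_proposal` / `sample_prior` helpers:

```python
import numpy as np

def sample_smoothed_proposal(sample_proposal, sample_prior, n, eps=0.1,
                             rng=np.random.default_rng(0)):
    # With probability eps fall back to the (broad) prior,
    # otherwise draw from the current proposal.
    from_prior = rng.random(n) < eps
    samples = sample_proposal(n)
    samples[from_prior] = sample_prior(int(from_prior.sum()))
    return samples
```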
The other thing that could go wrong, I guess, is the weighting of the different rounds of training. It's possible that if the model is not super flexible, it could stay stuck in the minimum favored by the first round of training and ignore the second round, which comparatively has only a few data points.
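To illustrate the concern: if the rounds are pooled, each round only matters in proportion to its number of simulations, so a later round with 50 sims is nearly invisible next to an earlier round with 500. One thing to try would be explicit per-round weights, something like this sketch (the weights and helpers are hypothetical):

```python
def weighted_pooled_loss(per_sample_loss, data_by_round, round_weights):
    # per_sample_loss(theta, x) -> per-sample losses for one round's data;
    # round_weights lets us up-weight recent rounds instead of a flat pool.
    total, norm = 0.0, 0.0
    for (theta, x), w in zip(data_by_round, round_weights):
        total += w * per_sample_loss(theta, x).sum()
        norm += w * len(x)
    return total / norm
```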
With 100 sims the posterior looks like this:
From this, if we add 100 sims (200 in total):
We add 100 sims again (300 in total):
We add 100 sims again (400 in total):
We add 100 sims again (500 in total):
Same thing but with the simulations and a 1e-6 * score term:
From the posterior with only 100 sims, if we add 100 sims (200 in total):
300 sims in total:
400 sims in total:
500 sims in total:
Yes, ok, I'll do a notebook!
You can find the notebook here
The posterior with 100 sims:
From the posterior with 100 sims, if we add 50 sims:
From the posterior with 100 sims, if we add 50 sims and 1e-7 * gradients:
I did this for one more round and got this:
Ahhh that's pretty interesting!
So, it does work, right? Maybe if you add 10 simulations at a time it will be even more obvious.
It seems to work, yes, and I'll try adding 10 simulations at a time (it's just a bit slow with the MCMC etc. ^^')
First posterior with 100 sims:
From this, when we add 10 sims:
10 sims again:
10 sims again:
From the first posterior, if we add 10 sims and 1e-7 * score:
10 sims and score again:
10 sims and score again:
For each round I trained the NF with 2000 updates, but since I only add 10 simulations I thought that was maybe too much, so I trained the first NF with 2000 updates and then used only 1000 updates for each subsequent round, and I got these plots:
The absolute values of the C2ST are still not great, but at least the plots do make our point.
Ok, but in theory we would like the NN to properly converge after each round. In practice, however, if the optimization is restarted from the previous point, you could remain stuck in a very bad minimum if you optimize too much in the first round.
What you could try is to restart the optimization from scratch after each round, just to see if poor early optimization is an issue.
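In code, the comparison I have in mind is roughly this (a sketch; `train`, `init_params`, and the data layout are placeholders):

```python
def train_rounds(init_params, train, rounds_data, n_updates=2000,
                 restart=False):
    # restart=False: warm-start each round from the previous parameters.
    # restart=True: re-initialize before each round, forgetting earlier optima.
    params = init_params
    posteriors = []
    for data in rounds_data:
        if restart:
            params = init_params
        params = train(params, data, n_updates)
        posteriors.append(params)
    return posteriors
```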
green: for each round, the NF is trained from the previous round's params with 2k updates
blue: same as green, but with 2k updates for the first round and then 1k updates per round
red: for each round, the NF is trained from scratch with 2k updates
(uncertainty estimated over 10 NFs)
NLE without score:
NLE with 1e-7 score:
SNLE with R = 5:
SNLE with R = 5 and 1e-7 score for each round:
-> Yes, SNLE is better, but I don't see a big difference between with and without the score.