Address potential overfitting wrt matching data

NNPDF / nnusf

An open source machine learning framework that provides predictions for all-energy neutrino structure functions.

https://nnpdf.github.io/nnusf/

GNU General Public License v3.0

Address potential overfitting wrt matching data #55

Closed Radonirinaunimi closed 2 years ago

Radonirinaunimi commented 2 years ago

It seems that the models are over-fitting on the Yadism matching data, see the following report for example. We might do something about this.

RoyStegeman commented 2 years ago

Why do you think they are overfitted? The training chi2 is ~1 for the yadism data and ~2 for the experimental data, which is what one would expect, I think. Namely, the yadism data has only one level of statistical fluctuations (the one from the pseudodata generation), while the experimental data has the fluctuations from the pseudodata generation on top of the fluctuations already present because the experimental central values are themselves randomly sampled values (accompanied by possible inconsistencies in experiment or theory that may further affect the chi2).

If you want to really test for overfitting you could of course check how the agreement with the fitted yadism data compares to the agreement with some other predictions that are not in the matching dataset, but for now these results don't really worry me too much.

Radonirinaunimi commented 2 years ago

I was indeed expecting the $\chi^2$ of the matching data to be better than that of the real experimental data. What I was slightly worried about was the very small values of $\chi^{2, \rm exp}_{\rm match}$. It could be that these values are what one would expect, in the sense that this situation is similar to a level 0 CT(?).

RoyStegeman commented 2 years ago

Is the exp chi2 defined wrt the central value PDF or is it calculated for each PDF and then averaged? In the first case I would indeed expect it to vanish as 1/sqrt(Nrep). In the latter case I am not entirely sure what to expect, but note that also for a regular NNPDF fit the average experimental chi2 is quite a bit lower than the average training/validation losses (with a tiny contribution coming from the t0 prescription being used for the exp and not the tr/vl losses).
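To make the two definitions concrete, here is a minimal NumPy sketch. It is a toy, not the nnusf implementation: the function names, the diagonal covariance and the way the replica predictions are generated are all assumptions made for illustration. The only general statement it relies on is that, for a fixed covariance matrix, the replica-averaged chi2 equals the chi2 of the central prediction plus the spread of the replicas around that central prediction, so it can never be the smaller of the two.

```python
import numpy as np


def chi2(pred, data, invcov):
    """chi2 per data point between one prediction vector and the data."""
    diff = pred - data
    return float(diff @ invcov @ diff) / data.size


def chi2_central(replica_preds, data, invcov):
    """Definition 1: chi2 of the replica-averaged (central) prediction."""
    return chi2(replica_preds.mean(axis=0), data, invcov)


def chi2_replica_average(replica_preds, data, invcov):
    """Definition 2: chi2 computed replica by replica, then averaged."""
    return float(np.mean([chi2(p, data, invcov) for p in replica_preds]))


# Toy setup (assumption): uncorrelated uncertainties and replica predictions
# that scatter around the data with exactly the data uncertainty.
rng = np.random.default_rng(0)
ndat, nrep, sigma = 50, 200, 0.1
data = np.sin(np.linspace(0.0, 3.0, ndat))
invcov = np.eye(ndat) / sigma**2
replica_preds = data + sigma * rng.standard_normal((nrep, ndat))

central = chi2_central(replica_preds, data, invcov)
averaged = chi2_replica_average(replica_preds, data, invcov)
print(f"central prediction: {central:.4f}")
print(f"replica average:    {averaged:.4f}")
```

In this toy the replica noise averages out of the central prediction, so the first number shrinks as the number of replicas grows, while the second stays of order one.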

Radonirinaunimi commented 2 years ago

> Is the exp chi2 defined wrt the central value PDF or is it calculated for each PDF and then averaged? In the first case I would indeed expect it to vanish as 1/sqrt(Nrep). In the latter case I am not entirely sure what to expect, but note that also for a regular NNPDF fit the average experimental chi2 is quite a bit lower than the average test/validation losses (with a tiny contribution coming from the t0 prescription being used for exp and not tr/vl losses)

Currently, the experimental $\chi^2$ is computed in the latter way, i.e. calculated for each PDF and then averaged.

RoyStegeman commented 2 years ago

So the pseudodata by construction has chi2=1 (within stat. fluctuations). In the report it seems that somehow the chi2 defined with respect to the central data is close to 0, while the chi2 defined with respect to the pseudodata is close to 1. This almost gives the impression that the NN doesn't really fit the fluctuations but is rather unaffected by the level-1 noise we introduce when generating pseudodata. Not sure if this is a problem or not (maybe it is, since the uncertainties of the matching are coming from NNPDF4.0 and maybe we want to reproduce them exactly and not get smaller uncertainties), but if anything I would think it's underfitting rather than overfitting. If it is a problem, adding an additional layer of noise (so level-2 data) should place the matching data on the same footing as the real data and change the posterior distribution in the matching region.

juanrojochacon commented 2 years ago

Hi @RoyStegeman @Radonirinaunimi, I would treat yadism and real data on exactly the same footing, so the yadism pseudo-data should also be fluctuated twice wrt the true values (like in a level 2 closure test, basically).

If we do this, we should find that after the fit, for individual replicas $\chi^2/n_{\rm dat} \sim 2$, while for the central prediction averaged over replicas $\chi^2/n_{\rm dat} \sim 1$, both for real data and for yadism pseudo-data.
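A toy closure test can illustrate these numbers. The sketch below is only an illustration under strong assumptions: the "fit" is a least-squares cubic polynomial instead of a neural network, the uncertainties are uncorrelated, and all names (`closure`, `fit`, the dictionary keys) are made up for this example. Here "training" means each replica compared with its own pseudodata, and the two "exp" entries compare the predictions with the central data, either replica by replica or for the replica-averaged prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
ndat, nrep, npar, sigma = 200, 500, 4, 0.05

x = np.linspace(0.0, 1.0, ndat)
design = np.vander(x, npar, increasing=True)  # columns 1, x, x^2, x^3
truth = design @ np.array([1.0, -0.5, 0.3, 0.2])  # assumed underlying law


def fit(pseudodata):
    """Least-squares polynomial fit; returns the fitted prediction."""
    coeffs, *_ = np.linalg.lstsq(design, pseudodata, rcond=None)
    return design @ coeffs


def chi2(pred, data):
    """chi2 per data point for uncorrelated uncertainties sigma."""
    return float(np.sum(((pred - data) / sigma) ** 2)) / ndat


def closure(level):
    # level 1: central data equal to the truth, replicas fluctuated once
    # level 2: central data already fluctuated, replicas fluctuated on top
    shift = sigma * rng.standard_normal(ndat) if level == 2 else 0.0
    central = truth + shift
    replicas = central + sigma * rng.standard_normal((nrep, ndat))
    preds = np.array([fit(r) for r in replicas])
    return {
        "training": float(np.mean([chi2(p, r) for p, r in zip(preds, replicas)])),
        "exp_replica_avg": float(np.mean([chi2(p, central) for p in preds])),
        "exp_central": chi2(preds.mean(axis=0), central),
    }


level1 = closure(1)
level2 = closure(2)
for name, result in (("level 1", level1), ("level 2", level2)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

With level-1 data the training chi2 comes out close to 1 while both chi2 values with respect to the central data are very small, which resembles the pattern seen for the matching data. With level-2 data the training chi2 moves to about 2 and the chi2 of the central prediction to about 1. Because the polynomial is too rigid to follow the noise, the replica-averaged chi2 with respect to the central data also stays near 1 in this toy; a flexible network need not behave the same way.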

I think this is the correct approach conceptually, and it is also easier to explain.

Does it make sense?

RoyStegeman commented 2 years ago

I agree, that seems to be the way to go.

This is indeed what we want to achieve:

> the central prediction averaged over replicas chi2/ndat \sim 1

juanrojochacon commented 2 years ago

Perfect, then let's get this done ;)