NNPDF / nnpdf

An open-source machine learning framework for global analyses of parton distributions.
https://docs.nnpdf.science/
GNU General Public License v3.0

Hyperoptimization metrics - implementation of $`\varphi^{2}`$ estimator #1849

Closed Cmurilochem closed 6 months ago

Cmurilochem commented 10 months ago

We are interested in implementing an additional metric in hyperopt that is sensitive to higher moments of the probability distribution, $\varphi^{2}$; see Eq. (4.6) of the NNPDF3.0 paper. As defined therein, and extended by @RoyStegeman and @juanrojochacon to the context of hyperoptimization, $\varphi^{2}$ can be calculated for each $k$-fold as

$$\varphi_{\chi^2_k}^2 = \langle \chi^2_k [ \mathcal{T}[f_{\rm fit}], \mathcal{D} ] \rangle_{\rm rep} - \chi^2_k [ \langle \mathcal{T}[f_{\rm fit}] \rangle_{\rm rep}, \mathcal{D} ]$$

where the first term is our usual hyper loss averaged over replicas, $\chi^2_k$, calculated from the dataset used in the fit ($\mathcal{D}$) and the theory predictions from each fitted PDF replica ($f_{\rm fit}$). The second term involves the same hyper loss, but now computed using the theory predictions from the central PDF (the replica-averaged PDF, if I understood correctly).
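To make the definition concrete, here is a minimal numpy sketch of the estimator with synthetic replica predictions, pseudo-data, and an identity inverse covariance matrix (all invented for illustration; the real quantities come from the fit):

```python
import numpy as np

rng = np.random.default_rng(0)

ndata, nreps = 5, 100
data = rng.normal(size=ndata)                         # pseudo-data D (synthetic)
preds = data + 0.1 * rng.normal(size=(nreps, ndata))  # replica predictions T[f_fit]
invcov = np.eye(ndata)                                # C^{-1}, identity for simplicity


def chi2(theory, data, invcov):
    """chi2 per data point for a single prediction vector."""
    diff = theory - data
    return diff @ invcov @ diff / len(data)


# first term: chi2 averaged over replicas
chi2_avg = np.mean([chi2(t, data, invcov) for t in preds])
# second term: chi2 of the replica-averaged (central) prediction
chi2_central = chi2(preds.mean(axis=0), data, invcov)

phi2 = chi2_avg - chi2_central
```

Since the $\chi^2$ is quadratic in the predictions, the difference is a variance-like quantity and is non-negative for a positive semi-definite covariance matrix.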

The idea would be to implement this new metric as an additional @staticmethod of the HyperLoss class.

I noticed that there already exists an implementation of $\varphi$ (probably from the NNPDF3.0 paper) in the phi_data function in validphys. This function depends on the abs_chi2_data function, which in turn depends on results.

To avoid code duplication, I think it would be nice to reuse these functions, probably via n3fit/vpinterface.py.

The problem is that I really do not know how to use these functions from validphys, especially results, which depends on the covariance_matrix and sqrt_covmat arguments.

Could anybody help me with this, or suggest an alternative way to do it? I would appreciate your help very much.

Radonirinaunimi commented 10 months ago

Hi @Cmurilochem, I'd have to look into this in detail in order to make well-informed comments. But one thing you will want to check is what we briefly discussed on Wednesday: whether or not validphys still holds in memory the objects that were used in the fit. If that's the case, there should be a reasonable way to compute $\varphi$ via n3fit/vpinterface.py, even if it doesn't rely directly on the implemented functions/methods.

Maybe others already have more insights?

scarlehoff commented 10 months ago

Hi @Cmurilochem, I will prepare something more complete, but for the time being the following code will let you play a bit with validphys and understand how it works (for more, here, the documentation is not complete but will already give you some idea).

Here I'm using the API to automagically get some results. In practice, most of the items for which I use the API are available internally from validphys in n3fit; the API, however, can be useful when scripting, and when developing, to understand how things work.

First we go to the computation of phi in validphys. There are a few different functions like this, but this one seems interesting: https://github.com/NNPDF/nnpdf/blob/ce6c05cacec95dee36c577503c9a1109daa2908a/validphys2/src/validphys/results.py#L542

It uses the result of abs_chi2_data. (Here we start to see how validphys works: when you use a runcard, or work from inside validphys, it tries to construct a graph from your final request back to your inputs. If it can, it runs; otherwise it tries to tell you what you are missing.)

We can start by calling the API with some inputs:

```python
from validphys.api import API

API.phi_data(
    dataset_input={"dataset": "NMC"},
    theoryid=400,
    use_cuts="internal",
    pdf="NNPDF40_nnlo_as_01180",
)
```

This will already produce a value, and you already have all these quantities when you are doing a fit. The first three keys are given in the runcard, and the PDF will be given by the vpinterface. Of course, since you are using the vp interface from inside the code, you are not allowed to use the API. It would also be silly to carry all this input around, since at the level of the fit you have already loaded the data, theory, and cuts; you might as well use something closer to the low-level functions.

You could also do this:

```python
from validphys.api import API
from validphys.results import phi_data

chi2 = API.abs_chi2_data(
    dataset_input={"dataset": "NMC"},
    theoryid=400,
    use_cuts="internal",
    pdf="NNPDF40_nnlo_as_01180",
)
phi_data(chi2)
```

Then, if you go to the definition of abs_chi2_data here, you will see that it requires a Results quantity, which is made of theory and data. If you look for this results quantity within the same file, you will see that it requires: 1) a dataset, 2) a PDF, 3) a covmat.

We got there! These are three elements that we already have during the fit: the dataset is prepared before arriving at validphys, the covmat you have access to since it is used to create the losses, and the PDF you will create with vpinterface. In the code below I create them with the API*, since I'm just scripting, but this is all you need (for a single dataset):

```python
from validphys.api import API
from validphys.results import abs_chi2_data, phi_data, results

ds = API.dataset(dataset_input={"dataset": "NMC"}, theoryid=400, use_cuts="internal")
covmat = API.covariance_matrix(dataset_input={"dataset": "NMC"}, theoryid=400, use_cuts="internal")
sqcov = API.sqrt_covmat(dataset_input={"dataset": "NMC"}, theoryid=400, use_cuts="internal")
pdf = API.pdf(pdf="NNPDF40_nnlo_as_01180")

res = results(ds, pdf, covmat, sqcov)
chi2 = abs_chi2_data(res)
phi_data(chi2)
```

And we have our final result: a number for phi! This works for only one dataset; to collect over datasets you will need to play a bit more (in n3fit you have one covmat per experiment, for instance), but I think it is a good start.
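One simple way to combine per-dataset values could be to weight each dataset's $\varphi^2$ by its number of data points. This is an assumption for illustration, not the validphys prescription: it neglects cross-dataset correlations, which a grouped covariance matrix would account for.

```python
import math


def combine_phi(per_dataset):
    """Combine per-dataset (phi, ndata) pairs into a single phi.

    Assumed weighting: phi^2 averaged over data points. This ignores
    correlations between datasets, which the full experimental
    covariance matrix in validphys would capture.
    """
    total_ndata = sum(n for _, n in per_dataset)
    phi2 = sum(n * phi**2 for phi, n in per_dataset) / total_ndata
    return math.sqrt(phi2)


# hypothetical per-dataset results: (phi, number of data points)
phi_tot = combine_phi([(0.2, 100), (0.4, 50)])
```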

Note: to first approximation (and probably to second and third) you should not need to modify anything in validphys to get this number from n3fit. At most you will need to add things to the signature of performfit https://github.com/NNPDF/nnpdf/blob/ce6c05cacec95dee36c577503c9a1109daa2908a/n3fit/src/n3fit/performfit.py#L21, which is what validphys reads to fill in the different items.

*And I'm being lazy: from the dataset it is possible to get the covmat (and sqrt covmat) without further API calls, by calling the appropriate functions.

Cmurilochem commented 9 months ago

Thanks @scarlehoff for your very detailed explanation. It works perfectly for me. During the past few days, however, I have been trying to learn how to run these functions internally, without the API. @APJansen stepped in today to help us, as you noticed in #1726.

The idea here (as you suggested) is to call results directly as the only way around. For this we would need to have as arguments a dataset, a PDF, and the (sqrt) covariance matrix.

I saw that you have already suggested an alternative way to obtain $\varphi^{2}$ from the central PDF via models["experimental"] (as the possible average of the PDF_0, PDF_1, ..., PDF_{N_replica} layers) in #1726. Maybe I will continue our discussion there together with @APJansen.

scarlehoff commented 9 months ago

Hi @Cmurilochem, some more notes to that:

For the PDF there's no way out: you need the vpinterface N3PDF to "trick" validphys into believing there is a central value, constructing that central value by taking the average of the replicas.

Instead, for the rest, you have all the information by the time the fit starts. The best way to test this is to go to the definition of the performfit function and add new arguments there. Put a breakpoint just after that, and you will see that those arguments are automatically filled by validphys. This should not be expensive in time or memory, since most of these have already been filled by validphys to generate the rest of the n3fit input.
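The suggestion above could look roughly like this (a sketch only; the real signature lives in n3fit/src/n3fit/performfit.py, and the new argument name must match an existing validphys provider, here assumed to be experiments_data):

```python
# sketch of n3fit/src/n3fit/performfit.py (argument names are illustrative)
def performfit(
    *,
    experiments_data=None,  # hypothetical new argument, filled by validphys
    **kwargs,  # stands in for the existing performfit arguments
):
    # validphys resolves each named argument from its providers, so by the
    # time execution reaches this point `experiments_data` is already
    # populated with the objects used to build the n3fit input.
    # breakpoint()  # inspect the injected objects here
    return experiments_data
```

The point is only that validphys fills arguments by name; nothing else in the fit needs to change.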

In this case you would be interested in something like experiments_data, which is a collection of all datasets grouped by experiment. With that and the "fake" PDF you already have dataset and pdf. The datasets that you get from experiments_data also include all the information about the covariance matrix (I don't remember the method for that off the top of my head), so you should have everything.

Let me know if anything doesn't work; I'm writing from memory, so don't trust the details 100% (if it is not experiments_data it might be experiments_dataset or something like that; have a look at results.py in the validphys package, where most of this is used).

Cmurilochem commented 9 months ago

Thank you @scarlehoff! Yep, I tested it. Two options appeared after including these arguments in performfit: data, of type validphys.core.DataGroupSpec, and experiments_data, of type list[validphys.core.DataGroupSpec], which could be used together with dataset_inputs_results in results.py. I will look closely into it in the next few days. Unfortunately, I will not be able to join our meeting tomorrow, but I will keep you updated from here. Thanks for your kind help.

Cmurilochem commented 9 months ago

Hi @scarlehoff. Thanks to your help, I think I have found a provisional solution to start with. The idea is to:

I will be reporting more details and possible ToDos/problems in #1726.