Hi @Cmurilochem, I'd have to look into this in detail in order to make well-informed comments. But one thing you will want to check is what we briefly discussed on Wednesday: whether or not validphys still holds in memory the objects that were used in the fit. If that's the case, there should be a reasonable way to compute $\varphi$ through `n3fit/vpinterface.py`, even if it doesn't rely directly on the implemented functions/methods.
Maybe others already have more insights?
Hi @Cmurilochem, I will prepare something more complete, but for the time being the following code will already let you play a bit with validphys and understand how it works (for more, here, the documentation is not complete but will already give you some idea).
Here I'm using the API to automagically get some results. In practice, most of the items for which I use the API are available internally from validphys in n3fit; the API, however, can be useful when scripting and, while developing, to understand how things work.
First, let's go to the computation of `phi` in validphys. There are a few different functions like this, but this one seems interesting: https://github.com/NNPDF/nnpdf/blob/ce6c05cacec95dee36c577503c9a1109daa2908a/validphys2/src/validphys/results.py#L542, which uses the result of `abs_chi2_data` (we are starting to see how validphys works: when you use a runcard, or from inside validphys, it will try to construct a graph from your final request down to your input; if it is able to do so, it will run, otherwise it will tell you what you are missing).
We can start by calling the API with some inputs:
```python
from validphys.api import API

API.phi_data(
    dataset_input={"dataset": "NMC"},
    theoryid=400,
    use_cuts="internal",
    pdf="NNPDF40_nnlo_as_01180",
)
```
This will already produce a value, and you already have all these quantities when you are doing a fit. The first three keys are given in the runcard, and the PDF will be given by the vpinterface. Of course, since you are using the vp interface from inside the code, you are not allowed to use the API. It would also be silly to carry all this input around, since at the level of the fit you have already loaded the data, theory and cuts; you might as well use something closer to the low-level function.
You could also do this:

```python
from validphys.api import API
from validphys.results import phi_data

chi2 = API.abs_chi2_data(
    dataset_input={"dataset": "NMC"},
    theoryid=400,
    use_cuts="internal",
    pdf="NNPDF40_nnlo_as_01180",
)
phi_data(chi2)
```
Then, if you go to the definition of `abs_chi2_data` here, you will see that it requires a `Results` quantity, which is made of theory and data. If you look for this `results` quantity within the same file, you will see that it requires: 1) a dataset, 2) a PDF, 3) a covmat.
We got there! These are 3 elements that we already have during the fit: the dataset is prepared before arriving to validphys, the covmat you have access to since it is used to create the losses, and the PDF you will create with vpinterface. In the code below I will create them with the API*, since I'm just scripting, but this is all you need (for a single dataset):
```python
from validphys.api import API
from validphys.results import abs_chi2_data, phi_data, results

ds = API.dataset(dataset_input={"dataset": "NMC"}, theoryid=400, use_cuts="internal")
covmat = API.covariance_matrix(dataset_input={"dataset": "NMC"}, theoryid=400, use_cuts="internal")
sqcov = API.sqrt_covmat(dataset_input={"dataset": "NMC"}, theoryid=400, use_cuts="internal")
pdf = API.pdf(pdf="NNPDF40_nnlo_as_01180")

res = results(ds, pdf, covmat, sqcov)
chi2 = abs_chi2_data(res)
phi_data(chi2)
```
And we have our final result: a number for phi! This works for only one dataset; to collect over datasets you will need to play a bit more (in n3fit you have a covmat per experiment, for instance), but I think it is a good start.
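For intuition, the chain `results` → `abs_chi2_data` → `phi_data` essentially computes the difference between the replica-averaged $\chi^2$ and the $\chi^2$ of the central (replica-averaged) prediction. Below is a minimal numpy sketch of that computation with toy numbers; the per-data-point normalisation is my recollection of what `phi_data` does, and the real validphys functions of course also handle cuts, grouping, etc.:

```python
import numpy as np

rng = np.random.default_rng(0)

ndata, nrep = 5, 100
# Toy "experimental" data, covariance matrix and per-replica theory predictions.
data = rng.normal(size=ndata)
A = rng.normal(size=(ndata, ndata))
covmat = A @ A.T + ndata * np.eye(ndata)  # symmetric positive definite
theory = data + 0.1 * rng.normal(size=(nrep, ndata))  # one row per replica

invcov = np.linalg.inv(covmat)

def chi2(prediction):
    """chi2 of a single prediction vector against the data."""
    diff = prediction - data
    return diff @ invcov @ diff

# First term: chi2 averaged over replicas.
mean_chi2 = np.mean([chi2(t) for t in theory])
# Second term: chi2 of the central (replica-averaged) prediction.
central_chi2 = chi2(theory.mean(axis=0))

# phi, normalised per data point (an assumption based on how I read phi_data).
phi = np.sqrt((mean_chi2 - central_chi2) / ndata)
print(phi)
```

Since the covariance matrix is positive definite, the difference is always non-negative, so the square root is well defined.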
Note: to first approximation (and probably to second and third) you should not need to modify anything in validphys to get this number from n3fit. At most you will need to add things to the signature of `performfit`
https://github.com/NNPDF/nnpdf/blob/ce6c05cacec95dee36c577503c9a1109daa2908a/n3fit/src/n3fit/performfit.py#L21
which is what validphys will read to fill in the different items.
*And I'm being lazy: from the dataset it is possible to get the covmat (and sqrt covmat) without further calls to the API, by calling the appropriate functions.
Thanks @scarlehoff for your very detailed explanation. It works perfectly for me. During the past few days, however, I have been trying to learn how to run these functions internally without the API. @APJansen stepped in today to help us, as you realised from #1726.
The idea here (as you suggested) is to call `results` directly as the only way around. For this we would need to have as arguments:

- `dataset: DataSetSpec`: I noticed that inside `ModelTrainer.hyperparametrizable` we have access to `self.experimental["output"]`, which consists of a list of `ObservableWrapper` `@dataclass` instances that gather all the info about the input experimental datasets. The conversion from `ObservableWrapper` to `DataSetSpec` is not yet clear to me.
- `pdf: PDF`: here we could use the `N3PDF` objects via `vpinterface.py`, which is a child class of `PDF`.
- `covariance_matrix`: this could be obtained from the `ObservableWrapper.covmat` attribute.
- `sqrt_covmat`: from `ObservableWrapper.covmat` and the `sqrt_covmat` function from validphys.

I saw that you have already suggested an alternative idea on how to obtain $\varphi^{2}$ from the central PDF via `models["experimental"]` (as the possible average of the `PDF_0`, `PDF_1`, ..., `PDF_N_replica` layers) in #1726. Maybe I will continue our discussion there together with @APJansen.
Hi @Cmurilochem, some more notes on that:
For the PDF there's no way out: you need the vpinterface `N3PDF` to "trick" validphys into believing there is a central value, constructing that central value by taking the average of the replicas.
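The "trick" amounts to exposing an extra central member whose predictions are the average over the replica members. A toy numpy illustration of the idea (not the actual `N3PDF` code; the member-0 indexing convention is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the predictions of a fitted set of replicas:
# one row per replica, one column per point.
nrep, npoints = 10, 4
replicas = rng.normal(size=(nrep, npoints))

# The "central value" validphys expects is simply the replica average.
central = replicas.mean(axis=0)

# A PDF-like object would then expose the central value as member 0
# and the replicas themselves as members 1..nrep.
members = np.vstack([central, replicas])
print(members.shape)  # (nrep + 1, npoints)
```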
Instead, for the rest, you have all the information by the time the fit starts. The best way to test this is to go to the definition of the `performfit` function and add new arguments there. Put a breakpoint just after that and you will see that those arguments are automatically filled in by validphys.
This should not be expensive in time or memory, since most of these have already been filled in by validphys anyway to generate the rest of the n3fit input.
In this case you would be interested in something like `experiments_data`, which is the collection of all datasets grouped by experiment. With that and the "fake" PDF you already have `dataset` and `pdf`.
The datasets that you get from `experiments_data` also include all the information about the covariance matrix (I don't remember the method for that by heart), so you should have everything.
Let me know if anything doesn't work. I'm writing from memory, so don't trust the details 100% (if it is not `experiments_data` it might be `experiments_dataset` or something like that); have a look at `results.py` in the validphys package, where most of this is used.
Thank you @scarlehoff! Yep, I tested it. I have two options here that appeared after including these arguments in `performfit`: `data` of type `validphys.core.DataGroupSpec`, and `experiments_data` of type `list[validphys.core.DataGroupSpec]`, which could be used together with `dataset_inputs_results` in `results.py`. I will look closely into it in the next days. Unfortunately, I will not be able to join our meeting tomorrow, but will keep you updated from here. Thanks for your kind help.
Hi @scarlehoff. Thanks to your help I think I found a provisional solution to start with. The idea is to:

1. Pass `experiments_data` as an argument to the `ModelTrainer` instantiation inside `performfit`.
2. Use `self.experiments_data` together with `self.exp_info` inside `ModelTrainer.hyperparametrizable` to create a list of tuples defining the `validphys.core.DataGroupSpec` representations of each group of experimental datasets and the associated covariance matrices.
3. Pass the `N3PDF` object as an argument to `compute_phi2` in `vpinterface.py` to calculate the sum of $\varphi^2$. This is done by resorting to the built-in `validphys.results` functions: `results`, `abs_chi2_data` and `phi_data`.

I will be reporting more details and possible ToDos/problems in #1726.
We are interested in implementing an additional metric for `hyperopt` that is sensitive to higher moments of the probability distribution, $\varphi^{2}$; see Eq. (4.6) of the NNPDF3.0 paper. As defined therein and extended by @RoyStegeman and @juanrojochacon to the context of hyperoptimization, $\varphi^{2}$ can be calculated for each $k$-fold as

$$\varphi_{\chi^2_k}^2 = \langle \chi^2_k [ \mathcal{T}[f_{\rm fit}], \mathcal{D} ] \rangle_{\rm rep} - \chi^2_k [ \langle \mathcal{T}[f_{\rm fit}] \rangle_{\rm rep}, \mathcal{D} ] $$

where the first term represents our usual averaged-over-replicas hyper loss, $\chi^2_k$, calculated from the dataset used in the fit ($\mathcal{D}$) and the theory predictions ($\mathcal{T}$) of each fitted PDF replica ($f_{\rm fit}$). The second term involves the calculation of the same hyper loss, but now using the theory predictions from the central PDF (the averaged-over-replicas PDF, if I understood well).

The idea would be to implement this new metric as an additional `@staticmethod` of the `HyperLoss` class. I noticed that there already exists an implementation of $\varphi$ (probably from the NNPDF3.0 paper) in the `phi_data` function in `validphys`. This function depends on the `abs_chi2_data` function, which in turn depends on `results`. To avoid code duplication, I think it would be nice to use these functions, probably via `n3fit/vpinterface.py`.

The problem is that I really do not know how to use these functions from `validphys`, especially `results`, which depends on the `covariance_matrix` and `sqrt_covmat` arguments. Please, could anybody help me with that, or even suggest an alternative way to do so? I would appreciate your help very much.
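As a sanity check of the formula above: the data $\mathcal{D}$ actually cancels in the difference of the two terms, leaving the mean squared distance of the replica predictions from their average, measured in the $C^{-1}$ metric. This is why $\varphi^2$ probes the width of the replica distribution rather than the fit quality itself. A quick numerical check with toy numbers (not NNPDF data or code):

```python
import numpy as np

rng = np.random.default_rng(2)

ndata, nrep = 6, 50
data = rng.normal(size=ndata)
A = rng.normal(size=(ndata, ndata))
invcov = np.linalg.inv(A @ A.T + np.eye(ndata))  # inverse of an SPD covmat
theory = data + rng.normal(scale=0.2, size=(nrep, ndata))  # replica predictions

def chi2(pred, dat):
    d = pred - dat
    return d @ invcov @ d

central = theory.mean(axis=0)

# phi^2 (per fold, unnormalised): <chi2>_rep - chi2[central]
phi2 = np.mean([chi2(t, data) for t in theory]) - chi2(central, data)

# Identity: the same quantity is the mean squared distance of the
# replicas from their average, in the C^{-1} metric.
spread = np.mean([chi2(t, central) for t in theory])

print(np.isclose(phi2, spread))  # True: the data D drops out of the difference
```

The identity follows by expanding the quadratic forms: the cross terms involving $\mathcal{D}$ cancel between the two terms of the difference.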