Potential issue with (quoted) Validation chi2 for CDHSW_F2, CHORUS_F2, NUTEV_F2

Radonirinaunimi commented 2 years ago

There seem to be some issues with the (quoted) validation $\chi^2$ values for the CDHSW_F2, CHORUS_F2, and NUTEV_F2 datasets. As an example, a fit including only NUTEV_F2 (with a single replica) yields the following results:

| Dataset | Epoch | REP ID | $\rm{N}_{\rm tr}$ | $\chi^2_{\rm tr}$ | $\rm{N}_{\rm vl}$ | $\chi^2_{\rm vl}$ | $\rm{N}_{\rm tot}$ | $\chi^2_{\rm exp}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NUTEV_F2 | 2316 | 35 | 58 | 1.847 | 20 | 922.098 | 78 | 8.721 |

There are $\rm{N}_{\rm tot} = 78$ points because no maximum $Q^{2}$ cut is imposed. The 20 data points forming the validation set are shown in the table below (with $Q^2_M$ the mapped values). The data vs prediction comparisons are also shown below for all the slices in $x$.

| $x$ | $Q^2_M~([0, 1])$ |
| --- | --- |
| 0.015 | 0 |
| 0.015 | 0.051 |
| 0.015 | 0.206 |
| 0.045 | 0.051 |
| 0.125 | 0.323 |
| 0.125 | 0.466 |
| 0.175 | 0.115 |
| 0.175 | 0.323 |
| 0.175 | 0.466 |
| 0.175 | 0.609 |
| 0.275 | 0.206 |
| 0.275 | 0.856 |
| 0.35 | 0.466 |
| 0.45 | 0.609 |
| 0.45 | 0.739 |
| 0.45 | 0.856 |
| 0.55 | 0.609 |
| 0.55 | 0.739 |
| 0.65 | 0.466 |
| 0.65 | 0.739 |

As shown above, there is no apparent reason for the validation $\chi^2$ to be this large, given that no prediction is (significantly) far away from the true validation points.

juanrojochacon commented 2 years ago

This seems to be a bug - else how come the chi2_val is so huge?

It may also be useful to plot theory as a ratio to data, to better identify these kinds of discrepancies. But I don't see how one can get such a poor chi2, since the data in the validation subset agree rather well with the theory.
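
A minimal sketch of the ratio check suggested above, in plain Python. The function names (`theory_over_data`, `outside_band`), the `n_sigma` threshold, and the numbers are all hypothetical and only illustrate the idea; they are not part of nnusf. The uncertainties are assumed to be uncorrelated.

```python
def theory_over_data(theory, data, sigma):
    """Return (ratio, relative_uncertainty) pairs, one per data point.

    ratio = theory / data, relative_uncertainty = sigma / |data|.
    """
    pairs = []
    for t, d, s in zip(theory, data, sigma):
        if d == 0:
            raise ValueError("cannot form a ratio to a vanishing data point")
        pairs.append((t / d, s / abs(d)))
    return pairs


def outside_band(pairs, n_sigma=2.0):
    """Indices of points whose ratio deviates from 1 by more than n_sigma bands."""
    return [i for i, (r, u) in enumerate(pairs) if abs(r - 1.0) > n_sigma * u]


# Illustrative numbers only (not taken from the fit).
theory = [1.02, 0.98, 1.50]
data = [1.00, 1.00, 1.00]
sigma = [0.05, 0.05, 0.05]

pairs = theory_over_data(theory, data, sigma)
flagged = outside_band(pairs)  # indices of points that stand out in the ratio
```

Plotting the first element of each pair against $Q^2_M$, with the second as an error band around 1, would give the ratio plot; points returned by `outside_band` are the ones worth inspecting first.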

Radonirinaunimi commented 2 years ago

This is admittedly an odd issue. For a different replica, the results are given below. Notice how the $\chi^2_{\rm exp}$ has improved significantly while the $\chi^2_{\rm vl}$ is still (slightly) worse.

| Dataset | Epoch | REP ID | $\rm{N}_{\rm tr}$ | $\chi^2_{\rm tr}$ | $\rm{N}_{\rm vl}$ | $\chi^2_{\rm vl}$ | $\rm{N}_{\rm tot}$ | $\chi^2_{\rm exp}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NUTEV_F2 | 5559 | 1 | 58 | 0.659 | 20 | 11.721 | 78 | 1.893 |

In the same way as before, the 20 data points forming the validation set are shown in the table below:

| $x$ | $Q^2_M~([0, 1])$ |
| --- | --- |
| 0.015 | 0 |
| 0.015 | 0.115 |
| 0.015 | 0.323 |
| 0.080 | 0 |
| 0.125 | 0.115 |
| 0.125 | 0.466 |
| 0.125 | 0.609 |
| 0.125 | 0.739 |
| 0.175 | 0.206 |
| 0.175 | 0.856 |
| 0.225 | 0.466 |
| 0.275 | 0.609 |
| 0.275 | 0.856 |
| 0.35 | 0.466 |
| 0.45 | 0.323 |
| 0.45 | 0.974 |
| 0.55 | 0.609 |
| 0.55 | 0.856 |
| 0.55 | 0.934 |
| 0.65 | 0.974 |

The issue does not appear to be in the computation of the $\chi^2_{\rm vl}$ (nor in how the results are presented in the report) since: (a) this artifact does not affect the other datasets, and (b) some replicas fare better, with somewhat reasonable values of $\chi^2_{\rm vl}$. The problem is that most (over 90%) of the replicas for these datasets are bad.
Based on the two results above, it indeed seems that even a small discrepancy between data and predictions can lead to a very large value of $\chi^2_{\rm vl}$. The question is: how is this possible?
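
One way to approach this question is to break the validation $\chi^2$ down point by point and check whether a single point carries most of it. The sketch below is only an illustration under stated assumptions: it treats the uncertainties as uncorrelated (a real fit would likely use the full covariance matrix), the helper names `chi2_contributions` and `dominant_points` are hypothetical, and the numbers are made up rather than taken from the fit.

```python
def chi2_contributions(data, theory, sigma):
    """Per-point contributions ((D_i - T_i) / sigma_i)^2, assuming no correlations."""
    return [((d - t) / s) ** 2 for d, t, s in zip(data, theory, sigma)]


def dominant_points(contributions, fraction=0.5):
    """Indices of points that alone carry more than `fraction` of the total chi2."""
    total = sum(contributions)
    if total == 0:
        return []
    return [i for i, c in enumerate(contributions) if c / total > fraction]


# Illustrative setup: 20 validation points, each one sigma away from theory...
n_points = 20
theory = [1.0] * n_points
sigma = [0.1] * n_points
data = [1.1] * n_points
# ...except one point whose value is off by a large, made-up amount.
data[7] = 14.5

contribs = chi2_contributions(data, theory, sigma)
chi2_per_point = sum(contribs) / n_points  # chi2 normalised by the number of points
culprits = dominant_points(contribs)       # indices of the dominating points
```

In this toy setup nineteen points agree with the theory at the one-sigma level, yet the normalised $\chi^2$ is of order several hundred because one point contributes almost all of it.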

Radonirinaunimi commented 2 years ago

So the problem was that some data points (in the above example, only a single point) were artificially large because their shifts were very large, owing to a bug that is fixed in #45. Now the results are reasonable (and converge faster):

| Dataset | Epoch | REP ID | $\rm{N}_{\rm tr}$ | $\chi^2_{\rm tr}$ | $\rm{N}_{\rm vl}$ | $\chi^2_{\rm vl}$ | $\rm{N}_{\rm tot}$ | $\chi^2_{\rm exp}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NUTEV_F2 | 567 | 35 | 58 | 1.171 | 20 | 2.7866 | 78 | 2.501 |
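
A sanity check along these lines could catch such cases early. The sketch below assumes that the "shifts" are the differences between the shifted data used in the fit and the central values, measured in units of the per-point uncertainty; that reading, the function name `large_shifts`, the `n_sigma` threshold, and the numbers are all assumptions made for illustration, not the actual fix in #45.

```python
def large_shifts(central, shifted, sigma, n_sigma=5.0):
    """Return (index, shift_in_sigma) for points shifted by more than n_sigma."""
    flagged = []
    for i, (c, p, s) in enumerate(zip(central, shifted, sigma)):
        shift = (p - c) / s  # shift in units of the point's uncertainty
        if abs(shift) > n_sigma:
            flagged.append((i, shift))
    return flagged


# Illustrative numbers only: the third point has an implausibly large shift.
central = [1.0, 2.0, 3.0, 4.0]
sigma = [0.1, 0.2, 0.3, 0.4]
shifted = [1.05, 1.9, 33.0, 4.2]

flagged = large_shifts(central, shifted, sigma)
```

Running such a check on every replica before training would flag the offending point directly, instead of it surfacing only as a large $\chi^2_{\rm vl}$ at the end of the fit.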
