NNPDF / pineappl

PineAPPL is not an extension of APPLgrid
https://nnpdf.github.io/pineappl/
GNU General Public License v3.0

Fix bug in `Grid::convolute_eko` #182

Closed · alecandido closed this 2 years ago

alecandido commented 2 years ago

I reverted https://github.com/N3PDF/pineappl/commit/b7014fdc4ecf9ed3ef15f08cc47ee5761ed0b471, since @cschwan suspected it to be the cause of the issue: https://github.com/N3PDF/pineappl/pull/103#discussion_r984360664.

alecandido commented 2 years ago

@felixhekhorn can you test on your example?

codecov[bot] commented 2 years ago

Codecov Report

Merging #182 (10d5ccd) into master (7851ebc) will increase coverage by 0.01%. The diff coverage is 77.77%.

@@            Coverage Diff             @@
##           master     #182      +/-   ##
==========================================
+ Coverage   89.32%   89.33%   +0.01%     
==========================================
  Files          47       47              
  Lines        6932     6930       -2     
==========================================
- Hits         6192     6191       -1     
+ Misses        740      739       -1     
Flag     Coverage Δ
python   80.64% <ø> (ø)
rust     89.45% <77.77%> (+0.01%) ↑

Flags with carried forward coverage won't be shown.

Impacted Files         Coverage Δ
pineappl/src/grid.rs   87.45% <77.77%> (+0.06%) ↑


felixhekhorn commented 2 years ago

With this branch I get

$ pineappl diff data/grids/208/ATLASWZRAP36PB-ATLAS-arXiv\:1109.5141-Z0_eta34.pineappl.lz4 data/fktables/208/ATLASWZRAP36PB-ATLAS-arXiv\:1109.5141-Z0_eta34.pineappl.lz4 NNPDF40_nlo_as_01180 --ignore-orders --ignore-lumis
LHAPDF 6.4.0 loading /home/felix/local/share/LHAPDF/NNPDF40_nlo_as_01180/NNPDF40_nlo_as_01180_0000.dat
NNPDF40_nlo_as_01180 PDF set, member #0, version 1; LHAPDF ID = 331700
b   x1                  diff              
-+---+---+-----------+-----------+--------
0   0 0.4 1.3463245e5 1.3450066e5 9.798e-4
1 0.4 0.8 1.3347254e5 1.3334425e5 9.621e-4
2 0.8 1.2 1.3077692e5 1.3066370e5 8.665e-4
3 1.2 1.6 1.2699118e5 1.2689062e5 7.925e-4
4 1.6   2 1.2116815e5 1.2109076e5 6.391e-4
5   2 2.4 1.1276176e5 1.1269892e5 5.575e-4
6 2.4 2.8 9.8484125e4 9.8421191e4 6.394e-4
7 2.8 3.6 5.7843820e4 5.7813500e4 5.244e-4

which seems to catch the problem - we're back at permille accuracy!

I suggest, @andreab1997, that you recompute (whenever you have time, of course) the problematic grids in https://github.com/NNPDF/pineko/issues/62 and confirm or deny the conclusion.

I also can partly confirm another comment of @andreab1997: this took ~2h to compute which he observed for the good commits as well (in contrast to "short runs" which resulted in wrong FK tables)

andreab1997 commented 2 years ago

Ok very good, thank you very much for this. As soon as I have time I will run a lot of tests to really be sure, but I would say that most likely we found the issue.

alecandido commented 2 years ago

I also can partly confirm another comment of @andreab1997: this took ~2h to compute which he observed for the good commits as well (in contrast to "short runs" which resulted in wrong FK tables)

@cschwan can we maybe consider an optimization for the grid, in which we reinterpolate all the subgrids onto a common xgrid?

For sure this will bring some loss in accuracy, but it might be an acceptable one (against a considerable speed-up).
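
Just to make the idea concrete, here is a toy numpy sketch (names and numbers are invented, this is not the PineAPPL code): each subgrid's content, known on its own x nodes, would be mapped onto a shared x grid through an interpolation matrix.

import numpy as np

def lagrange_matrix(x_from, x_to):
    # R[i, j] = l_j(x_to[i]), with l_j the Lagrange basis on the nodes x_from
    R = np.ones((len(x_to), len(x_from)))
    for j, xj in enumerate(x_from):
        for k, xk in enumerate(x_from):
            if k != j:
                R[:, j] *= (x_to - xk) / (xj - xk)
    return R

# hypothetical subgrid nodes/values and a common target grid
x_sub = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
f_sub = x_sub**2 * (1.0 - x_sub)              # stand-in for the subgrid content
x_common = np.linspace(0.2, 0.8, 13)

R = lagrange_matrix(x_sub, x_common)
f_common = R @ f_sub                          # subgrid expressed on the common x grid

The accuracy loss is exactly the interpolation error of this map, which is what would have to stay below the permille level we care about.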

cschwan commented 2 years ago

@andreab1997 @felixhekhorn as a guiding principle: all grids converted from APPLgrids are potentially affected by this bug; grids generated with pinefarm (with yadism or MadGraph5_aMC@NLO) should be OK.

cschwan commented 2 years ago

I also can partly confirm another comment of @andreab1997: this took ~2h to compute which he observed for the good commits as well (in contrast to "short runs" which resulted in wrong FK tables)

@cschwan can we maybe consider an optimization for the grid, in which we reinterpolate all the subgrids onto a common xgrid?

For sure this will bring some loss in accuracy, but it might be an acceptable one (against a considerable speed-up).

As far as I understand this won't help (yet), Grid::convolute_eko is simply inherently slow. However, I've written a realistic unit test and with that I'll be able to improve the performance. Monday is a holiday in Germany and by then I'll hopefully have something faster!

alecandido commented 2 years ago

As far as I understand this won't help (yet), Grid::convolute_eko is simply inherently slow.

The idea would be to trash this PR and rely on the assumption that they are all the same. According to @felixhekhorn

this took ~2h to compute which he observed for the good commits as well (in contrast to "short runs" which resulted in wrong FK tables)

so this bug fix is heavily affecting performance. If we can live without it, so much the better.

cschwan commented 2 years ago

As far as I understand this won't help (yet), Grid::convolute_eko is simply inherently slow.

The idea would be to trash this PR and rely on the assumption that they are all the same. According to @felixhekhorn

This I don't understand: why wouldn't we want to merge this pull request?

this took ~2h to compute which he observed for the good commits as well (in contrast to "short runs" which resulted in wrong FK tables)

so this bug fix is heavily affecting performance. If we can live without it, so much the better.

I can always write a program that calculates the wrong results very quickly, but that isn't very helpful, of course.

felixhekhorn commented 2 years ago

As far as I understand this won't help (yet), Grid::convolute_eko is simply inherently slow.

The idea would be to trash this PR and rely on the assumption that they are all the same. According to @felixhekhorn

Hm? If they are coming from the outside (meaning APPLgrid) there is not much we can do, is there? Reinterpolation is an option, but not the short-term answer.

andreab1997 commented 2 years ago

I am now computing all the fktables listed under "Things to check" in https://github.com/NNPDF/pineko/issues/62 with this branch, and I am ticking them off one by one as I check them. For the moment the three that I have computed are fine, so I believe the others will be fine as well. Thank you again!

alecandido commented 2 years ago

This I don't understand: why wouldn't we want to merge this pull request? I can always write a program that calculates the wrong results very quickly, but that isn't very helpful, of course.

Of course we want correct results, but we also want to be fast. That's why I proposed reinterpolation: if we can transform the subgrids such that they all have a common x, we will lose a bit in precision, but for a great gain in performance.

Zaharid commented 2 years ago

fwiw the more I think about it the more I am convinced that what we want at the fit level is a common x grid. The fitting problem becomes a lot easier to reason about then.

Measuring how fine the grid needs to be is relatively easy.
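
As a rough sketch of what such a measurement could look like (toy PDF-like shape and linear interpolation in log x, purely illustrative): tabulate the function on increasingly dense x grids and track the worst-case reconstruction error.

import numpy as np

def max_rel_error(n_nodes, f):
    # tabulate f on n_nodes log-spaced x points and reconstruct it
    # by linear interpolation in log(x)
    x_nodes = np.geomspace(1e-4, 1.0, n_nodes)
    x_test = np.geomspace(2e-4, 0.9, 500)
    approx = np.interp(np.log(x_test), np.log(x_nodes), f(x_nodes))
    return np.max(np.abs(approx - f(x_test)) / f(x_test))

f = lambda x: x**-0.2 * (1.0 - x)**3          # toy PDF-like shape
for n in (20, 40, 80, 160):
    print(n, max_rel_error(n, f))

With the interpolation actually used in the grids the numbers change, but the procedure for choosing the density is the same.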

alecandido commented 2 years ago

fwiw the more I think about it the more I am convinced that what we want at the fit level is a common x grid. The fitting problem becomes a lot easier to reason about then.

I believe we all agree that is definitely a good thing. But that concerns the xgrid in the FkTable, which comes from the EKO output, and it is already guaranteed.

Here we are talking about the EKO input (eko is already reinterpolating, so it allows an input xgrid different from the output one), which has to be matched with the equivalent of an APPLgrid. Unfortunately, it seems that not all the old APPLgrids were using a single xgrid for all the subgrids...

By the way @cschwan, I believe that the current status (this PR) might also be wrong: if subgrid A has a different xgrid than subgrid B, then in order to use a single EKO we have to compute it on the union of the two (since we are not reinterpolating on the fly). But then each subgrid is convoluted only over a subset of the EKO, and this should be wrong: even if the points are a subset of those in the extended grid, the interpolation basis is not a subset of the extended basis, because the support of the polynomials in the smaller basis has to be larger than that of the corresponding polynomials in the union basis. Thus I believe we need reinterpolation, and what we were doing before might have been wrong as well (and now I also doubt what APFELcomb was doing on those grids...)
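
A tiny numerical illustration of the basis argument (toy node sets and a global Lagrange basis for simplicity; the real subgrids use local polynomial interpolation, so this is only a sketch): the basis function attached to one and the same node is a different polynomial on the sub-grid than on the union grid.

import numpy as np

def lagrange_basis(nodes, j, x):
    # Lagrange basis polynomial l_j on `nodes`, evaluated at the points x
    lj = np.ones_like(x)
    for k, xk in enumerate(nodes):
        if k != j:
            lj *= (x - xk) / (nodes[j] - xk)
    return lj

sub   = np.array([0.2, 0.4, 0.6, 0.8])                   # x grid of one subgrid
union = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])    # union with another subgrid

x = np.array([0.25, 0.35, 0.45, 0.55, 0.65, 0.75])       # points off the shared nodes
print(lagrange_basis(sub, 1, x))      # basis function of the node x = 0.4 on the subgrid
print(lagrange_basis(union, 2, x))    # basis function of the same node on the union
# the two rows differ: restricting the union basis to a subset of nodes
# does not reproduce the subgrid's own basis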

cschwan commented 2 years ago

@AleCandido I think what you're probably doing is to calculate the EKO for each bin (the x grids don't change over luminosities and there's only one order) and then merge all EKOs. I think in this case there shouldn't be a problem, since most likely the x grids for each bin are all pairwise disjoint sets. If this isn't the case, merging EKOs for different bins should fail. In other words there's probably no problem.

alecandido commented 2 years ago

@AleCandido I think what you're probably doing is to calculate the EKO for each bin (the x grids don't change over luminosities and there's only one order) and then merge all EKOs. I think in this case there shouldn't be a problem, since most likely the x grids for each bin are all pairwise disjoint sets. If they aren't, the merging of the EKOs for each bin should go wrong.

I don't think so, since we rely on the grid information to construct the EKO, in particular on GridAxes.

So, given a grid, PineAPPL always returns us a single xgrid, not a bin-dependent one. The only thing we can do at this point is to compute one single EKO for the single xgrid given.

alecandido commented 2 years ago

And here is the proof that we are computing the EKO for the union xgrid: https://github.com/N3PDF/pineappl/blob/ad22780918c4452cca0c4b6ddded6a600237f143/pineappl/src/grid.rs#L1369-L1378 (and not the union of the EKOs for each individual xgrid).

The key is, of course, x_grid.extend_from_slice().

felixhekhorn commented 2 years ago

Thus I believe we need reinterpolation, and also what we were doing before might have been wrong (and now I doubt also what was doing APFELcomb on those grids...)

I don't believe this is a problem, because we're already at the level of matrix multiplication (after the grid), so the interpolation only plays a role inside the grid. Afterwards, it is just linear algebra: meaning eko is exact on the grid points (by construction) and so the information from the input to any output point is exact

cschwan commented 2 years ago

Let's say you have a grid with two bins for DIS. The first bin has this x grid:

[0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

and the second bin:

[0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]

which means you calculate the EKO for the union of the x grid points given above. For the first bin that means the evolution is done using a subset (the first set), while the other x grid points contribute zero. I don't see anything wrong with that; the interpolation should still work.
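
In code, with the two toy x grids above (plain numpy, not what PineAPPL does internally): the single EKO is computed for the union, and the first bin's subgrid occupies only its own positions in it, with zeros elsewhere.

import numpy as np

x_bin1 = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
x_bin2 = np.array([0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85])

x_union = np.union1d(x_bin1, x_bin2)        # the x grid the single EKO is computed for

sub1 = np.random.rand(len(x_bin1))          # toy content of the first bin's subgrid
sub1_on_union = np.zeros(len(x_union))
sub1_on_union[np.searchsorted(x_union, x_bin1)] = sub1
# contracting sub1_on_union with an EKO computed on x_union then only picks up
# the entries that correspond to x_bin1; the other points contribute zero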

alecandido commented 2 years ago

I don't believe this is a problem, because we're already at the level of matrix multiplication (after the grid), so the interpolation only plays a role inside the grid. Afterwards, it is just linear algebra: meaning eko is exact on the grid points (by construction) and so the information from the input to any output point is exact

One of the two is exact, this is true: matrix multiplication is always a contraction over two paired, but different, indices:

  1. on the interpolation grid (function evaluated at that point) and
  2. on the interpolation basis (coefficient of that polynomial)

This only works if the index over the EKO is of kind 1. Can you convince me of this, @felixhekhorn?
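
To make the two kinds of index concrete, here is a toy numpy sketch (everything is invented for illustration; the real objects also carry flavour, Q2 and perturbative-order indices): the grid holds coefficients over an interpolation basis in x, the PDF enters through its values at x nodes, and absorbing the operator into the grid gives the FK-table-like object.

import numpy as np

x_grid = np.array([0.1, 0.3, 0.5, 0.7, 0.9])      # nodes where PDF values are needed (kind 1)
x_in   = np.array([0.05, 0.25, 0.45, 0.65, 0.85]) # nodes of the input (FK-table) x grid

grid = np.array([0.4, 1.2, 0.8, 0.3, 0.1])        # basis coefficients (kind 2), toy numbers

def lagrange_matrix(nodes, x):
    # L[i, j] = l_j(x[i]) for the Lagrange basis l_j on `nodes`
    L = np.ones((len(x), len(nodes)))
    for j, xj in enumerate(nodes):
        for k, xk in enumerate(nodes):
            if k != j:
                L[:, j] *= (x - xk) / (xj - xk)
    return L

# toy "EKO": just interpolation from x_in to x_grid, no actual evolution
eko = lagrange_matrix(x_in, x_grid)               # shape (len(x_grid), len(x_in))

pdf_in = x_in**-0.1 * (1.0 - x_in)**2             # PDF values at the input nodes (kind 1)

prediction_grid = grid @ (eko @ pdf_in)           # grid coefficients × PDF values on x_grid
fk = eko.T @ grid                                 # absorb the operator: coefficients on x_in
prediction_fk = fk @ pdf_in

print(prediction_grid, prediction_fk)             # identical up to round-off

The contraction is only meaningful when a coefficient index (kind 2) is paired with a value index (kind 1); whether the union-xgrid EKO respects that pairing for every subgrid is exactly the question here.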

alecandido commented 2 years ago

you calculate the EKO for the union of the x grid points given above. For the first bin that means the evolution is done using a subset (the first set), while the other x grid points contribute zero

@cschwan the problem is with one of the EKO indices (input or output, it is always confusing to me). Over one of the two, the EKO is a set of coefficients over a basis of functions: if you compute the coefficients of a given function over an extended basis, but then only use a subset of them, you are effectively trashing the others, i.e. declaring that they should be multiplied by zero. But that is not true: if the yadism function is evaluated at a limited subset of points, it doesn't mean that it is zero elsewhere.

cschwan commented 2 years ago

@AleCandido if we use data/grids/208/ATLASWZRAP36PB-ATLAS-arXiv:1109.5141-Z0_eta34.pineappl.lz4, delete everything but, say, the first bin, and then evolve the grid with one bin using

  1. the already computed EKO and compare that against
  2. the result obtained using a newly-computed EKO only for the first bin

we should be able to see what happens and how big the difference is.

andreab1997 commented 2 years ago

Sorry to step into the discussion, but I just wanted to add a detail: it seems that for the fktables in the list "Things with a conjectured solution" the conjectured solution was wrong, because with this branch I got the correct result.

cschwan commented 2 years ago

@andreab1997 so if I understand you correctly, in every case you've seen so far the differences between the grids and the corresponding FK tables are small? That would be reassuring.

alecandido commented 2 years ago

we should be able to see what happens and how big the difference is.

Let me think for a moment, and let's wait for @felixhekhorn's reply.

In any case, I don't believe the exercise to be extremely relevant: either it is correct, or it is wrong. "Empirically correct" tells us little, because I know that if it is wrong I can make the result arbitrarily different, so we would be relying on the specific xgrids chosen (and they are not chosen by us). Moreover, the definition of "empirically correct" is based on agreement with APFELcomb, but we may well have perfect agreement and simply both be wrong in the same way.

cschwan commented 2 years ago

@AleCandido we had a similar discussion here: https://github.com/N3PDF/pineappl/issues/151.

andreab1997 commented 2 years ago

@andreab1997 so if I understand you correctly, in every case you've seen so far the differences between the grids and the corresponding FK tables are small? That would be reassuring.

Yes exactly, most of the time under the permille level. Sometimes the error is slightly above 1 permille, but never more than that.

alecandido commented 2 years ago

@AleCandido we had a similar discussion here: #151.

In that case, we were considering different grids. Now the problem is within a single grid. But I agree it is a similar problem.

cschwan commented 2 years ago

In any case, I don't believe the exercise to be extremely relevant: either it is correct, or it is wrong. "Empirically correct" tells us little, because I know that if it is wrong I can make the result arbitrarily different, so we would be relying on the specific xgrids chosen (and they are not chosen by us). Moreover, the definition of "empirically correct" is based on agreement with APFELcomb, but we may well have perfect agreement and simply both be wrong in the same way.

If in all the cases we've seen so far the differences are small, then I believe - which is a rather weak statement, I admit - that the interpolation can't be too far off, and at the very least we probably don't have to hold off producing fits with the new pipeline.

In any case I agree that we should test it thoroughly, and being correct is paramount!

andreab1997 commented 2 years ago

Since we are now pretty sure that this was the source of the bug, @cschwan are you going to release a new version? Or do you want to wait to improve performance first? For me it makes essentially no difference (although it is obviously simpler with a release).

cschwan commented 2 years ago

@andreab1997 I'll probably release a new version at the end of today.

@AleCandido @felixhekhorn given that we fixed the initial problem, are you OK with merging this pull request?

alecandido commented 2 years ago

@AleCandido @felixhekhorn given that we fixed the initial problem, are you OK with merging this pull request?

Yes, let's merge this one and go back to the former status. We can keep discussing the rest separately.

cschwan commented 2 years ago

@andreab1997 I've just released v0.5.6 which includes this fix. Crates and packages should become available soon!

andreab1997 commented 2 years ago

@andreab1997 I've just released v0.5.6 which includes this fix. Crates and packages should become available soon!

Ok thank you very much!