Detect NaNs in FK tables

felixhekhorn commented 2 months ago

I suggest to act here instead of eko - specifically https://github.com/NNPDF/pineko/blob/e3aeba27c1dd328bebe4bd52c90e107a7cd9a1f2/src/pineko/evolve.py#L332-L335 https://github.com/NNPDF/pineko/blob/e3aeba27c1dd328bebe4bd52c90e107a7cd9a1f2/src/pineko/comparator.py#L47 then we can check

if all elements in after are NaN we can raise an error
if some elements in after are NaN we can print a warning

What do you think @giacomomagni @Radonirinaunimi ?

Radonirinaunimi commented 2 months ago

Yes, that solution seems sensible to me.

giacomomagni commented 2 months ago

To do the proper fix, we need also to propose a solution: i.e. replace the Nans with zeros or similar, because otherwise we don't have an easy way out. The number of point is fixed by the commondata and the pinecard, but then one need to act on both of them. Unfortunately an eko with only zeros is also not allowed, so my trick was to replace them with an identity and alpha_s with 0.0, but I agree this is just a dirty trick

Radonirinaunimi commented 2 months ago

To do the proper fix, we need also to propose a solution: i.e. replace the Nans with zeros or similar, because otherwise we don't have an easy way out. The number of point is fixed by the commondata and the pinecard, but then one need to act on both of them.

But this we can always act at the level of vp by essentially adding some filter rules. So, as long as the NaN are only in the places where we expect them to be, then it will be fine at least for vp. And I also agree that NaNs are valid floating types so I'd rather keep that with possible.

giacomomagni commented 2 months ago

but as you saw Nans where propagated also to other values, when doing the convolution right? Otherwise we would not have had this issue at all

Radonirinaunimi commented 2 months ago

but as you saw Nans where propagated also to other values, when doing the convolution right? Otherwise we would not have had this issue at all

Yes, I thought you were arguing in favor of replacing the NaNs after they are only fixed to appear in the expected places. But I agree that the current behavior which is - preventing the NaNs to propagate into all the bins should be fixed.

felixhekhorn commented 2 months ago

preventing the NaNs to propagate into all the bins should be fixed.

I still don't understand how that may happen - except by a dynamic scale choice and then this is expected. Can you please clarify?

To do the proper fix, we need also to propose a solution: i.e. replace the Nans with zeros or similar, because otherwise we don't have an easy way out.

mh? why would we do that? they are non-sense and one should not use them in any true way

The number of point is fixed by the commondata and the pinecard, but then one need to act on both of them.

a) we can also act at vp level and b) that's way some bins might well be NaN - they have simply to be ignored (e.g. all the low-Q2 DIS data)

Unfortunately an eko with only zeros is also not allowed, so my trick was to replace them with an identity and alpha_s with 0.0, but I agree this is just a dirty trick

the way that I proposed above is after ekos, so you can not do something like this. Moreover, we don't want to do such a thing - as already said above: they are non-sense and have to stay like this. You have to cut points and not put them to zero.

giacomomagni commented 2 months ago

sure I agree that you have to cut points. Not sure if vp supports to load Fktable that do not match the size of the commondata. If this is possible then the simplest solution is just to drop the point with nans, here and construct a smaller FkTable.

I don't know the Nans are propagated. Here you have the grid, eko and fktable for a test dataset SMCSX_NC_24GEV_MUD_G1:

test_nans.zip

This is what I get after the convolution with a pdf. I see only the fist bin contain a Nan eko.

LHAPDF 6.4.0 loading /project/theorie/gmagni/miniconda3/envs/nnpdf/share/LHAPDF/NNPDFpol10_100/NNPDFpol10_100_0000.dat
NNPDFpol10_100 PDF set, member #0, version 1; LHAPDF ID = 250000
    FKTable      Grid
0       NaN -0.000200
1       NaN -0.000047
2       NaN  0.000115
3       NaN  0.000169
4       NaN  0.000318
5       NaN  0.000714
6       NaN  0.001276
7       NaN  0.000958
8       NaN -0.000510
9       NaN -0.000676
10      NaN  0.001414
11      NaN  0.004291
12      NaN  0.008647
13      NaN  0.017348
14      NaN  0.029094

alecandido commented 2 months ago

I agree with @felixhekhorn: nan should be left as they are. They are meaningful, as values that have not been computed properly. They should not be used in any other way.

If they propagate, it means there is something downstream that needs a value that was impossible to compute. So, the dependent value should be impossible to compute as well, because of a missing dependency. The algebra of nan is meaningful. Circumventing it, replacing values, is harmful.

About errors, it could be fine, but I'd raise them as late as possible. The only trade-off is saving useless operations, when you know in advance they will be useless. I'd avoid warnings everywhere: better return additional values, that could be understood by another program. Warnings are written for humans, and frequently (if not always) ignored. They become just junk in your output... (it's different if you're carefully collecting logs, but then they should go through your logging machinery, not a separate warnings' mechanism)

The true solution to the problem here is tracing and understanding nans propagation, and why so many values depend on nans generated at low scale.

Raising errors is just a shortcut to avoid unneeded computation. But otherwise keep going with nans.

alecandido commented 2 months ago

sure I agree that you have to cut points. Not sure if vp supports to load Fktable that do not match the size of the commondata. If this is possible then the simplest solution is just to drop the point with nans, here and construct a smaller FkTable.

I would that if it supports that, that's a bug, not a feature. Arrays shapes should match, not silently cut dimensions.

Sorry to repeat myself: nans are not an infectious disease, that you should contain by adding barriers. They are tracing dependency in your calculation. If a whatever dependency is missing, the output can not be computed. And that's meaningful - you should change something in your calculation.

Radonirinaunimi commented 2 months ago

To be more explicit, we should do the following:

find out why the NaNs are propagated to all the bins and provide a fix. Then, the resulting FK table will be one such that it only contains NaNs in the place where they are expected to be. In this way, the FK table and the commondata will still contain the exact same binning.
add filter rules to vp to remove bins that contain NaNs to make sure that they are cut away (if even if they will pass kinematic cuts).

and we should not do the following:

Removing manually the bins that yield NaNs in the FK table that'll lead into a mismatch between data and FK table
not doing anything about the NaNs

This is in line with what @alecandido is arguing. I can help with debugging why NaNs are propagating.

giacomomagni commented 2 months ago

Raising errors is just a shortcut to avoid unneeded computation.

This is indeed partly the thing that is missing :)

felixhekhorn commented 2 months ago

Here you have the grid, eko and fktable for a test dataset SMCSX_NC_24GEV_MUD_G1:

I have problems:

I can not get the grid predictions:

$ pineappl convolute grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 NNPDFpol11_100
LHAPDF 6.4.0 loading /home/felix/local/share/LHAPDF/NNPDFpol11_100/NNPDFpol11_100_0000.dat
NNPDFpol11_100 PDF set, member #0, version 1; LHAPDF ID = 251000
b     Q2            x           2xg1    
[GeV^2]        []            []     
--+----+----+-------+-------+-----------
0 0.03 0.03 0.00011 0.00011 0.0000000e0
1 0.06 0.06 0.00022 0.00022 0.0000000e0
2  0.1  0.1 0.00039 0.00039 0.0000000e0
3 0.17 0.17 0.00063 0.00063 0.0000000e0
4 0.26 0.26   0.001   0.001 0.0000000e0
5  0.4  0.4  0.0016  0.0016 0.0000000e0
6 0.63 0.63  0.0025  0.0025 0.0000000e0
7 1.09 1.09  0.0043  0.0043 0.0000000e0
8 1.85 1.85  0.0078  0.0078 0.0000000e0
9 3.15 3.15  0.0143  0.0143 0.0000000e0
10 5.09 5.09  0.0245  0.0245 0.0000000e0
11    7    7  0.0346  0.0346 0.0000000e0
12  9.6  9.6  0.0487  0.0487 0.0000000e0
13 14.7 14.7   0.077   0.077 0.0000000e0
14 22.9 22.9   0.121   0.121 0.0000000e0

I can not reproduce the FK table:

$ pineko convolute grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 eko_SMCSX_NC_24GEV_MUD_G1.tar myfk.pineappl.lz4 3 0 --pdf NNPDFpol11_100
┌───────────────┐
│ Computing ... │
└───────────────┘
grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4
+ eko_SMCSX_NC_24GEV_MUD_G1.tar
= myfk.pineappl.lz4
with max_as=3, max_al=0, xir=1.0, xif=1.0 
thread '<unnamed>' panicked at pineappl_py/src/grid.rs:609:14:
Nothing returned from evolution.: EvolutionFailure("no non-zero operator found; result would be an empty FkTable")

and I'm not even failing because of the operator, but because the pids are somehow messed up as the error is raised here in PineAPPL.

For the FK table you that put in the tar I indeed get NaNs all over

used versions: pineko=master, pineappl==0.7.3

two comments for @cschwan (maybe) unrelated to the NaN issue discussed here:

after 1. I thought okay maybe the grid is empty and optimize should dissolve the grid, but no - instead it made the grid bigger

$ pineappl write --optimize grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 gridop.pineappl.lz4
$ ls -la grid*
-rw-rw-r-- 1 felix felix 46530 18. Jul 13:36 gridop.pineappl.lz4
-rw-r--r-- 1 felix felix 42828 18. Jul 12:17 grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4

re 2: are you sure you want to raise an error here and not just return an empty FK table? (the same way grids might e.g. by misconfiguration be empty)

giacomomagni commented 2 months ago

@felixhekhorn have you tried to load the grid with the newer pineappl version and the related pineko PR, as both the grid and the fktable follow the new convention and are polarized ?

felixhekhorn commented 2 months ago

@felixhekhorn have you tried to load the grid with the newer pineappl version and the related pineko PR, as both the grid and the fktable follow the new convention and are polarized ?

ok, with #181 (which implies pineappl==0.8.1) I can reproduce the NaNs ... this should not happen. I'll try to dig a bit ...

cschwan commented 2 months ago

test_nans.zip

From what I can gather the EKO with the squared factorization scale 0.03 for the x-grid values with index 0 is NaN for all PIDs.

felixhekhorn commented 2 months ago

test_nans.zip

From what I can gather the EKO with the squared factorization scale 0.03 for the x-grid values with index 0 is NaN for all PIDs.

yes, and that is fine - the problem is: all bins turn to NaN and not only that one

cschwan commented 2 months ago

two comments for @cschwan (maybe) unrelated to the NaN issue discussed here:

* after 1. I thought okay maybe the grid is empty and optimize should dissolve the grid, but no - instead it made the grid _bigger_

$ pineappl write --optimize grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 gridop.pineappl.lz4
$ ls -la grid*
-rw-rw-r-- 1 felix felix 46530 18. Jul 13:36 gridop.pineappl.lz4
-rw-r--r-- 1 felix felix 42828 18. Jul 12:17 grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4

This seems to be the lz4_flex crate performing worse than the lz4 binary and I suppose this is the missing high compression: https://github.com/pseitz/lz4_flex?tab=readme-ov-file#todo. Try to decompress the the optimized grid using lz4 -d ... and compress again with lz4 --best .... You'll probably get a similar result to what you started with.

* re 2: are you sure you want to raise an error here and not just return an empty FK table? (the same way grids might e.g. by misconfiguration be empty)

We can remove the error, but usually an empty FK-table is a sign of a problem.

cschwan commented 2 months ago

test_nans.zip

From what I can gather the EKO with the squared factorization scale 0.03 for the x-grid values with index 0 is NaN for all PIDs.

yes, and that is fine - the problem is: all bins turn to NaN and not only that one

That's indeed a problem if there a NaNs, and it is fixed in this commit https://github.com/NNPDF/pineappl/commit/9ca3022ef8288b9360584ced5d78f6701f7667e9. A nice 'side-effect' of that commit is that ATLAS_1JET_8TEV_R06 now takes less than 5 minutes to evolve instead of 2-3 hours. Apparently we did a lot of linear algebra with zeros :scream: :face_holding_back_tears:.

felixhekhorn commented 2 months ago

Thanks for the fix @cschwan - unfortunately the next problem is just waiting for us :see_no_evil:

$ pineko convolve grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 eko_SMCSX_NC_24GEV_MUD_G1.tar myfk.pineappl.lz4 3 0 --pdf NNPDFpol11_100
┌───────────────┐
│ Computing ... │
└───────────────┘
   grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4
 + eko_SMCSX_NC_24GEV_MUD_G1.tar
 = myfk.pineappl.lz4
 with max_as=3, max_al=0, xir=1.0, xif=1.0 
/home/felix/Physik/N3PDF/FK/test/test_nans/test_nans/env/lib/python3.12/site-packages/eko/io/manipulate.py:39: UserWarning: The new grid is close to the current one
  warnings.warn("The new grid is close to the current one")
Optimizing for Nf6Ind
LHAPDF 6.5.4 loading /home/felix/local/share/LHAPDF/NNPDFpol11_100/NNPDFpol11_100_0000.dat
NNPDFpol11_100 PDF set, member #0, version 1; LHAPDF ID = 251000
    Q2 left  Q2 right   x left  x right  PineAPPL   FkTable  permille_error
0      0.03      0.03  0.00011  0.00011  0.000075       NaN             NaN
1      0.06      0.06  0.00022  0.00022  0.000750 -3.108669   -4.145571e+06
2      0.10      0.10  0.00039  0.00039  0.001188  0.012425    9.455017e+03
3      0.17      0.17  0.00063  0.00063  0.001198  0.014155    1.081314e+04
4      0.26      0.26  0.00100  0.00100  0.001005  0.005172    4.144236e+03
5      0.40      0.40  0.00160  0.00160  0.000756  0.002525    2.338793e+03
6      0.63      0.63  0.00250  0.00250  0.000622  0.002132    2.425944e+03
7      1.09      1.09  0.00430  0.00430 -0.000267  0.001492   -6.591994e+03
8      1.85      1.85  0.00780  0.00780 -0.001109  0.000840   -1.757323e+03
9      3.15      3.15  0.01430  0.01430 -0.000820  0.000544   -1.663877e+03
10     5.09      5.09  0.02450  0.02450  0.001257  0.000623   -5.043569e+02
11     7.00      7.00  0.03460  0.03460  0.004040  0.000777   -8.075818e+02
12     9.60      9.60  0.04870  0.04870  0.008376  0.000965   -8.848183e+02
13    14.70     14.70  0.07700  0.07700  0.017355  0.001228   -9.292186e+02
14    22.90     22.90  0.12100  0.12100  0.029710  0.001441   -9.514848e+02

as said many time: the small scale is crazy either way - but for large scales grid and FK should match!

Asking the obvious question: the commit looks fine - but are we sure, we keep all significant numbers? 3h -> 5min seems to good to be true ...

cschwan commented 2 months ago

For me it works OK for both NNPDFpol10_100 and NNPDFpol11_100:

$ pineappl evolve test_nans/grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 test_nans/eko_SMCSX_NC_24GEV_MUD_G1.tar deleteme2.pineappl NNPDFpol10_100 
b      Grid         FkTable      rel. diff
--+-------------+-------------+-------------
0  -1.9952907e-4           NaN           NaN
1  -4.7371111e-5 -2.4636624e-1   5.1997698e3
2   1.1477011e-4  1.0576968e-3   8.2157867e0
3   1.6858672e-4 -4.7133169e-3  -2.8957818e1
4   3.1849340e-4 -2.4589258e-3  -8.7204921e0
5   7.1417450e-4  9.7571016e-5 -8.6337931e-1
6   1.2756512e-3  1.4496402e-3  1.3639225e-1
7   9.5754601e-4  9.6154852e-4  4.1799681e-3
8  -5.0967539e-4 -5.0429946e-4 -1.0547753e-2
9  -6.7556878e-4 -6.9313368e-4  2.6000162e-2
10  1.4143775e-3  1.3610328e-3 -3.7716014e-2
11  4.2909944e-3  4.2142684e-3 -1.7880709e-2
12  8.6468737e-3  8.5545595e-3 -1.0676018e-2
13  1.7347675e-2  1.7235750e-2 -6.4519096e-3
14  2.9094309e-2  2.9052034e-2 -1.4530270e-3
Error: grids are different

$ pineappl evolve test_nans/grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 test_nans/eko_SMCSX_NC_24GEV_MUD_G1.tar deleteme2.pineappl NNPDFpol11_100 
b      Grid         FkTable      rel. diff
--+-------------+-------------+-------------
0   7.4683807e-5           NaN           NaN
1   7.5005813e-4 -7.0928484e-1  -9.4663982e2
2   1.1884434e-3  1.9886700e-2   1.5733401e1
3   1.1982348e-3  1.1197249e-2   8.3447872e0
4   1.0054592e-3  3.9899255e-3   2.9682620e0
5   7.5620092e-4  1.2555767e-3  6.6037454e-1
6   6.2238374e-4  6.5659309e-4  5.4965044e-2
7  -2.6685384e-4 -2.6695924e-4  3.9495481e-4
8  -1.1094437e-3 -1.1138484e-3  3.9701437e-3
9  -8.1966310e-4 -8.3076021e-4  1.3538628e-2
10  1.2570481e-3  1.2289553e-3 -2.2348249e-2
11  4.0403197e-3  3.9898041e-3 -1.2502876e-2
12  8.3759581e-3  8.2973138e-3 -9.3892901e-3
13  1.7355381e-2  1.7220139e-2 -7.7925238e-3
14  2.9709758e-2  2.9615224e-2 -3.1819107e-3
Error: grids are different

cschwan commented 2 months ago

Asking the obvious question: the commit looks fine - but are we sure, we keep all significant numbers? 3h -> 5min seems to good to be true ...

I'm quite positive about this, because the code is unit tested and all code paths are covered. I also have an understanding of where these zeros come from: for the ATLAS single-jet inclusive measurement there are 888 slices, but apparently each slice is only used in a single bin and so the code before the commit mentioned above did lots of unnecessary matrix multiplications that resulted in zeros. After the commit these zeros are skipped and therefore it's really fast; but actually it's not that the new code is fast, it's that the old code was extremely slow.

felixhekhorn commented 2 months ago

I'm giving up for today, again I can not reproduce the numbers :disappointed: ... I can not get the PineAPPL CLI to not return NaN (it seems not master - why?), nor can I understand what pineko is doing different then PineAPPL. @giacomomagni can you please check if you can either confirm my numbers or @cschwan (i.e. if it is a pineko problem) ... note our grid predictions match

giacomomagni commented 2 months ago

Hi @felixhekhorn and @cschwan. Thanks a lot for fixing pineappl !!! In the end it was good to push for a solution of this issue. @cschwan may I ask for a new pineappl version including that fix ?

@felixhekhorn the eko I provided you it's a single theory of numerical FONLL, so I think differences are genuine. I'll double check that the final fktables are good, comparing with the old ones (ie theory 823)

cschwan commented 2 months ago

Hi @felixhekhorn and @cschwan. Thanks a lot for fixing pineappl !!! In the end it was good to push for a solution of this issue. @cschwan may I ask for a new pineappl version including that fix ?

Absolutely! Here you go: https://pypi.org/project/pineappl/0.8.2/.

felixhekhorn commented 2 months ago

@felixhekhorn the eko I provided you it's a single theory of numerical FONLL, so I think differences are genuine. I'll double check that the final fktables are good, comparing with the old ones (ie theory 823)

I guess @cschwan and me we were using the same eko, so I wonder still what is the difference ...

giacomomagni commented 2 months ago

Ahh okay I see, on main it works, so the problem is related to how we are using the new evolve method. (Btw pineko convolve is not yet able to deal with two ekos, I need to fix it)

giacomomagni commented 2 months ago

Okay, pineko was not using the correct Pids. Fixed in 6812bdd

NNPDF / pineko

Detect NaNs in FK tables #185