Closed: felixhekhorn closed this issue 2 months ago.
Yes, that solution seems sensible to me.
To do the proper fix, we also need to propose a solution: i.e. replace the NaNs with zeros or similar, because otherwise we don't have an easy way out.
The number of points is fixed by the commondata and the pinecard, but then one needs to act on both of them.
Unfortunately an eko containing only zeros is also not allowed, so my trick was to replace them with an identity and alpha_s with 0.0, but I agree this is just a dirty trick.
> To do the proper fix, we also need to propose a solution: i.e. replace the NaNs with zeros or similar, because otherwise we don't have an easy way out. The number of points is fixed by the commondata and the pinecard, but then one needs to act on both of them.

For this we can always act at the level of vp, by essentially adding some filter rules. So, as long as the NaNs are only in the places where we expect them to be, it will be fine, at least for vp. And I also agree that NaNs are valid floating-point values, so I'd rather keep them if possible.
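(For illustration, a minimal sketch of what such a vp-level filter rule could amount to; the array and bin values below are made up, and this is not actual validphys code:)

```python
import numpy as np

# Hypothetical predictions for one dataset, aligned with the commondata bins;
# NaN marks a bin that could not be computed (e.g. below the minimum EKO scale).
predictions = np.array([np.nan, -0.000047, 0.000115, 0.000169])

# The "filter rule" then simply masks out the NaN bins before any further use.
mask = ~np.isnan(predictions)
kept_bins = np.flatnonzero(mask)       # indices of the bins that survive the cut
kept_predictions = predictions[mask]   # finite predictions only

print(kept_bins, kept_predictions)
```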
But as you saw, NaNs were also propagated to other values when doing the convolution, right? Otherwise we would not have had this issue at all.
> But as you saw, NaNs were also propagated to other values when doing the convolution, right? Otherwise we would not have had this issue at all.

Yes, I thought you were arguing in favor of replacing the NaNs once they are fixed to appear only in the expected places. But I agree that the current behavior should be fixed, i.e. the NaNs should be prevented from propagating into all the bins.
> the NaNs should be prevented from propagating into all the bins
I still don't understand how that may happen - except by a dynamic scale choice and then this is expected. Can you please clarify?
> To do the proper fix, we also need to propose a solution: i.e. replace the NaNs with zeros or similar, because otherwise we don't have an easy way out.

mh? why would we do that? They are nonsense and one should not use them in any real way.

> The number of points is fixed by the commondata and the pinecard, but then one needs to act on both of them.

a) we can also act at the vp level, and b) either way, some bins might well be NaN - they simply have to be ignored (e.g. all the low-Q2 DIS data).
> Unfortunately an eko containing only zeros is also not allowed, so my trick was to replace them with an identity and alpha_s with 0.0, but I agree this is just a dirty trick.

The way that I proposed above acts after the ekos, so you cannot do something like this. Moreover, we don't want to do such a thing - as already said above, they are nonsense and have to stay like this. You have to cut points, not set them to zero.
Sure, I agree that you have to cut points. I'm not sure whether vp supports loading FkTables that do not match the size of the commondata. If this is possible, then the simplest solution is just to drop the points with NaNs here and construct a smaller FkTable.
I don't know how the NaNs are propagated. Here you have the grid, eko and fktable for a test dataset SMCSX_NC_24GEV_MUD_G1:
This is what I get after the convolution with a PDF. I see only the first bin contains a NaN in the eko.
```
LHAPDF 6.4.0 loading /project/theorie/gmagni/miniconda3/envs/nnpdf/share/LHAPDF/NNPDFpol10_100/NNPDFpol10_100_0000.dat
NNPDFpol10_100 PDF set, member #0, version 1; LHAPDF ID = 250000
    FKTable      Grid
0       NaN -0.000200
1       NaN -0.000047
2       NaN  0.000115
3       NaN  0.000169
4       NaN  0.000318
5       NaN  0.000714
6       NaN  0.001276
7       NaN  0.000958
8       NaN -0.000510
9       NaN -0.000676
10      NaN  0.001414
11      NaN  0.004291
12      NaN  0.008647
13      NaN  0.017348
14      NaN  0.029094
```
I agree with @felixhekhorn: NaNs should be left as they are. They are meaningful, as values that have not been computed properly. They should not be used in any other way.
If they propagate, it means there is something downstream that needs a value that was impossible to compute. So, the dependent value should be impossible to compute as well, because of a missing dependency.
The algebra of NaN is meaningful. Circumventing it by replacing values is harmful.
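(As a trivial numpy illustration of that algebra, not project code:)

```python
import numpy as np

fk_row = np.array([np.nan, 0.2, 0.3])  # one FK-table row with a missing entry
pdf = np.array([0.5, 0.5, 0.5])

# The convolution for this bin depends on the missing entry, so it is NaN itself:
print(fk_row @ pdf)             # nan
# Only by explicitly ignoring the NaN does one get a (misleading) number back:
print(np.nansum(fk_row * pdf))  # 0.25
```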
About errors, it could be fine, but I'd raise them as late as possible. The only trade-off is saving useless operations, when you know in advance they will be useless. I'd avoid warnings everywhere: better to return additional values that can be understood by another program. Warnings are written for humans, and frequently (if not always) ignored. They just become junk in your output... (it's different if you're carefully collecting logs, but then they should go through your logging machinery, not a separate warnings mechanism).
The true solution to the problem here is tracing and understanding the NaNs' propagation, and why so many values depend on NaNs generated at low scale.
Raising errors is just a shortcut to avoid unneeded computation. But otherwise keep going with NaNs.
> Sure, I agree that you have to cut points. I'm not sure whether vp supports loading FkTables that do not match the size of the commondata. If this is possible, then the simplest solution is just to drop the points with NaNs here and construct a smaller FkTable.

I would say that if it supports that, it's a bug, not a feature. Array shapes should match, not be silently cut.
Sorry to repeat myself: NaNs are not an infectious disease that you should contain by adding barriers.
They trace the dependencies in your calculation. If any dependency is missing, the output cannot be computed. And that's meaningful - you should change something in your calculation.
To be more explicit, we should do the following:

- understand why the NaNs are propagated to all the bins and provide a fix. Then, the resulting FK table will be one such that it only contains NaNs in the places where they are expected to be. In this way, the FK table and the commondata will still contain the exact same binning.
- add filter rules in vp for the NaNs to make sure that they are cut away (even if they would pass the kinematic cuts).

and we should not do the following:

- remove the NaNs from the FK table, as that'll lead to a mismatch between data and FK table
- replace the NaNs with zeros or similar

This is in line with what @alecandido is arguing. I can help with debugging why the NaNs are propagating.
> Raising errors is just a shortcut to avoid unneeded computation.

This is indeed partly the thing that is missing :)
> Here you have the grid, eko and fktable for a test dataset SMCSX_NC_24GEV_MUD_G1:

I have problems:

1. I can not get the grid predictions:
```
$ pineappl convolute grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 NNPDFpol11_100
LHAPDF 6.4.0 loading /home/felix/local/share/LHAPDF/NNPDFpol11_100/NNPDFpol11_100_0000.dat
NNPDFpol11_100 PDF set, member #0, version 1; LHAPDF ID = 251000
b Q2 x 2xg1
 [GeV^2] [] []
--+----+----+-------+-------+-----------
0 0.03 0.03 0.00011 0.00011 0.0000000e0
1 0.06 0.06 0.00022 0.00022 0.0000000e0
2 0.1 0.1 0.00039 0.00039 0.0000000e0
3 0.17 0.17 0.00063 0.00063 0.0000000e0
4 0.26 0.26 0.001 0.001 0.0000000e0
5 0.4 0.4 0.0016 0.0016 0.0000000e0
6 0.63 0.63 0.0025 0.0025 0.0000000e0
7 1.09 1.09 0.0043 0.0043 0.0000000e0
8 1.85 1.85 0.0078 0.0078 0.0000000e0
9 3.15 3.15 0.0143 0.0143 0.0000000e0
10 5.09 5.09 0.0245 0.0245 0.0000000e0
11 7 7 0.0346 0.0346 0.0000000e0
12 9.6 9.6 0.0487 0.0487 0.0000000e0
13 14.7 14.7 0.077 0.077 0.0000000e0
14 22.9 22.9 0.121 0.121 0.0000000e0
```
2. I can not reproduce the FK table:
```
$ pineko convolute grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 eko_SMCSX_NC_24GEV_MUD_G1.tar myfk.pineappl.lz4 3 0 --pdf NNPDFpol11_100
┌───────────────┐
│ Computing ... │
└───────────────┘
grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4
+ eko_SMCSX_NC_24GEV_MUD_G1.tar
= myfk.pineappl.lz4
with max_as=3, max_al=0, xir=1.0, xif=1.0
thread '<unnamed>' panicked at pineappl_py/src/grid.rs:609:14:
Nothing returned from evolution.: EvolutionFailure("no non-zero operator found; result would be an empty FkTable")
```
and I'm not even failing because of the operator, but because the PIDs are somehow messed up, as the error is raised here in PineAPPL.
For the FK table that you put in the tar I indeed get NaNs all over.
Used versions: pineko=master, pineappl==0.7.3
two comments for @cschwan, (maybe) unrelated to the NaN issue discussed here:

* after 1. I thought okay, maybe the grid is empty and optimize should dissolve the grid, but no - instead it made the grid _bigger_

  ```
  $ pineappl write --optimize grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 gridop.pineappl.lz4
  $ ls -la grid*
  -rw-rw-r-- 1 felix felix 46530 18. Jul 13:36 gridop.pineappl.lz4
  -rw-r--r-- 1 felix felix 42828 18. Jul 12:17 grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4
  ```

* re 2: are you sure you want to raise an error here and not just return an empty FK table? (the same way grids might e.g. by misconfiguration be empty)
@felixhekhorn have you tried to load the grid with the newer pineappl version and the related pineko PR, as both the grid and the fktable follow the new convention and are polarized?
> @felixhekhorn have you tried to load the grid with the newer pineappl version and the related pineko PR, as both the grid and the fktable follow the new convention and are polarized?

ok, with #181 (which implies pineappl==0.8.1) I can reproduce the NaNs ... this should not happen. I'll try to dig a bit ...
From what I can gather, the EKO with the squared factorization scale 0.03 for the x-grid value with index 0 is NaN for all PIDs.
> From what I can gather, the EKO with the squared factorization scale 0.03 for the x-grid value with index 0 is NaN for all PIDs.

yes, and that is fine - the problem is: all bins turn to NaN and not only that one
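(For illustration, a small numpy sketch of how one can check that the NaNs are confined to the expected entries; the operator shape and layout here are made up, not the actual eko layout:)

```python
import numpy as np

# Toy "operator" indexed as (x_out, pid_out, x_in, pid_in); shape is illustrative only.
op = np.random.default_rng(0).random((15, 3, 15, 3))
op[0] = np.nan  # mimic the broken entries for the first x-grid point

# For each output x-grid point, check whether any of its entries are NaN:
nan_per_x = np.isnan(op).any(axis=(1, 2, 3))
print(np.flatnonzero(nan_per_x))  # -> [0]: only the first point should be affected
```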
> two comments for @cschwan, (maybe) unrelated to the NaN issue discussed here:
>
> * after 1. I thought okay, maybe the grid is empty and optimize should dissolve the grid, but no - instead it made the grid _bigger_

This seems to be the lz4_flex crate performing worse than the lz4 binary, and I suppose this is the missing high compression: https://github.com/pseitz/lz4_flex?tab=readme-ov-file#todo. Try to decompress the optimized grid using `lz4 -d ...` and compress again with `lz4 --best ...`. You'll probably get a similar result to what you started with.
> * re 2: are you sure you want to raise an error here and not just return an empty FK table? (the same way grids might e.g. by misconfiguration be empty)

We can remove the error, but usually an empty FK table is a sign of a problem.
> From what I can gather, the EKO with the squared factorization scale 0.03 for the x-grid value with index 0 is NaN for all PIDs.
>
> yes, and that is fine - the problem is: all bins turn to NaN and not only that one

That's indeed a problem if there are NaNs, and it is fixed in this commit: https://github.com/NNPDF/pineappl/commit/9ca3022ef8288b9360584ced5d78f6701f7667e9. A nice 'side effect' of that commit is that ATLAS_1JET_8TEV_R06 now takes less than 5 minutes to evolve instead of 2-3 hours. Apparently we did a lot of linear algebra with zeros :scream: :face_holding_back_tears:.
Thanks for the fix @cschwan - unfortunately the next problem is just waiting for us :see_no_evil:
```
$ pineko convolve grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 eko_SMCSX_NC_24GEV_MUD_G1.tar myfk.pineappl.lz4 3 0 --pdf NNPDFpol11_100
┌───────────────┐
│ Computing ... │
└───────────────┘
grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4
+ eko_SMCSX_NC_24GEV_MUD_G1.tar
= myfk.pineappl.lz4
with max_as=3, max_al=0, xir=1.0, xif=1.0
/home/felix/Physik/N3PDF/FK/test/test_nans/test_nans/env/lib/python3.12/site-packages/eko/io/manipulate.py:39: UserWarning: The new grid is close to the current one
  warnings.warn("The new grid is close to the current one")
Optimizing for Nf6Ind
LHAPDF 6.5.4 loading /home/felix/local/share/LHAPDF/NNPDFpol11_100/NNPDFpol11_100_0000.dat
NNPDFpol11_100 PDF set, member #0, version 1; LHAPDF ID = 251000
   Q2 left  Q2 right  x left  x right  PineAPPL   FkTable  permille_error
0     0.03      0.03 0.00011  0.00011  0.000075       NaN            NaN
1     0.06      0.06 0.00022  0.00022  0.000750 -3.108669  -4.145571e+06
2     0.10      0.10 0.00039  0.00039  0.001188  0.012425   9.455017e+03
3     0.17      0.17 0.00063  0.00063  0.001198  0.014155   1.081314e+04
4     0.26      0.26 0.00100  0.00100  0.001005  0.005172   4.144236e+03
5     0.40      0.40 0.00160  0.00160  0.000756  0.002525   2.338793e+03
6     0.63      0.63 0.00250  0.00250  0.000622  0.002132   2.425944e+03
7     1.09      1.09 0.00430  0.00430 -0.000267  0.001492  -6.591994e+03
8     1.85      1.85 0.00780  0.00780 -0.001109  0.000840  -1.757323e+03
9     3.15      3.15 0.01430  0.01430 -0.000820  0.000544  -1.663877e+03
10    5.09      5.09 0.02450  0.02450  0.001257  0.000623  -5.043569e+02
11    7.00      7.00 0.03460  0.03460  0.004040  0.000777  -8.075818e+02
12    9.60      9.60 0.04870  0.04870  0.008376  0.000965  -8.848183e+02
13   14.70     14.70 0.07700  0.07700  0.017355  0.001228  -9.292186e+02
14   22.90     22.90 0.12100  0.12100  0.029710  0.001441  -9.514848e+02
```
as said many times: the small scale is crazy either way - but for large scales grid and FK table should match!
Asking the obvious question: the commit looks fine - but are we sure we keep all significant numbers? 3h -> 5min seems too good to be true ...
For me it works OK for both NNPDFpol10_100 and NNPDFpol11_100:
```
$ pineappl evolve test_nans/grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 test_nans/eko_SMCSX_NC_24GEV_MUD_G1.tar deleteme2.pineappl NNPDFpol10_100
b       Grid          FkTable      rel. diff
--+-------------+-------------+-------------
0  -1.9952907e-4           NaN           NaN
1  -4.7371111e-5 -2.4636624e-1   5.1997698e3
2   1.1477011e-4  1.0576968e-3   8.2157867e0
3   1.6858672e-4 -4.7133169e-3  -2.8957818e1
4   3.1849340e-4 -2.4589258e-3  -8.7204921e0
5   7.1417450e-4  9.7571016e-5 -8.6337931e-1
6   1.2756512e-3  1.4496402e-3  1.3639225e-1
7   9.5754601e-4  9.6154852e-4  4.1799681e-3
8  -5.0967539e-4 -5.0429946e-4 -1.0547753e-2
9  -6.7556878e-4 -6.9313368e-4  2.6000162e-2
10  1.4143775e-3  1.3610328e-3 -3.7716014e-2
11  4.2909944e-3  4.2142684e-3 -1.7880709e-2
12  8.6468737e-3  8.5545595e-3 -1.0676018e-2
13  1.7347675e-2  1.7235750e-2 -6.4519096e-3
14  2.9094309e-2  2.9052034e-2 -1.4530270e-3
Error: grids are different
```
```
$ pineappl evolve test_nans/grid_SMCSX_NC_24GEV_MUD_G1.pineappl.lz4 test_nans/eko_SMCSX_NC_24GEV_MUD_G1.tar deleteme2.pineappl NNPDFpol11_100
b       Grid          FkTable      rel. diff
--+-------------+-------------+-------------
0   7.4683807e-5           NaN           NaN
1   7.5005813e-4 -7.0928484e-1  -9.4663982e2
2   1.1884434e-3  1.9886700e-2   1.5733401e1
3   1.1982348e-3  1.1197249e-2   8.3447872e0
4   1.0054592e-3  3.9899255e-3   2.9682620e0
5   7.5620092e-4  1.2555767e-3  6.6037454e-1
6   6.2238374e-4  6.5659309e-4  5.4965044e-2
7  -2.6685384e-4 -2.6695924e-4  3.9495481e-4
8  -1.1094437e-3 -1.1138484e-3  3.9701437e-3
9  -8.1966310e-4 -8.3076021e-4  1.3538628e-2
10  1.2570481e-3  1.2289553e-3 -2.2348249e-2
11  4.0403197e-3  3.9898041e-3 -1.2502876e-2
12  8.3759581e-3  8.2973138e-3 -9.3892901e-3
13  1.7355381e-2  1.7220139e-2 -7.7925238e-3
14  2.9709758e-2  2.9615224e-2 -3.1819107e-3
Error: grids are different
```
> Asking the obvious question: the commit looks fine - but are we sure we keep all significant numbers? 3h -> 5min seems too good to be true ...

I'm quite positive about this, because the code is unit tested and all code paths are covered. I also have an understanding of where these zeros come from: for the ATLAS single-jet inclusive measurement there are 888 slices, but apparently each slice is only used in a single bin, and so the code before the commit mentioned above did lots of unnecessary matrix multiplications that resulted in zeros. After the commit these zeros are skipped, and therefore it's really fast; but actually it's not that the new code is fast, it's that the old code was extremely slow.
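(Schematically, the kind of saving described above looks like this; a toy numpy sketch, not the actual PineAPPL code, which is in Rust:)

```python
import numpy as np

rng = np.random.default_rng(0)
# Imagine many operator slices of which only a handful are non-zero:
slices = [np.zeros((50, 50)) for _ in range(880)] + [rng.random((50, 50)) for _ in range(8)]
grid = rng.random((50, 50))

result = np.zeros((50, 50))
for op in slices:
    if not op.any():      # skip slices that are identically zero ...
        continue
    result += op @ grid   # ... so only the few non-zero slices cost a matrix product
```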
I'm giving up for today, again I can not reproduce the numbers :disappointed: ... I can not get the PineAPPL CLI to not return NaN (it seems it is not master - why?), nor can I understand what pineko is doing differently from PineAPPL. @giacomomagni can you please check whether you can confirm either my numbers or @cschwan's (i.e. whether it is a pineko problem) ... note our grid predictions match.
Hi @felixhekhorn and @cschwan. Thanks a lot for fixing pineappl!!!
In the end it was good to push for a solution of this issue.
@cschwan may I ask for a new pineappl version including that fix?
@felixhekhorn the eko I provided you is a single theory of numerical FONLL, so I think the differences are genuine. I'll double check that the final fktables are good, comparing with the old ones (i.e. theory 823).
> Hi @felixhekhorn and @cschwan. Thanks a lot for fixing pineappl!!! In the end it was good to push for a solution of this issue. @cschwan may I ask for a new pineappl version including that fix?

Absolutely! Here you go: https://pypi.org/project/pineappl/0.8.2/.
> @felixhekhorn the eko I provided you is a single theory of numerical FONLL, so I think the differences are genuine. I'll double check that the final fktables are good, comparing with the old ones (i.e. theory 823).

I guess @cschwan and I were using the same eko, so I still wonder what the difference is ...
Ahh okay I see, on main it works, so the problem is related to how we are using the new evolve method.
(Btw pineko convolve is not yet able to deal with two ekos, I need to fix it.)
Okay, pineko was not using the correct PIDs. Fixed in 6812bdd.
related to https://github.com/NNPDF/eko/issues/389
I suggest to act here instead of eko - specifically at https://github.com/NNPDF/pineko/blob/e3aeba27c1dd328bebe4bd52c90e107a7cd9a1f2/src/pineko/evolve.py#L332-L335 and https://github.com/NNPDF/pineko/blob/e3aeba27c1dd328bebe4bd52c90e107a7cd9a1f2/src/pineko/comparator.py#L47 - then we can check:

- after the first (the evolution), if the entries are NaN we can raise an error
- after the second (the comparison), if the entries are NaN we can print a warning

What do you think @giacomomagni @Radonirinaunimi ?
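(For illustration, a rough sketch of what those two checks could look like; the helper names and the exact NaN conditions are hypothetical, not the current pineko API:)

```python
import warnings

import numpy as np


def check_after_evolve(fk_values: np.ndarray) -> None:
    """After the evolution step: if the whole FK table is NaN, something went wrong."""
    if np.isnan(fk_values).all():
        raise ValueError("all FK-table entries are NaN")


def check_after_comparison(rel_diff: np.ndarray) -> None:
    """After the grid/FK-table comparison: isolated NaN bins are expected
    (e.g. low-Q2 points), so only warn that they will have to be cut."""
    if np.isnan(rel_diff).any():
        warnings.warn("comparison contains NaN bins; they will have to be cut at the vp level")
```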