NNPDF / pineappl

PineAPPL is not an extension of APPLgrid
https://nnpdf.github.io/pineappl/
GNU General Public License v3.0
12 stars 3 forks source link

OO C++ interface #178

Closed felixhekhorn closed 1 year ago

felixhekhorn commented 1 year ago

With my collaborators we're trying to redo a previous analysis and so I was thinking about including a PineAPPL interface into my MC event generator (https://github.com/felixhekhorn/LeProHQ ) and I started to write a proper object oriented wrapper (since I like C++ much better then C ;-) ). (Actually, it will be much faster to just run the code for 1000 replicas then writing the wrapper, but since I like the project I'll try to do it nevertheless - and it can then be used in the future)

codecov[bot] commented 1 year ago

Codecov Report

Merging #178 (4c860c2) into master (7e68241) will decrease coverage by 0.66%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #178      +/-   ##
==========================================
- Coverage   89.74%   89.08%   -0.67%     
==========================================
  Files          47       47              
  Lines        7432     6935     -497     
==========================================
- Hits         6670     6178     -492     
+ Misses        762      757       -5     
Flag Coverage Δ
python 80.64% <ø> (ø)
rust 89.19% <ø> (-0.67%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pineappl_cli/src/optimize.rs 60.00% <0.00%> (-14.20%) :arrow_down:
pineappl_cli/src/upgrade.rs 76.92% <0.00%> (-10.58%) :arrow_down:
pineappl_cli/src/sum.rs 77.77% <0.00%> (-9.32%) :arrow_down:
pineappl_cli/src/delete.rs 84.21% <0.00%> (-9.13%) :arrow_down:
pineappl_fastnlo/src/lib.rs 80.00% <0.00%> (-8.53%) :arrow_down:
pineappl_cli/src/ops.rs 84.61% <0.00%> (-6.40%) :arrow_down:
pineappl_cli/src/info.rs 85.29% <0.00%> (-3.84%) :arrow_down:
pineappl_cli/src/subgrids.rs 85.84% <0.00%> (-3.71%) :arrow_down:
pineappl_cli/src/merge.rs 92.85% <0.00%> (-3.44%) :arrow_down:
pineappl_cli/src/obl.rs 92.00% <0.00%> (-1.94%) :arrow_down:
... and 12 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

felixhekhorn commented 1 year ago

Actually, @cschwan there is something I don't understand: how can pineappl_grid_fill be used to fill multi-dimensional grids? because my generator computes structure functions and so will be by default two dimensional (x, Q2) - and I can make it easily higher dimensional (say + heavy quark rapidity)

alecandido commented 1 year ago
  • heavy quark rapidity

Bins (the second dimension of the grid, according to obl sorting) are actually a multi-index. So, you can stuff in there how many dimensions you wish, for as long as they correspond to external (measured) kinematics.

cschwan commented 1 year ago

@felixhekhorn basically what @AleCandido said: for n-dimensional distributions you need to map them to 1-dimenional one using essentially bins, and after having filled the grid you can remap, using pineappl_grid_set_remapper, the 1-dimensional distribution again to your original n-dimensional one.

I decided to do that because you'd have to have different Grid::fill functions depending on how many dimensions you have, but then you could choose the wrong one and what do you do then? I'm not convinced this is the best approach, but for time being that's what we have. Another deciding factor was that most Monte Carlos only support one-dimensional distributions, and so you have to map to bins anyways.

cschwan commented 1 year ago

Opening/closing: that was me having butterfingers :smile: .

alecandido commented 1 year ago

I'm not convinced this is the best approach, but for time being that's what we have.

In a sense is the common one: even Numpy and other array libraries they always store one dimensional data, just because the memory model is linear. Usually, multi-dimensionality is achieved in two ways:

Maybe the bins machinery can be improved, and maybe we can think about a better way to maintain coordinates, possibly leveraging an external array crate like ndarray. But for the grid itself you wanted to manage on your own, because it is already a complicate data structure. The only other option I see is to populate an NDArray of "partial grids" (i.e. order, lumi, and subgrids) one for each bin.

felixhekhorn commented 1 year ago

you need to map them to 1-dimenional one using essentially bins, and after having filled the grid you can remap,

okay, I see - and I understand the technical limitations ... but then I wonder whether I should provide some sugar here on this level ... in any case I will first complete the example, then attempt my implementation and if I feel I need more, do more

cschwan commented 1 year ago

[..] and it can then be used in the future)

In the sense of pinefarm?

felixhekhorn commented 1 year ago

In the sense of pinefarm?

if there would an interest (EIC, maybe? the HERA stuff is mostly already in as combined), why not ... (and keep in mind for the fully inclusive case yadism is already using my code under the hood) otherwise mostly for me (and my collaborators) and my studies ...

felixhekhorn commented 1 year ago

The current implementation reproduces one-to-one the C counter part (now moved into examples/c).

Why I believe this is a good idea:

Things that could be improved:

alecandido commented 1 year ago
  • the interface is not complete, but just makes the example back working - if I'm allowed I'd rather do the completion upon request at a later time; as said the second step is to actually use it and doing so, I'll discover new functions that are needed (e.g. pineappl_grid_set_remapper)

This is exactly what we are doing for the Python wrapper. If it's working there, can work even here :)

  • I use std::list and std::vector in parallel - I usually prefer list since this is the "simplest implementation" as it just requires pointers, however sometimes vector is also nice too have, e.g. for bins for which at is quite nice to have; should I do something about this?

This I would not do: according to Stroustroup (I read it in A tour of C++, that should actual be just a part of The C++ Programming Language) always use a std::vector unless you have a very good reason not to do it. E.g., if you want to implement a dequeue, it is much better to use a std::list (most of the times at least). As well as if you have to insert many elements in the middle, and stuffs like that. But if you are not 90-100% convinced that your object has to be something else, than use std::vector, it is by far the most efficient.

cschwan commented 1 year ago

In almost all cases std::vector<T> is preferable because it lays out its elements subsequently in memory and so you can easily interface with C by using the data member function.

felixhekhorn commented 1 year ago

In almost all cases std::vector<T> is preferable because it lays out its elements subsequently in memory and so you can easily interface with C by using the data member function.

that's why I used vector some times :innocent: - however, everything moved to vector in https://github.com/N3PDF/pineappl/pull/178/commits/16c5b7f193df4dd46cd49e473a541f4850073cb3

alecandido commented 1 year ago

In almost all cases std::vector<T> is preferable because it lays out its elements subsequently in memory and so you can easily interface with C by using the data member function.

That might be one reason, but the most quoted one is because of memory management and operations: in practice std::vector has the best heuristic to limit the amount of allocations, and it has O(1) access performance. (but for example it is rather worse for insertion)

alecandido commented 1 year ago

Ah sorry, I also compiled and ran, and I can get a grid. The only weird thing is that I get practically the same output for c and cpp example, but not the same grid: they differ as binaries (a bit unexpected, but mostly uninformative) and I can not compare them with the CLI, because of the following reason:

Error: bins limits differ
felixhekhorn commented 1 year ago

Ah sorry, I also compiled and ran, and I can get a grid. The only weird thing is that I get practically the same output for c and cpp example, but not the same grid: they differ as binaries (a bit unexpected, but mostly uninformative) and I can not compare them with the CLI, because of the following reason:

Error: bins limits differ

this is due to float arithmetics - I use an iterator to compute the bins, Christopher a fixed list:

$ pineappl diff c/DY-LO-AA.pineappl.lz4 cpp/DY-LO-AA.pineappl.lz4 NNPDF31_nlo_as_0118_luxqed --ignore-bin-limits
LHAPDF 6.4.0 loading /home/felix/local/share/LHAPDF/NNPDF31_nlo_as_0118_luxqed/NNPDF31_nlo_as_0118_luxqed_0000.dat
NNPDF31_nlo_as_0118_luxqed PDF set, member #0, version 2; LHAPDF ID = 324900
b    x1                O(as^0 a^2)             
--+---+---+------------+------------+----------
 0   0 0.1 5.2631092e-1 5.2631092e-1    0.000e0
 1 0.1 0.2 5.2549083e-1 5.2549083e-1    0.000e0
 2 0.2 0.3 5.2468241e-1 5.2468241e-1  6.661e-16
 3 0.3 0.4 5.1883402e-1 5.1883402e-1 -6.661e-16
 4 0.4 0.5 5.1754819e-1 5.1754819e-1    0.000e0
 5 0.5 0.6 5.0088406e-1 5.0088406e-1    0.000e0
 6 0.6 0.7 4.9053252e-1 4.9053252e-1    0.000e0
 7 0.7 0.8 4.6757338e-1 4.6757338e-1 -1.110e-15
 8 0.8 0.9 4.3931594e-1 4.3931594e-1    0.000e0
 9 0.9   1 3.9929209e-1 3.9929209e-1    0.000e0
10   1 1.1 3.7068008e-1 3.7068008e-1 -9.992e-16
11 1.1 1.2 3.2647167e-1 3.2647167e-1  2.220e-15
12 1.2 1.3 2.8493445e-1 2.8493445e-1    0.000e0
13 1.3 1.4 2.4867234e-1 2.4867234e-1  2.220e-15
14 1.4 1.5 2.1104193e-1 2.1104193e-1    0.000e0
15 1.5 1.6 1.7974387e-1 1.7974387e-1    0.000e0
16 1.6 1.7 1.4714925e-1 1.4714925e-1  2.220e-15
17 1.7 1.8 1.2055655e-1 1.2055655e-1    0.000e0
18 1.8 1.9 9.4916254e-2 9.4916254e-2  2.220e-15
19 1.9   2 7.2557198e-2 7.2557198e-2 -2.109e-15
20   2 2.1 5.0569670e-2 5.0569670e-2    0.000e0
21 2.1 2.2 3.4917881e-2 3.4917881e-2    0.000e0
22 2.2 2.3 1.9675180e-2 1.9675180e-2  4.441e-15
23 2.3 2.4 5.5653059e-3 5.5653059e-3    0.000e0
Thanks for using LHAPDF 6.4.0. Please make sure to cite the paper:
  Eur.Phys.J. C75 (2015) 3, 132  (http://arxiv.org/abs/1412.7420)
cschwan commented 1 year ago

I'm not as experienced as @cschwan in C++, the only thing I can reasonably comment about is the following: you're writing the C++ interface wrapping the C one, but at this point is just sugar for the user, with the same (and possibly more) limitations than the C API.

I would consider making it directly on top of the Rust one, so distributing another crate pineappl_cppapi. But this only if you can limit the amount of boilerplate, because if the two folders have too much in common it makes no sense (we would be just duplicating).

I was thinking about making a similar comment early on, but I think having this code for users more familiar with C++ than C is a good idea, they can simply copy it. It's similar to the Fortran module that we also have in an example directory.

cschwan commented 1 year ago

this is due to float arithmetics - I use an iterator to compute the bins, Christopher a fixed list:

This is because we perform floating point comparisons without any tolerance. If you run cargo clippy there are numerous warnings, which I intend to fix at some point. They could become really problematic if you think about Grid::convolute_eko and the x-grid points.

felixhekhorn commented 1 year ago

Now, I have

$ md5sum DY-LO-AA.pineappl.lz4 
22c711a9b85f75936cd3d21d1cab4d11  DY-LO-AA.pineappl.lz4
$ md5sum ../basic-capi-usage/DY-LO-AA.pineappl.lz4 
e3785e3a8f145c5f9a960e259b5e71a3  ../basic-capi-usage/DY-LO-AA.pineappl.lz4
$ pineappl diff DY-LO-AA.pineappl.lz4 ../basic-capi-usage/DY-LO-AA.pineappl.lz4 NNPDF31_nlo_as_0118_luxqed
LHAPDF 6.4.0 loading /home/felix/local/share/LHAPDF/NNPDF31_nlo_as_0118_luxqed/NNPDF31_nlo_as_0118_luxqed_0000.dat
NNPDF31_nlo_as_0118_luxqed PDF set, member #0, version 2; LHAPDF ID = 324900
b    x1               O(as^0 a^2)           
--+---+---+------------+------------+-------
 0   0 0.1 5.2631092e-1 5.2631092e-1 0.000e0
 1 0.1 0.2 5.2549083e-1 5.2549083e-1 0.000e0
 2 0.2 0.3 5.2468241e-1 5.2468241e-1 0.000e0
 3 0.3 0.4 5.1883402e-1 5.1883402e-1 0.000e0
 4 0.4 0.5 5.1754819e-1 5.1754819e-1 0.000e0
 5 0.5 0.6 5.0088406e-1 5.0088406e-1 0.000e0
 6 0.6 0.7 4.9053252e-1 4.9053252e-1 0.000e0
 7 0.7 0.8 4.6757338e-1 4.6757338e-1 0.000e0
 8 0.8 0.9 4.3931594e-1 4.3931594e-1 0.000e0
 9 0.9   1 3.9929209e-1 3.9929209e-1 0.000e0
10   1 1.1 3.7068008e-1 3.7068008e-1 0.000e0
11 1.1 1.2 3.2647167e-1 3.2647167e-1 0.000e0
12 1.2 1.3 2.8493445e-1 2.8493445e-1 0.000e0
13 1.3 1.4 2.4867234e-1 2.4867234e-1 0.000e0
14 1.4 1.5 2.1104193e-1 2.1104193e-1 0.000e0
15 1.5 1.6 1.7974387e-1 1.7974387e-1 0.000e0
16 1.6 1.7 1.4714925e-1 1.4714925e-1 0.000e0
17 1.7 1.8 1.2055655e-1 1.2055655e-1 0.000e0
18 1.8 1.9 9.4916254e-2 9.4916254e-2 0.000e0
19 1.9   2 7.2557198e-2 7.2557198e-2 0.000e0
20   2 2.1 5.0569670e-2 5.0569670e-2 0.000e0
21 2.1 2.2 3.4917881e-2 3.4917881e-2 0.000e0
22 2.2 2.3 1.9675180e-2 1.9675180e-2 0.000e0
23 2.3 2.4 5.5653059e-3 5.5653059e-3 0.000e0

so they are not the exact same binary, but for pineappl they are the same

cschwan commented 1 year ago

@felixhekhorn if you like you can compare the uncompressed grids, they might be the same. It's likely that lz4 saves a date somewhere.

felixhekhorn commented 1 year ago

@felixhekhorn if you like you can compare the uncompressed grids, they might be the same. It's likely that lz4 saves a date somewhere.

Good idea, but no - the md5 sum is still different ...

felixhekhorn commented 1 year ago

@cschwan I believe to have addressed everything from the review, so I'd be happy to merge - however, it seems I somehow broke the Fortran example in the workflow (local there is no problem) and I've no idea how, do you?

cschwan commented 1 year ago

@felixhekhorn have a look at it again. If you have clang-format could you please try to run over it? Otherwise it can be merged now!

felixhekhorn commented 1 year ago

Apart from stupid stuff, cpplint complains about:

PineAPPL.hpp:10:  Found C system header after C++ system header. Should be: PineAPPL.h, c system, c++ system, other.  [build/include_order] [4]
PineAPPL.hpp:231:  Is this a non-const reference? If so, make const or use a pointer: LHAPDF::PDF &pdf  [runtime/references] [2]

should I do something?

cschwan commented 1 year ago

Apart from stupid stuff, cpplint complains about:

PineAPPL.hpp:10:  Found C system header after C++ system header. Should be: PineAPPL.h, c system, c++ system, other.  [build/include_order] [4]
PineAPPL.hpp:231:  Is this a non-const reference? If so, make const or use a pointer: LHAPDF::PDF &pdf  [runtime/references] [2]

should I do something?

The first one you can follow, what's the reasoning for the second one?

felixhekhorn commented 1 year ago

The first one you can follow, what's the reasoning for the second one?

eeeehm, it's a decision by cpplint and the best I could find is this issue and the respective section in the Google guidelines - and, of course, I usually do the same, as said before: making all inputs const so I can't modify them

we can even decide to ignore it and just merge at this point

cschwan commented 1 year ago

I'll leave you the choice; I consider it good enough as is.