scarlehoff opened 5 months ago
We have considered this many times, and it was part of the initial concept behind EKO.
The only reason we gave up is the size of the EKOs. If that is no longer an issue, and we have plenty of storage, it could be an interesting option (storing EKOs is essentially a space-time tradeoff, if you actually reuse them).

We discussed for quite some time the idea of compressing all the FK tables' EKOs into a single one. Indeed, the terminology used was *small* EKO and *big* EKO for a theory. The small one was the postfit one (which is actually computed and available), and the big one would have been the union of all the EKOs used to generate the FK tables. However, since EKOs scale quadratically with respect to the DIS grid (not exactly quadratically for the double-hadronic case), the storage requirement was considered absurd (just think about merging all the jet EKOs), and we gave up...
More specifically, you could also reuse EKO subsets, or recompute just a subset, if some configurations match (theory + evmod + scale variations + ...).
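As a minimal sketch of what "matching configurations" could mean in practice: a cache keyed on the full tuple of evolution settings, so two datasets asking for the same scale under the same theory get the same operator. The field names here are illustrative, not the actual eko/pineko configuration fields.

```python
from dataclasses import dataclass

# Hypothetical cache key: two evolution operators can be shared whenever
# all of these settings coincide (names are illustrative assumptions).
@dataclass(frozen=True)
class EkoKey:
    theory_id: int
    evmod: str      # evolution mode, e.g. "EXA" or "TRN"
    xif: float      # factorization scale variation
    xir: float      # renormalization scale variation
    q2: float       # target evolution scale in GeV^2
    xgrid: tuple    # interpolation grid, stored as a tuple so it is hashable

cache = {}

def get_operator(key, compute):
    """Reuse an already-computed operator when the full key matches."""
    if key not in cache:
        cache[key] = compute(key)
    return cache[key]
```

With a key like this, two DY datasets sharing theory, scale variations, and Q2 would trigger a single computation instead of two.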
> The only reason we gave up is the size of the EKOs
Why? I was thinking (hoping!) precisely of a way of reducing the number of EKOs needed since, for instance, most DY datasets probably share them.
At some point we also considered adding an interpolation in Q2 (or better, $\log(Q^2)$)... (i.e. to avoid computing for every possible $Q^2 \in \mathbb{R}$)
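The idea can be sketched as follows: compute operators only on a coarse grid of scales and interpolate each operator entry linearly in $\log(Q^2)$. The grid values and the toy 4x4 operators below are stand-ins for real EKOs, purely for illustration.

```python
import numpy as np

# Assumed coarse grid of scales (GeV^2) and toy operators standing in
# for real EKOs at those scales.
q2_nodes = np.array([10.0, 100.0, 1000.0, 10000.0])
ops = np.random.default_rng(0).random((len(q2_nodes), 4, 4))

def interp_operator(q2):
    """Linear interpolation of each operator entry in log(Q2)."""
    t = np.log(q2)
    nodes = np.log(q2_nodes)
    # find the interval [nodes[i], nodes[i+1]] containing t
    below = np.clip(np.searchsorted(nodes, t) - 1, 0, len(nodes) - 2)
    w = (t - nodes[below]) / (nodes[below + 1] - nodes[below])
    return (1 - w) * ops[below] + w * ops[below + 1]
```

Whether linear interpolation in $\log(Q^2)$ is accurate enough for evolution operators would of course need to be checked against exact computations.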
> Why? I was thinking (hoping!) precisely of a way of reducing the number of EKOs needed since, for instance, most DY datasets probably share them.
They usually involve different scales.
The relevant settings are controlled by the theory, so they should not change per dataset. And when we are computing the grids ourselves (when that happens) we can even control the momentum-fraction grid (although this is not the case for jets and the other Ploughshare grids). But we do not control the Q2 scales*...
*Well, to some extent you can control the dynamical scale choice, but there are external recommendations on how to pick it, so it is not something we can lightheartedly use for computational optimization.
> They usually involve different scales.
Many of the DY datasets just have $\mu_F = \mu_R = m_Z$.
That's why I'm thinking the same family of processes might often share the scales, even when they are dynamic
(and even more so if we have the same process binned across some variable the scale does not depend on, like some of the 2D distributions)
However, these datasets are often not problematic: if your scale does not depend on the bin, you often have a single scale per dataset, and those EKOs are small (the usual `len(xgrid) ** 2 * len(flavors) ** 2 * size(float)`, without the Q2 factor). And the computational demand is proportional to the size.
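The estimate above gives a concrete number; a back-of-the-envelope evaluation (grid sizes are illustrative assumptions, not the actual NNPDF settings):

```python
# Naive size of a single-scale EKO, following the
# len(xgrid) ** 2 * len(flavors) ** 2 * size(float) estimate above.
n_x = 50          # assumed interpolation points in x
n_flav = 14       # assumed flavor basis size
float_size = 8    # bytes per float64

size_bytes = n_x ** 2 * n_flav ** 2 * float_size
print(f"{size_bytes / 1e6:.1f} MB per Q2 point")  # ~3.9 MB
```

A few MB per scale is indeed negligible; the problem only appears once the Q2 factor multiplies this by hundreds of scales.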
All the big EKOs come from having many scales, to the best of my knowledge.
If I recall correctly, a big problem was the jet measurements at the LHC, where the xgrid was also not constant over Q2. That led to the biggest EKOs I've seen so far. But we could try to implement a hybrid approach in which we have several EKOs:
So the best of both worlds essentially.
> If I recall correctly, a big problem was the jet measurements at the LHC, where the xgrid was also not constant over Q2.
This was due to them not being originally pineappl grids, right?
> If I recall correctly, a big problem was the jet measurements at the LHC, where the xgrid was also not constant over Q2.
I remember this as well. However, it should not be a big deal: each Q2 value is computed separately in EKO, so sharing the same xgrid across different scales is only helpful for the common part (up to the matching).
I.e., if each value of Q2 has its own xgrid, it could be up to 3x the computation (up-to-bottom evolution + bottom matching + from-bottom evolution, since the NNPDF Q0 is in 4 flavors, and ignoring everything above top). But if that's not the case, the overhead should be small.
> This was due to them not being originally pineappl grids, right?
For sure.
In principle and in practice we also have preferred Q2 values: if a dynamic scale is chosen, the Q2 points of newly generated grids should always be a subset of 40 fixed values. The only exception for new grids comes from datasets where we chose a static scale value, but then there is only one Q2 per dataset/bin.
We could choose not to make the static-scale optimization, and then we would already know which EKOs are needed: only the ones for the 50 known xgrid values and the 40 Q2 values.
With one Q2 value per dataset plus the 40, and 50 xgrid points, even the "big" per-theory EKO (the FK-table EKO, as opposed to the postfit EKO) would be very reasonable. It would certainly be sizeable, but reasonable.
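To put a number on "sizeable but reasonable": extending the single-scale estimate from earlier in the thread to the whole theory, with the number of static-scale datasets taken as an illustrative guess.

```python
# Rough size of the "big" per-theory EKO under the scheme above:
# 50 xgrid points, the 40 preferred Q2 values, plus roughly one static
# scale per static-scale dataset (the 30 is an assumed count).
n_x, n_flav, float_size = 50, 14, 8
n_q2 = 40 + 30  # 40 fixed scales + ~30 static-scale datasets (guess)

per_scale = n_x ** 2 * n_flav ** 2 * float_size  # ~3.9 MB per scale
total = per_scale * n_q2
print(f"{total / 1e9:.2f} GB for the whole theory")  # ~0.27 GB
```

A few hundred MB per theory would indeed be a far cry from the multi-TB estimates that killed the original *big* EKO idea.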
However, we still have many wild (imported) grids. Are we planning to recompute them soon? Have they already been recomputed?
For the old theories that ship has sailed of course, but we are recomputing many grids as we are preparing the theory for 4.1
I think only singletop, jets and dijets will not be native pineappl grids. And both jets and dijets should be pineappl-able, since they are processes included in nnlojet.
> I think only singletop, jets and dijets will not be native pineappl grids. And both jets and dijets should be pineappl-able, since they are processes included in nnlojet.
Using separate EKOs for those seems like a good compromise.
> For the old theories that ship has sailed of course, but we are recomputing many grids as we are preparing the theory for 4.1
Whatever you're doing, the proposal is of course for new theories. The old ones could at most be deprecated in favor of new ones (because of known bugs/limitations), and the files could be dropped in the very long term.
> I think only singletop, jets and dijets will not be native pineappl grids. And both jets and dijets should be pineappl-able, since they are processes included in nnlojet.
How much computation would be required to pinefarm them?
At the moment, in order to generate a theory, we need to generate an insane amount of EKOs.
However, since many datasets share the scale and all pineappl grids share the same x-factors, it should be possible to generate a cache of EKOs (for a given theory).
So for instance, if I'm going to run:
Pineko should be able to:
1. read all the EKOs already present in the eko folder (in which the EKO for `dataset_n` will be generated),
2. read the relevant operator cards (no need to parse all the EKOs),
3. find out which of the operators for `dataset_n` are already computed and take them directly from there.

The (ideal) next step would be to not save all operators, but just the union of all operators requested in all operator cards.
I'm wondering whether this is a crazy idea or whether it could be doable. I'm particularly interested in the ideal next step, since I'm having storage problems...