NNPDF / pineappl

PineAPPL is not an extension of APPLgrid
https://nnpdf.github.io/pineappl/
GNU General Public License v3.0

Add offline file-size optimisation #45

Open cschwan opened 3 years ago

cschwan commented 3 years ago

Possible optimisations:

cschwan commented 3 years ago

Pull request https://github.com/N3PDF/pineappl/pull/48 implements a more efficient data structure.

cschwan commented 3 years ago

I've tested PR #48 with the complete ATLAS DY 3D grid, before (LagrangeSubgridV1) and after (LagrangeSparseSubgridV1) the optimisation, and both with and without LZ4 compression. Here's the table:

Compression | LagrangeSubgridV1 | LagrangeSparseSubgridV1
none        | 3.1 GB            | 497 MB
LZ4         | 377 MB            | 364 MB

The numbers without compression are essentially the memory requirements when loading the grid (for convolutions, etc.). Because the grid is smaller, convolutions of the grid with PDFs are also faster: from 22 seconds down to 16 seconds. LZ4 compression makes virtually no difference to these timings.
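
To illustrate why a sparse layout can shrink the uncompressed size so much, here is a minimal Python sketch. It is not the actual LagrangeSparseSubgridV1 implementation: the class name `SparseRow` and the trimming strategy (store only the span between the first and last non-zero entry of a row) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SparseRow:
    """Hypothetical sparse row: keeps only the span from the first to the last non-zero."""

    length: int          # length of the original dense row
    offset: int          # index of the first stored value
    values: List[float]  # stored values (interior zeros are kept)

    @classmethod
    def from_dense(cls, dense: List[float]) -> "SparseRow":
        nonzero = [i for i, v in enumerate(dense) if v != 0.0]
        if not nonzero:
            # An all-zero row needs no stored values at all.
            return cls(len(dense), 0, [])
        first, last = nonzero[0], nonzero[-1]
        return cls(len(dense), first, dense[first:last + 1])

    def to_dense(self) -> List[float]:
        dense = [0.0] * self.length
        dense[self.offset:self.offset + len(self.values)] = self.values
        return dense


# A row of 50 entries in which only a narrow window is filled.
dense_row = [0.0] * 40 + [1.5, 0.0, 2.5, 3.0] + [0.0] * 6
sparse_row = SparseRow.from_dense(dense_row)
```

In this toy row only 4 of the 50 entries need to be stored, and `sparse_row.to_dense()` restores the original row.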

cschwan commented 3 years ago

To optimise a grid, simply run pineappl optimize input.pineappl optimized.pineappl; no re-generation of the grid is needed.
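
For many grids, the command above can be wrapped in a small script. The following is only a sketch: the helper names `optimize_command` and `optimize_all` and the directory layout are hypothetical, and the only CLI call used is the one quoted above.

```python
import subprocess
from pathlib import Path
from typing import List


def optimize_command(src: Path, dst: Path) -> List[str]:
    """Build the call quoted above: pineappl optimize <input> <output>."""
    return ["pineappl", "optimize", str(src), str(dst)]


def optimize_all(src_dir: Path, dst_dir: Path, dry_run: bool = True) -> List[List[str]]:
    """Collect (and optionally run) one optimize call per *.pineappl file in src_dir."""
    commands = []
    for src in sorted(src_dir.glob("*.pineappl")):
        cmd = optimize_command(src, dst_dir / src.name)
        commands.append(cmd)
        if not dry_run:
            dst_dir.mkdir(parents=True, exist_ok=True)
            # Requires the pineappl CLI to be available on PATH.
            subprocess.run(cmd, check=True)
    return commands
```

With `dry_run=True` (the default) nothing is executed and the function only returns the commands it would run, which makes it easy to inspect them first.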

cschwan commented 3 years ago

Here is a comparison against APPLgrid (using the converter from PR #17 and CMS_SINGLETOP_TCH_R_7TEV_T.root):

Setup                                      | File size
APPLgrid                                   | 2.1 MB
converted                                  | 1.9 MB
converted+optimized                        | 1.5 MB
converted+optimized+compressed             | 1.4 MB
converted+optimized+symmetrized+compressed | 683K

cschwan commented 3 years ago

Commit 1ca0a55f220cb26c1446bf8107e0475907d06361 further optimizes the file sizes of grids for initial-state symmetric processes (for instance proton-proton collisions) by making use of the symmetry of the double-sum over the interpolated x1 and x2.
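
The idea can be sketched in a few lines of Python. This is a simplified model, not the code of the commit: it assumes a single channel with the same PDF on both sides, and `convolve` and `fold_symmetric` are hypothetical helper names. Because the product f(x1) f(x2) is symmetric under exchanging x1 and x2, every weight below the diagonal can be added to its mirror entry above the diagonal without changing the double sum.

```python
from typing import List

Matrix = List[List[float]]


def convolve(weights: Matrix, pdf: List[float]) -> float:
    """Double sum over the interpolation nodes in x1 (index i) and x2 (index j)."""
    n = len(pdf)
    return sum(pdf[i] * pdf[j] * weights[i][j] for i in range(n) for j in range(n))


def fold_symmetric(weights: Matrix) -> Matrix:
    """Add every entry below the diagonal to its mirror image above the diagonal."""
    n = len(weights)
    folded = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            row, col = min(i, j), max(i, j)
            folded[row][col] += weights[i][j]
    return folded


weights = [[1.0, 2.0, 0.5], [4.0, 3.0, 1.0], [0.25, 2.0, 5.0]]
pdf = [0.5, 2.0, 4.0]
folded = fold_symmetric(weights)
```

After folding, everything below the diagonal is zero, so roughly half of the entries no longer need to be stored, while `convolve(folded, pdf)` gives the same result as `convolve(weights, pdf)`.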

cschwan commented 3 years ago

Commits 098fe5d04ec9d8eb8a7ee4d1848bfc174e316245 and a0f32fc5906e74abc2e48c568086feb3c560ed09 further decrease the size of all grids that have a static scale (different static scales in different bins are also optimised). The size improvement is a factor of four by default (the interpolation degree plus one).

This optimisation modifies the numerical value of the convolution, since the PDFs are no longer evaluated at multiple q2 grid points, but instead at the single static scale; however, in general the result should be more accurate, because one interpolation dimension is removed.
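
Both statements (the factor of four and the small change in the numerical value) can be seen in a toy example. The sketch below assumes a plain degree-3 Lagrange interpolation on made-up nodes and a hypothetical smooth function `pdf_like` standing in for the scale dependence of a PDF; the actual interpolation variable and node placement in PineAPPL may differ.

```python
from typing import List


def lagrange_weights(nodes: List[float], x: float) -> List[float]:
    """Lagrange basis polynomials for `nodes`, evaluated at `x`."""
    weights = []
    for k, xk in enumerate(nodes):
        w = 1.0
        for m, xm in enumerate(nodes):
            if m != k:
                w *= (x - xm) / (xk - xm)
        weights.append(w)
    return weights


def pdf_like(scale: float) -> float:
    """Hypothetical stand-in for a smoothly scale-dependent quantity."""
    return 1.0 / scale


nodes = [4.0, 5.0, 6.0, 7.0]  # degree + 1 = 4 grid points (made-up values)
static_scale = 5.3            # the single scale shared by all events in the bin
event_weight = 2.0

# Interpolated storage: the event weight is spread over all four grid points.
stored = [event_weight * w for w in lagrange_weights(nodes, static_scale)]
interpolated = sum(s * pdf_like(x) for s, x in zip(stored, nodes))

# Static-scale storage: one entry, evaluated exactly at the static scale.
exact = event_weight * pdf_like(static_scale)
```

Here `stored` holds four numbers where the static-scale version needs one, and `interpolated` differs from `exact` only by the (small) interpolation error.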

cschwan commented 3 years ago

Using ./appl2pine (see also https://github.com/N3PDF/pineappl/issues/17) I converted all grids (except two) into PineAPPL grids. The following are my observations:

cschwan commented 3 years ago

Commit fce09e13383909043dc104434d9f6be45072bb59 removes empty luminosity entries, which is primarily required for generating smaller FK tables.
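
Conceptually this amounts to a filter over the luminosity entries. The following sketch uses a hypothetical layout (a dictionary mapping a channel label to one list of weights per bin) that is not PineAPPL's actual data structure; the function name `strip_empty_channels` is likewise made up.

```python
from typing import Dict, List

# Hypothetical layout: channel label -> one list of weights per bin.
Grid = Dict[str, List[List[float]]]


def strip_empty_channels(grid: Grid) -> Grid:
    """Keep only the channels that have at least one non-zero weight in some bin."""
    return {
        channel: subgrids
        for channel, subgrids in grid.items()
        if any(w != 0.0 for subgrid in subgrids for w in subgrid)
    }


grid = {
    "u ubar": [[0.0, 1.2], [0.3, 0.0]],
    "g g": [[0.0, 0.0], []],  # never filled: can be dropped
    "d dbar": [[0.0], [0.7]],
}
stripped = strip_empty_channels(grid)
```

Only the `"g g"` entry is removed here, since it contributes nothing to any bin.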

felixhekhorn commented 2 years ago

@scarlehoff just came across this one: "strip numerical zeros" - maybe we can increase the priority?

cschwan commented 2 years ago

In the meantime the size of CMS_SINGLETOP_TCH_R_7TEV_T has slightly degraded to 684K, but the ATLAS DY 3D @ 8 TeV grid has shrunk to 66 MB from 364 MB.

alecandido commented 2 years ago

> In the meantime the size of CMS_SINGLETOP_TCH_R_7TEV_T has slightly degraded to 684K

Definitely not a problem

> but the ATLAS DY 3D @ 8 TeV grid has shrunk to 66 MB from 364 MB.

That's great :)

cschwan commented 2 years ago

Here's an update of the numbers from https://github.com/N3PDF/pineappl/issues/45#issuecomment-838720045, using the CLI pineappl v0.5.5-19-gd924e9e and converting https://github.com/NNPDF/applgrids/commit/89440895f95dc6747560f52a2860b69ed70e9b48:

That's a -46% reduction!

alecandido commented 2 years ago

>   • PineAPPL grids: 3338 MBytes
>
> That's a -46% reduction!

Are you comparing with the .pineappl or .pineappl.lz4? I'd just like to decouple PineAPPL optimization from lz4 compression :) (or are the APPLgrids compressed as well?)

cschwan commented 2 years ago

> Are you comparing with the .pineappl or .pineappl.lz4? I'd just like to decouple PineAPPL optimization from lz4 compression :) (or are the APPLgrids compressed as well?)

The PineAPPL grids are LZ4 compressed and, as far as I understand, the ROOT file format is ZLIB compressed, so in that sense I think it's a fair comparison.

However, you might wonder how good or helpful the compression in PineAPPL's case is, so I added the number without compression in the comment above.

alecandido commented 2 years ago

Ok, good. Then PineAPPL is already doing a great job on its own :D

> The PineAPPL grids are LZ4 compressed and, as far as I understand, the ROOT file format is ZLIB compressed, so in that sense I think it's a fair comparison.

Perfect, it was reasonable.

>   • PineAPPL grids: 3338 MBytes (without compression: 3707 MBytes)

I wonder if there is a reason why LZ4 compression is doing so little. In some sense, that's a good sign on its own. In eko it changes a lot, because we are saving almost-triangular matrices in rectangular ones, with plenty of zeros. That is not the smartest choice for storage, but it was the best compromise for usage (maybe at some point we might want to reconsider, to see if we can support almost-triangular matrices equally well, saving memory and operations @felixhekhorn).

cschwan commented 2 years ago

I think the reason is that the format is binary and already has very little redundancy (one f64 is 8 bytes). Are you compressing text files/YAML? In that case the compression should work much better.
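
A quick way to check this claim is to compress the same numbers in different representations. The sketch below uses `zlib` from the Python standard library as a stand-in for LZ4 (an assumption, since LZ4 is not in the standard library), so the absolute ratios will differ from LZ4's, but the trend should be similar. All data are random numbers generated just for this test.

```python
import random
import struct
import zlib


def ratio(payload: bytes) -> float:
    """Compressed size divided by original size (smaller means better compression)."""
    return len(zlib.compress(payload)) / len(payload)


rng = random.Random(0)
n = 64

# Dense binary f64 data: nearly every byte carries information.
dense = [rng.random() for _ in range(n * n)]
dense_bytes = struct.pack(f"<{n * n}d", *dense)

# Almost-triangular matrix stored as a full rectangle: about half the entries are zero.
triangular = [dense[i * n + j] if j >= i else 0.0 for i in range(n) for j in range(n)]
triangular_bytes = struct.pack(f"<{n * n}d", *triangular)

# The same dense numbers written as text, as in a YAML or plain-text file.
text_bytes = "\n".join(repr(x) for x in dense).encode()

dense_ratio = ratio(dense_bytes)
triangular_ratio = ratio(triangular_bytes)
text_ratio = ratio(text_bytes)
```

The dense binary payload barely shrinks, whereas the zero-padded triangular matrix and the text representation compress considerably better, which matches the difference seen between PineAPPL grids and the eko case discussed above.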

felixhekhorn commented 2 years ago

> I think the reason is that the format is binary and already has very little redundancy (one f64 is 8 bytes). Are you compressing text files/YAML? In that case the compression should work much better.

In eko we're compressing .npy

> maybe at some point we might want to reconsider, to see if we can support almost-triangular matrices equally well, saving memory and operations @felixhekhorn

we can - but this is a N3LO problem, I'd say

alecandido commented 2 years ago

> we can - but this is a N3LO problem, I'd say

N3LO is essentially now :) (however, let's discuss somewhere else)