Pull request https://github.com/N3PDF/pineappl/pull/48 implements a more efficient data structure.
I've tested PR #48 with the complete ATLAS DY 3D grid, before (`LagrangeSubgridV1`) and after (`LagrangeSparseSubgridV1`) the optimisation, and both with and without LZ4 compression. Here's the table:

Compression | `LagrangeSubgridV1` | `LagrangeSparseSubgridV1`
---|---|---
none | 3.1 GB | 497 MB
LZ4 | 377 MB | 364 MB
The uncompressed numbers are basically the memory requirements when loading the grid (for convolutions, etc.). Due to the smaller size, convolutions of the grid with PDFs are also faster: from 22 seconds down to 16 seconds; whether LZ4 compression is used makes virtually no difference here.
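For intuition about where the savings come from, here is a minimal Python sketch of the idea behind a sparse subgrid (PineAPPL's actual `LagrangeSparseSubgridV1` is a Rust data structure; everything below, including the node counts, is illustrative only): interpolation grids are mostly zeros, so storing only the non-zero entries shrinks them drastically.

```python
import numpy as np

# A subgrid is conceptually a 3D array of interpolation weights over
# (q2, x1, x2) nodes; the node counts below are made up. Real grids are
# mostly zeros, because each event only fills a few neighbouring nodes.
shape = (40, 50, 50)
dense = np.zeros(shape)
rng = np.random.default_rng(0)
for _ in range(500):                      # a few hundred filled entries
    idx = tuple(rng.integers(s) for s in shape)
    dense[idx] = rng.random()

# Sparse alternative: keep only the non-zero entries as {index: weight}.
sparse = {idx: w for idx, w in np.ndenumerate(dense) if w != 0.0}

dense_bytes = dense.nbytes                # 40 * 50 * 50 * 8 = 800000 bytes
sparse_bytes = len(sparse) * (3 + 1) * 8  # rough: 3 indices + 1 weight each
print(dense_bytes, sparse_bytes)          # ~800 kB vs ~16 kB
```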
To optimise a grid, simply run `pineappl optimize input.pineappl optimized.pineappl`; no re-generation of the grid is needed.
Here is a comparison against APPLgrid (using the converter from PR #17 and `CMS_SINGLETOP_TCH_R_7TEV_T.root`):
Setup | File size
---|---
APPLgrid | 2.1 MB
converted | 1.9 MB
converted+optimized | 1.5 MB
converted+optimized+compressed | 1.4 MB
converted+optimized+symmetrized+compressed | 683 kB
Commit 1ca0a55f220cb26c1446bf8107e0475907d06361 further optimizes the file sizes of grids for initial-state symmetric processes (for instance proton-proton collisions) by making use of the symmetry of the double sum over the interpolated `x1` and `x2`.
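To see how that symmetry roughly halves the storage, here is a toy Python sketch (not PineAPPL's actual code): for identical beams the weight at `(x2, x1)` can be folded onto `(x1, x2)`, so only one triangle of the matrix needs to be kept.

```python
import numpy as np

n = 50
w = np.random.default_rng(1).random((n, n))   # toy x1-x2 subgrid at fixed q2

# Fold the lower triangle onto the upper one: for symmetric initial
# states the convolution only sees the sum w[i, j] + w[j, i] anyway.
folded = np.triu(w) + np.triu(w.T, k=1)

# Packed storage: n * (n + 1) / 2 entries instead of n * n.
packed = folded[np.triu_indices(n)]
print(w.size, packed.size)                    # 2500 vs 1275, roughly half
```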
Commits 098fe5d04ec9d8eb8a7ee4d1848bfc174e316245 and a0f32fc5906e74abc2e48c568086feb3c560ed09 further decrease the size of all grids that have a static scale (different static scales in different bins are also optimised). The size improvement is a factor of four by default (the interpolation degree plus one).
This optimisation modifies the numerical value of the convolution, since the PDFs are no longer evaluated at multiple `q2` grid points, but instead at the single static scale; however, in general the result should be more accurate, because one interpolation dimension is removed.
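As a toy illustration of where the factor of four comes from (Python, purely illustrative): with interpolation degree 3, a single static scale spreads each event's weight over degree + 1 = 4 neighbouring `q2` nodes, and once the scale is known to be static those slices can be collapsed into one.

```python
import numpy as np

degree = 3                    # q2 interpolation degree
nodes = degree + 1            # a static scale touches 4 q2 nodes
nq2, nx = 40, 50

grid = np.zeros((nq2, nx, nx))
filled = slice(10, 10 + nodes)
grid[filled] = np.random.default_rng(2).random((nodes, nx, nx))

# If every filled q2 slice comes from one static scale, the q2 axis can
# be dropped: keep a single slice and record the scale itself. The sum
# below is only a stand-in for properly undoing the q2 interpolation.
collapsed = grid[filled].sum(axis=0)
print(grid[filled].nbytes // collapsed.nbytes)   # factor nodes = 4
```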
Using `./appl2pine` (see also https://github.com/N3PDF/pineappl/issues/17) I converted all grids (except two) into PineAPPL grids. The following are my observations:
Commit fce09e13383909043dc104434d9f6be45072bb59 removes empty luminosity entries, which is primarily required for generating smaller FK tables.
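As a rough sketch of the idea in Python (the data layout below is hypothetical, not PineAPPL's API): a luminosity channel whose subgrids are empty in every bin contributes nothing and can be dropped.

```python
# Hypothetical layout: one sparse subgrid (dict of non-zero weights)
# per (bin, luminosity channel) pair.
subgrids = {
    (0, "gg"): {(1, 2, 3): 0.5},
    (1, "gg"): {(4, 5, 6): 0.1},
    (0, "qqbar"): {},     # empty in every bin ...
    (1, "qqbar"): {},     # ... so the whole channel can be removed
}

alive = {lumi for (_, lumi), sg in subgrids.items() if sg}
subgrids = {key: sg for key, sg in subgrids.items() if key[1] in alive}
print(sorted({lumi for _, lumi in subgrids}))   # ['gg']
```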
@scarlehoff just came across this one: "strip numerical zeros" - maybe we can increase the priority?
In the meantime the size of `CMS_SINGLETOP_TCH_R_7TEV_T` has slightly degraded to 684 kB, but the ATLAS DY 3D @ 8 TeV grid has shrunk from 364 MB to 66 MB.
> In the meantime the size of `CMS_SINGLETOP_TCH_R_7TEV_T` has slightly degraded to 684 kB

Definitely not a problem.

> but the ATLAS DY 3D @ 8 TeV grid has shrunk from 364 MB to 66 MB.

That's great :)
Here's an update of the numbers from https://github.com/N3PDF/pineappl/issues/45#issuecomment-838720045, using the CLI `pineappl v0.5.5-19-gd924e9e` and converting https://github.com/NNPDF/applgrids/commit/89440895f95dc6747560f52a2860b69ed70e9b48:

- APPLgrids: … MBytes (excluding the `.git` subfolder)
- PineAPPL grids: 3338 MBytes

That's a -46% reduction!
Are you comparing with the `.pineappl` or the `.pineappl.lz4`? I'd just like to decouple PineAPPL optimization from lz4 compression :) (or are the APPLgrids compressed as well?)
> Are you comparing with the `.pineappl` or the `.pineappl.lz4`? I'd just like to decouple PineAPPL optimization from lz4 compression :) (or are the APPLgrids compressed as well?)
The PineAPPL grids are LZ4 compressed and as far as I understand the ROOT file format is ZLIB compressed^1, so in that sense it's a fair comparison I think.
However, you might wonder how good or helpful the compression in PineAPPL's case is, so I added the number without compression in the comment above.
Ok, good. Then PineAPPL is already doing a great job on its own :D
> The PineAPPL grids are LZ4 compressed and as far as I understand the ROOT file format is ZLIB compressed^1, so in that sense it's a fair comparison I think.
Perfect, it was reasonable.
> - PineAPPL grids: 3338 MBytes (without compression: 3707 MBytes)
I wonder if there is a reason why LZ4 compression is doing so little. In some sense, that's a good sign on its own.
In `eko` it is changing a lot, because we are saving almost-triangular matrices in rectangular ones, with plenty of zeros. That is not the smartest choice for storage, but it was the best compromise for usage (maybe at some point we might want to reconsider, to see if we can get equally good support for almost-triangular matrices, saving memory and operations @felixhekhorn).
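For concreteness, a small Python sketch of that trade-off (toy sizes, nothing eko-specific): a lower-triangular matrix stored as a full rectangle spends about half its space on explicit zeros, while a packed 1D layout avoids them at the cost of some index arithmetic on every access.

```python
import numpy as np

n = 1000
tri = np.tril(np.random.default_rng(3).random((n, n)))  # almost-triangular operator

# Rectangular storage: trivial indexing, but ~half the entries are zeros.
rect_bytes = tri.nbytes

# Packed storage: only the n * (n + 1) / 2 meaningful entries survive.
packed = tri[np.tril_indices(n)]

def packed_get(i, j):
    """Element (i, j) with i >= j of the packed lower triangle."""
    return packed[i * (i + 1) // 2 + j]

assert packed_get(5, 3) == tri[5, 3]
print(rect_bytes, packed.nbytes)    # 8000000 vs 4004000 bytes
```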
I think the reason is that the format is binary with already very high entropy (one `f64` is 8 bytes), so there is little redundancy left to compress. Are you compressing text files/yaml? In that case the compression should work much better.
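A quick way to test this claim (assuming the `lz4` Python package is installed; exact ratios will vary): random `f64` payloads barely compress, while mostly-zero arrays and repetitive text compress very well.

```python
import lz4.frame
import numpy as np

rng = np.random.default_rng(4)

random_f64 = rng.random(100_000).tobytes()       # high-entropy binary floats
mostly_zero = np.zeros(100_000)
mostly_zero[:5_000] = rng.random(5_000)          # zero-padded, like a rectangular
                                                 # store of a triangular matrix
text = ("value: 0.123456789\n" * 50_000).encode()  # repetitive YAML-ish text

for name, data in [("random f64", random_f64),
                   ("mostly-zero f64", mostly_zero.tobytes()),
                   ("text", text)]:
    ratio = len(lz4.frame.compress(data)) / len(data)
    print(f"{name}: compressed to {ratio:.0%} of original size")
```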
> I think the reason is that the format is binary with already very high entropy (one `f64` is 8 bytes), so there is little redundancy left to compress. Are you compressing text files/yaml? In that case the compression should work much better.
In `eko` we're compressing `.npy` files.
> maybe at some point we might want to reconsider, to see if we can get equally good support for almost-triangular matrices, saving memory and operations @felixhekhorn
we can - but this is an N3LO problem, I'd say
> we can - but this is an N3LO problem, I'd say
N3LO is essentially now :) (however, let's discuss somewhere else)
This has long been implemented; let's open a new issue for more optimizations.
Possible optimisations:

- entries with `factor = 0.0` (why are they there in the first place?)