NNPDF / theories

Contains all ingredients (grids, operator cards, dataset definitions) necessary to regenerate any theory using the pineko structure.

Put the actual grid in a more reasonable location #1

Closed scarlehoff closed 1 year ago

scarlehoff commented 1 year ago

At the moment the heavy part of the theory, the grids, is stored in this repository. This is a temporary solution: the grids should live somewhere else (on some external server) and the grids folder should instead be replaced by some kind of file that keeps track of the grids and their changes.

Something like:

- ATLASBLABLA: <name_in_remote_folder>.pineappl.lz4
- CMSBLUBLU: <name_in_remote_folder>.pineappl.lz4

The name_in_remote_folder should be a unique identifier (to a first approximation it can be theory_dataset_date), so that if a fix is necessary for a given grid, both the fixed and the broken grid can be kept on the server while the changes to the grid_meta.yaml file are tracked in this repository.
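A minimal sketch of what such a tracking file could look like, using the theory_dataset_date naming suggested above (the theory number and dates here are purely hypothetical placeholders):

```yaml
# grid_meta.yaml -- hypothetical sketch, not the actual file
# dataset name -> uniquely named grid on the remote server
ATLASBLABLA: theory400_ATLASBLABLA_20230101.pineappl.lz4
CMSBLUBLU: theory400_CMSBLUBLU_20230101.pineappl.lz4
```

If a grid needs fixing, a new entry with a later date replaces the old one here, while both files remain on the server.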

@scarrazza can we use the INFN server for this? We would potentially need a lot of storage.

alecandido commented 1 year ago

If possible, we would also need a public location, so that people outside the collaboration are able to use those assets to produce theories (this is a minor requirement for the time being, but a strong one for the Como school).

scarlehoff commented 1 year ago

@scarrazza since we are moving towards having papers done with the new pipeline, do you think we could get extra storage from the INFN to have some kind of (manual, if necessary) version control for the theories used in the papers?

scarrazza commented 1 year ago

We can try, send me an estimate of required space.

scarlehoff commented 1 year ago

The current weight of all theories is ~250 GB. I think 1 or 2 TB will be OK for now, since this already includes the MHOU theories.

alecandido commented 1 year ago

> The current weight of all theories is ~250 GB. I think 1 or 2 TB will be OK for now, since this already includes the MHOU theories.

The main problem is not the MHOU (which amounts to 8 variations × 2 orders, NLO + NNLO), but N3LO, which has ~80 variations.

scarlehoff commented 1 year ago

True, @giacomomagni what is the current combined weight of the N3LO theories?

giacomomagni commented 1 year ago

If we want to store the ekos, they will be quite huge. Right now I'm using 81 theories (from 439 to 519).

scarlehoff commented 1 year ago

In terms of GB, how much would you need? (And how large would the ekos be?)

giacomomagni commented 1 year ago

For a full single theory the ekos should be 4.3 GB, and for a DIS-only theory 1.6 GB. FK tables are of the order of MB, if I'm not mistaken. I believe having NNLO or N3LO doesn't change much; it's just the large number of theories needed.
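Putting the numbers above together, a back-of-the-envelope estimate for the eko storage (assuming all 81 theories are needed, which may not be the case) would be:

```python
# Rough storage estimate from the figures quoted in the thread.
n_theories = 81      # N3LO theory variations (439 to 519)
eko_full_gb = 4.3    # ekos for one full theory
eko_dis_gb = 1.6     # ekos for one DIS-only theory

worst_case = n_theories * eko_full_gb  # every theory with full ekos
dis_only = n_theories * eko_dis_gb     # every theory DIS-only

print(f"full ekos: ~{worst_case:.0f} GB, DIS-only: ~{dis_only:.0f} GB")
```

So the ekos alone would dominate over the ~250 GB currently taken by all grids, which is consistent with the 1-2 TB request above.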

scarlehoff commented 1 year ago

This is definitely needed now. We are at 20 GB. I'm going to write a small utility to do this and we can iterate over it and make it more useful / functional / robust later.

This is what I'm thinking of doing; please chime in with any issues that you spot or ideas that you have. In this first version let's try to make something that works in a very simple manner, that people can use, and that doesn't block us in the future.

  1. The grids themselves will be on the nnpdf server. This should have enough storage (for now) and it has a backup for when a disaster happens (again).
  2. Instead of having the grids here as gridname.pineappl.lz4, they will be text files gridname.txt.

Then we have a python script with two arguments:

theory_script download 400

Will download the current theory 400. This means that it will go .txt file by .txt file and download whatever is in gridname.txt (which will be a hashed name + some info from pineappl, or whatever the name of the fktable is on the nnpdf server).

theory_script upload 400

will instead create (or update, if it has changed) the .txt file and upload the changed/new grids, committing the result to the repository.
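The local half of the upload step above could be sketched as follows. The naming scheme (theory + dataset + date + a short content hash) and the helper names are hypothetical, and the actual transfer to the server is left out:

```python
import hashlib
from datetime import date
from pathlib import Path


def remote_name(grid_path: Path, theory_id: int) -> str:
    """Build a unique server-side name for a grid: theory_dataset_date
    plus a short content hash, so that fixed and broken versions of the
    same grid can coexist on the server."""
    digest = hashlib.sha256(grid_path.read_bytes()).hexdigest()[:8]
    stem = grid_path.name.removesuffix(".pineappl.lz4")
    return f"theory{theory_id}_{stem}_{date.today():%Y%m%d}_{digest}.pineappl.lz4"


def write_pointer(grid_path: Path, theory_id: int) -> Path:
    """Replace the heavy grid by a small .txt pointer file that contains
    only the name the grid has on the server (this pointer is what gets
    committed to the repository)."""
    pointer = grid_path.with_name(
        grid_path.name.removesuffix(".pineappl.lz4") + ".txt"
    )
    pointer.write_text(remote_name(grid_path, theory_id) + "\n")
    return pointer
```

The download direction would simply read each pointer file back and fetch the named file from the server.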


Immediate problems I see with this:

  1. There's no guarantee that the grid has been correctly uploaded, so if the upload failed at some point the user might think they uploaded the grid when they didn't. The easiest solution to this is a third argument:
theory_script check 400

which checks that theory 400 on the server corresponds to the local .txt files.
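The check could be as simple as comparing the names referenced by the local .txt pointer files against a listing of the theory folder on the server. A sketch (the server listing is passed in as a plain set here; in reality it would come from querying the nnpdf server):

```python
from pathlib import Path


def check_theory(theory_dir: Path, server_listing: set[str]) -> list[str]:
    """Return the remote names referenced by local .txt pointer files
    that are missing from the server listing (empty list == all good)."""
    missing = []
    for pointer in sorted(theory_dir.glob("*.txt")):
        name = pointer.read_text().strip()
        if name not in server_listing:
            missing.append(name)
    return missing
```

A non-empty result would tell the user exactly which uploads need to be retried before committing the pointer files.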

Then the user has to commit the .txt files with the new names. I say .txt, but they could just as well be .yaml, so that one can even add comments or extra metadata if one wishes. But in the immediate future I want to limit the scope so that we have it / start using it as soon as possible.

alecandido commented 1 year ago

> Then the user has to commit the .txt files with the new names. I say .txt, but they could just as well be .yaml, so that one can even add comments or extra metadata if one wishes. But in the immediate future I want to limit the scope so that we have it / start using it as soon as possible.

They are an incredibly large bunch of files, so let's keep it simple: just use files that contain the path on the server, and nothing else. Comments on the generated grid could be embedded in the grids themselves; other kinds of comments could be written in other meta files (e.g. one per theory, or something like that) not involved in this linking procedure.

Someone might say that you're essentially about to rewrite vp-upload, and that you should make this an action :) But for me it's fine; theories live in a separate world from vp. However, I would consider making it (at some point, not immediately) part of Pineko, since that is already a CLI managing theories, just to avoid proliferating too much across repos and tools.

felixhekhorn commented 1 year ago

What do we actually need? I.e., what format are you proposing? Like the one on top? I can see 3 levels:

scarlehoff commented 1 year ago

Server level, no. The global registry is the repository; only one server is supported at this point.

Either theory level or grid level, depending on my own feelings when writing the script (I don't think having a file per grid or a file per theory changes things much).

alecandido commented 1 year ago

> (I think @AleCandido is having this in mind, no?)

Nope, I definitely had 1) in mind, because it's more granular and more composable. They will never be manually edited; @scarlehoff proposed a CLI, so I still believe it is the best option, even because it's the simplest: one file for one file.

scarlehoff commented 1 year ago

Closed through https://github.com/NNPDF/theories_slim/issues/1