If possible, we would also need a public location, so that people outside the collaboration are able to use those assets for producing theories (this is a minor requirement for the time being, but a strong one for the Como school).
@scarrazza since we are moving towards having papers done with the new pipeline, do you think we could get extra storage from the INFN to have some kind of (manual if necessary) version control for the theories used in the papers?
We can try, send me an estimate of required space.
The current weight of all theories is ~250 GB. I think 1 or 2 TB will be ok for now, since this already includes the MHOU theories.
The main problem is not MHOU (which is 8 variations x 2 orders, NLO + NNLO), but N3LO, which has ~80 variations.
True, @giacomomagni what is the current combined weight of the N3LO theories?
If we want to store the ekos, they will be quite huge. Right now I'm using 81 theories (from 439 to 519).
In terms of GB, how much would you need? (And how large would the ekos be?)
For a full single theory the ekos should be 4.3 GB, and for a DIS-only one 1.6 GB. FK tables are of the order of MB, if I'm not mistaken. I believe having NNLO or N3LO doesn't change much; it's just the large number of theories needed.
This is definitely needed now. We are at 20 GB. I'm going to write a small utility to do this and we can iterate over it and make it more useful / functional / robust later.
This is what I'm thinking of doing; please chime in with any issues that you spot or ideas that you have. In this first version let's try to make it something that works in a very simple manner, that people can use, and that doesn't block us in the future.
The grids gridname.pineappl.lz4 will be uploaded to the nnpdf server. This should have enough space (for now) and it has a backup for when a disaster happens (again). In the repository they will be text files gridname.txt.
Then we have a Python script with two arguments:
theory_script download 400
will download the current theory 400. This means that it will go .txt file by .txt file and download whatever is in gridname.txt (which will be a hashed name+some_info_from_pineappl, or whatever the name of the fktable is on the nnpdf server).
theory_script upload 400
will instead create (or update if it has changed) the .txt file and upload the changes/new ones to the repository.
An immediate problem I see with this is keeping server and repository in sync, so we could also have
theory_script check 400
which checks that the theory 400 in the server corresponds to the local .txt files.
Then the user has to commit the .txt files with the new names. I'm saying .txt, but they could be .yaml just the same, so that one can even add comments or extra metadata if one wishes. But in the immediate future I want to limit this in scope so that we have it / start using it as soon as possible.
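To make the plan concrete, here is a minimal sketch of how such a script could look, assuming one folder per theory in the repository; the server URL, the folder layout and the hashing scheme are illustrative placeholders, not decisions:

```python
# Sketch of theory_script: one folder per theory, one gridname.txt per grid,
# whose content is the name of the fktable on the nnpdf server.
import hashlib
import sys
from pathlib import Path
from urllib.request import urlretrieve

SERVER = "https://nnpdf.science/t"  # placeholder server root
REPO = Path("theories")  # placeholder local repository folder


def download(theory):
    """Go .txt file by .txt file and fetch whatever each one points to."""
    folder = REPO / theory
    for txt in folder.glob("*.txt"):
        remote_name = txt.read_text().strip()
        urlretrieve(f"{SERVER}/{theory}/{remote_name}", folder / f"{txt.stem}.pineappl.lz4")


def upload(theory):
    """Create or update the .txt files; the actual transfer is only hinted at."""
    folder = REPO / theory
    for grid in folder.glob("*.pineappl.lz4"):
        name = grid.name.removesuffix(".pineappl.lz4")
        # "hashed name + some info" as suggested above, here just a content hash
        digest = hashlib.sha256(grid.read_bytes()).hexdigest()[:8]
        (folder / f"{name}.txt").write_text(f"{name}_{digest}.pineappl.lz4\n")
        # here the grid would be sent to the server under its new name (scp/rsync/...)


def check(theory):
    """Check that the theory on the server corresponds to the local .txt files."""
    raise NotImplementedError  # needs a way to list the server folder


if __name__ == "__main__":
    action, theory = sys.argv[1], sys.argv[2]
    {"download": download, "upload": upload, "check": check}[action](theory)
```

Invocation would then match the commands above, e.g. theory_script download 400.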
There is an incredible number of files, so let's keep it simple: just use files that contain the path on the server, and nothing else. Comments on the generated grid could be embedded in the grids themselves; other kinds of comments could be written in other meta files (e.g. one per theory, or things like that) not involved in this linking procedure.
Someone would say that you're essentially about to rewrite vp-upload, and you should make this an action :)
But for me it's fine, theories are living in a separate world from vp. However, I would consider making it (at some point, not immediately) part of Pineko, since that is already a CLI managing theories, just to avoid proliferating too many repos and tools.
What do we actually need, i.e. what format are you proposing? Like the one on top? I can see 3 levels:
1) grid level: a .txt file for each grid of each theory, with e.g. the content https://nnpdf.science/t/400/atlasbla.pineappl.lz4 - or a .yaml file for each theory with e.g. the content from the top, or even ATLASBLA: https://nnpdf.science/t/400/atlasbla.pineappl.lz4\nCMSBLUB: https://nnpdf.science/t/400/CMSBLUB.pineappl.lz4 (rendered below)
2) theory level (I think @AleCandido is having this in mind, no?): a .txt file for each theory with the content https://nnpdf.science/t/400/; the names of the grids are implicit, i.e. only one name is allowed inside the folder (e.g. prefix with _ to exclude)
3) server level: a .txt file for each server with the content https://nnpdf.science/t/, leaving again theory indices and grid names implicit
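The per-theory .yaml variant of 1) would render as follows (ATLASBLA and CMSBLUB are the invented dataset names from above):

```yaml
# dataset name -> remote location of the corresponding grid
ATLASBLA: https://nnpdf.science/t/400/atlasbla.pineappl.lz4
CMSBLUB: https://nnpdf.science/t/400/CMSBLUB.pineappl.lz4
```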
Server level no. The global registry is the repository. Only one server supported at this point.
Either theory level or grid level depending on my own feelings when writing the script (I don't think having a file per grid or a file per theory changes things much).
(I think @AleCandido is having this in mind, no?)
Nope, I definitely had 1) in mind, because it's more granular and more composable. The files will never be manually edited (@scarlehoff proposed a CLI), so I still believe it is the best option, also because it's the simplest: one file for one file.
Closed through https://github.com/NNPDF/theories_slim/issues/1
At the moment the heavy part of the theories, the grids, is stored in this repository. This is a temporary solution. The grids should live somewhere else (on some external server) and the grids folder should instead be replaced by some kind of file that keeps track of grids and changes.
Something like:
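(a minimal sketch, assuming grid_meta.yaml simply maps each grid name to its name_in_remote_folder; the dataset names and the date are invented placeholders)

```yaml
# gridname: name_in_remote_folder (theory_dataset_date)
ATLASBLA.pineappl.lz4: 400_ATLASBLA_20230415
CMSBLUB.pineappl.lz4: 400_CMSBLUB_20230415
```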
The name_in_remote_folder should be a unique identifier (to first approximation it can be theory_dataset_date) so that, if a fix is necessary for a given grid, both the fixed and the broken grids can be kept on the server and the changes to the grid_meta.yaml file can be tracked in this repository.
@scarrazza can we use the INFN for this? We would need a lot of storage potentially.