cta-observatory / lst-sim-config

Repository to store configurations of MC simulations for LST (+MAGIC)

Reorganisation of the MC data directory tree #42

Open Voutsi opened 2 years ago

Voutsi commented 2 years ago

Currently, the training dataset is stored at /home/georgios.voutsinas/ws/AllSky/TrainingDataset. There are two directories, one for protons and one for gamma diffuse. For each particle type, we have one directory per declination band. The exception is the Crab's band, whose data are stored simply in directories called Corsika and sim_telarray; I will move them this weekend to a directory called dec_2276. Each declination band directory splits into a Corsika and a sim_telarray directory, and each of these contains one directory per node.

The structure is illustrated in the example directory tree attached below.

[Image: data_dir_tree]

Please let me know if this scheme is satisfactory or we should organise the data in a more optimal way.
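As a rough illustration of the layout described above, the following Python sketch walks a tree of the form `<particle>/<declination band>/<stage>/<node>` and lists the node directories it finds. The function and class names are made up for this example, and the actual directory names on disk may differ from the ones mentioned in the comments.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class NodeDir(NamedTuple):
    particle: str  # protons or gamma diffuse directory (exact names assumed)
    dec_band: str  # e.g. "dec_2276"
    stage: str     # "Corsika" or "sim_telarray"
    node: str      # one directory per node
    path: Path


def iter_node_dirs(base: Path) -> Iterator[NodeDir]:
    """Yield every directory found exactly four levels below `base`."""
    for path in sorted(base.glob("*/*/*/*")):
        if path.is_dir():
            particle, dec_band, stage, node = path.relative_to(base).parts
            yield NodeDir(particle, dec_band, stage, node, path)
```

Calling `list(iter_node_dirs(Path("TrainingDataset")))` would then give one entry per node directory, which makes it easy to check that every declination band has both a Corsika and a sim_telarray branch.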

moralejo commented 2 years ago

Thanks, looks good to me. Other opinions?

maxnoe commented 2 years ago

@Voutsi In your screenshot I see that you are using gzip compression for the simtel files.

We should use zstd: it is slightly faster to write, a bit smaller and much faster to read. It is also what is used for standard CTA productions.
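When switching compression, it can help to check which existing files are still gzip-compressed. Below is a minimal sketch that uses only the Python standard library and looks at the leading magic bytes of a file instead of trusting its extension. The function name is hypothetical and not part of any existing tool.

```python
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"          # first two bytes of a gzip stream
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # zstd frame magic number, as stored on disk


def detect_compression(path: Path) -> str:
    """Return "gzip", "zstd" or "unknown" based on the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    return "unknown"
```

This only identifies the container format. It does not validate that the rest of the file is intact.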

Voutsi commented 2 years ago

Thanks @maxnoe, I was not aware of that. As long as the pipeline can digest .zstd files, I agree we should switch.

rlopezcoto commented 2 years ago

@Voutsi thanks, I agree with the proposed organization. However, according to Daniel Mazin's comment from today, you may have trouble writing all the MC in your home folder. Shall we start transferring them elsewhere in the organization tree?

Voutsi commented 2 years ago

@rlopezcoto the MC is stored in fefs:

/fefs/aswg/workspace/georgios.voutsinas/AllSky/

I have a symlink in my home folder pointing to the storage space at fefs and this is what I showed in the slides today (I agree it was confusing...)

Or do you mean that I will also have problems storing it in /fefs/aswg/workspace/georgios.voutsinas/?

rlopezcoto commented 2 years ago

no, in the workspace folder it should be fine, no limits there so far

jsitarek commented 2 years ago

Since those will be the first really official MCs to be used by many analyzers, maybe it would be good for the main path to specify that it is LSTProd2, something like /fefs/aswg/mc/LSTProd2/... (to mimic what is done for the data). Actually, there is a directory /fefs/aswg/data/mc with some old 2020/2021 files; those can be moved away and .../data/mc could be used as well (but I think it is more confusing than just ...aswg/mc).

Another thing: while corsika is expected to be just one directory, sim_telarray will be run multiple times with various settings, so I think it would be good to add to "sim_telarray" some tags describing the time period for which they are produced (dates or analysis periods) and the settings ("nominal", "low_NSB" or something like this).
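To make the tagging idea concrete, here is a small sketch of a helper that builds such a directory name. The `sim_telarray_<period>_<settings>` pattern, the function name and the example period tag are all assumptions for illustration; no naming convention has been agreed on in this thread.

```python
def sim_telarray_dir_name(period: str, settings: str) -> str:
    """Build a tagged directory name of the form sim_telarray_<period>_<settings>."""
    for tag in (period, settings):
        # Reject tags that would break the path or be awkward in a shell.
        if not tag or "/" in tag or " " in tag:
            raise ValueError(f"invalid tag: {tag!r}")
    return f"sim_telarray_{period}_{settings}"
```

For example, `sim_telarray_dir_name("P1", "low_NSB")` gives `sim_telarray_P1_low_NSB`. Since a settings tag such as "low_NSB" itself contains an underscore, parsing the name back into its parts would need either a different separator or a fixed list of allowed settings.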

Voutsi commented 2 years ago

Hi @jsitarek, sounds good to me. So I will create /fefs/aswg/mc/LSTProd2/, recreate the same directory structure there, and then symlink the data files, configs and logs.

Sure, I can add a suffix to the sim_telarray directories. I understand that for now we are producing the nominal ones.
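The plan of recreating the directory structure and symlinking the files into it could be sketched as follows. This is only an illustration under the assumption that every regular file under the source tree should get a symlink; the function name is hypothetical.

```python
import os
from pathlib import Path


def mirror_with_symlinks(src: Path, dst: Path) -> int:
    """Recreate the directory tree of `src` under `dst` and symlink each file.

    Returns the number of symlinks created. Existing entries are left untouched,
    so the function can be re-run as new files appear.
    """
    created = 0
    for dirpath, _dirnames, filenames in os.walk(src):
        target_dir = dst / Path(dirpath).relative_to(src)
        target_dir.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            link = target_dir / name
            if not link.is_symlink() and not link.exists():
                # Link to the absolute location so the link stays valid
                # regardless of the current working directory.
                link.symlink_to((Path(dirpath) / name).resolve())
                created += 1
    return created
```

Real directories are created on the destination side and only the files are links, so the public tree can later be simplified or renamed without touching the workspace.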

Voutsi commented 2 years ago

Hi @maxnoe, zstd is not installed and I don't have the privileges to install it. Shall we ask the admins, or can someone else install it?

maxnoe commented 2 years ago

@Voutsi To have it installed system-wide, yes, ask the admins. However, it is already available in the lstchain conda environments.

vuillaut commented 2 years ago

Some of this discussion happened in emails; my bad, I missed the discussion in this repo! So I will duplicate here what I wrote previously. These are only my thoughts from what worked in the previous prods. I may be missing (technical) points, so take everything as suggestions.

  1. Please symlink all (not only Test) productions under /fefs/aswg/data/mc/DL0

    • I agree that DL0 is not entirely accurate; this is what we have been using until now, but it can be changed
    • then the other data levels will follow the same structure so it's easy for users to understand
  2. I would argue that the symlinked structure should be as simple as possible, removing intermediate single directories (e.g. sim_telarray, output...) and presumably also log and job files, corsika files, etc.

  3. different MC settings should lead to a different "MC prod ID" - much like 20200629_prod5_trans_80 with all produced files using these settings under that dir

    • thus avoiding a loose and not very self-explanatory v1.4 lower in the tree structure
  4. use the same nomenclature everywhere for similar things - it will be clearer and help parsing

    • for example node_theta_xx_az_yy vs node_corsika_theta_xx_az_yy
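To illustrate the parsing point in item 4, here is a sketch of a single regular expression that accepts both spellings mentioned there. It assumes that the `xx` and `yy` placeholders are plain non-negative numbers with an optional decimal part; the real node names may follow a different convention.

```python
import re
from typing import Optional, Tuple

# Accepts both "node_theta_xx_az_yy" and "node_corsika_theta_xx_az_yy".
# Assumption: xx and yy are numbers such as "10" or "102.2".
NODE_RE = re.compile(
    r"^node_(?:corsika_)?theta_(?P<theta>\d+(?:\.\d+)?)_az_(?P<az>\d+(?:\.\d+)?)$"
)


def parse_node_name(name: str) -> Optional[Tuple[float, float]]:
    """Return (theta, az) parsed from a node directory name, or None if it does not match."""
    match = NODE_RE.match(name)
    if match is None:
        return None
    return float(match["theta"]), float(match["az"])
```

With a single nomenclature, the optional `corsika_` group would not be needed and the pattern becomes correspondingly simpler.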

A final structure could look like this:

/fefs/aswg/data/mc/DL0
  └── allsky_v1.4_trig_xxx_trans_zzz
      ├── Testing
      │   ├── node_theta_xx_az_yy
      │   ├── node_theta_xx_az_yy
      │   └── node_theta_xx_az_yy
      └── Training
          ├── GammaDiffuse
          │   ├── dec_2276
          │   └── dec_3476
          │       ├── node_theta_xx_az_yy
          │       └── node_theta_xx_az_yy
          └── Protons
              ├── dec_2276
              └── dec_3476
                  ├── node_theta_xx_az_yy
                  └── node_theta_xx_az_yy

EDIT: Georgios made me realise declination should be lower in the tree so I edited the example accordingly
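The proposed structure can be captured in a small path helper, sketched below. It encodes the one asymmetry in the tree: Testing nodes sit directly under the production directory, while Training nodes are grouped by particle and declination band. The function name and argument names are made up for this example, and the production ID is passed in as-is because its final format is still under discussion.

```python
from pathlib import Path
from typing import Optional

DL0_ROOT = Path("/fefs/aswg/data/mc/DL0")


def node_path(
    prod_id: str,
    dataset: str,
    node: str,
    particle: Optional[str] = None,
    dec_band: Optional[str] = None,
    root: Path = DL0_ROOT,
) -> Path:
    """Return the directory of one node in the proposed DL0 layout."""
    if dataset == "Testing":
        if particle is not None or dec_band is not None:
            raise ValueError("Testing nodes have no particle or declination level")
        return root / prod_id / "Testing" / node
    if dataset == "Training":
        if particle is None or dec_band is None:
            raise ValueError("Training nodes need both particle and dec_band")
        return root / prod_id / "Training" / particle / dec_band / node
    raise ValueError(f"unknown dataset: {dataset!r}")
```

For instance, `node_path("allsky_v1.4_trig_xxx_trans_zzz", "Training", "node_theta_xx_az_yy", particle="Protons", dec_band="dec_3476")` reproduces the last branch of the tree above.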