Can refer to this script from cms-2012-event-display-files as a starting point for a new script for the derived datasets, taking into account the following notes:
- `eos ls /eos/opendata/cms/derived-data/`, e.g. `eos ls /eos/opendata/cms/derived-data/POET/23-Jul-22/`
- "merged" files should be excluded
- `authors` field will be set as "CMS Open Data Group" for all datasets
- `number_files` can be extracted from the derived-data directory in eos
- `methodology`, corresponding to the "How were these data selected" section, will differ from one type to another; it can be changed in the Jinja html template with some conditional logic. The file linked in this section will be a link to the corresponding software record
- `title` of each type of dataset
- `description` under `abstract` can be extracted from a csv file that has the datasets' descriptions. We could have a description template for each type of derived dataset
- `dataset_semantics`, corresponding to the "Dataset characteristics" section, can also be extracted as input from a csv file listing them for each dataset

Instead of getting `description` and `dataset_semantics` from csv files, and to easily be able to configure the script with any other necessary hard-coded values, we could have a yaml file that has these values and the script could take this file as input. The yaml file could have a structure similar to this:
```yaml
common values:
  collision_energy: "0.2TeV"
  keywords:
    - education
    - outreach
  description: >
    This is a very long description saying this and that. It can even have
    many lines. So it could be quite comfortable to enter desired dataset
    descriptions even if they are really long.
datasets:
  - name: "/BTag/Run2012C-22Jan2013-V1/AOD"
    number_of_events: 123456789
  - name: "/CTag/Run2012C-22Jan2013-V1/AOD"
    number_of_events: 456
  - name: "/DTag/Run2012C-22Jan2013-V1/AOD"
    number_of_events: 456324234
```
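A minimal sketch of how the script could read such a yaml file (assuming the structure above and PyYAML; the file name and the mapping of the yaml fields into a record are only examples):

```python
import yaml  # PyYAML

# Load the configuration file sketched above (the file name is an example)
with open("derived_datasets_config.yaml") as f:
    config = yaml.safe_load(f)

common = config["common values"]      # values shared by all records
for dataset in config["datasets"]:    # per-dataset values
    record = dict(common)             # start from the common values ...
    record["title"] = dataset["name"]                         # ... and add per-dataset ones
    record["number_of_events"] = dataset["number_of_events"]
    print(record["title"], record["number_of_events"])
```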
For the number of events, you can use the following (if running where ROOT is available):

```python
import ROOT

myfile = ROOT.TFile.Open("http://opendata.cern.ch/eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/Run2012C_SingleMu/FF6C9C2D-3B37-43D7-A9B0-043CB2AC8202.root")
myfile.Events.GetEntries()
```
In older versions, the number of events may appear as a "long" integer, e.g. 563709L; in that case, use `int(myfile.Events.GetEntries())`.
The POET output has a different structure, and there are two versions of it:

- `events` tree: `myfile.events.GetEntries()` (NB: lower case `e`)
- `myfile.myelectrons.Events.GetEntries()` (NB: upper case `E`)

The number of events is the same in both cases.
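A small helper could cover the different layouts; this is a sketch only, assuming PyROOT and the tree and directory names mentioned above (the function name and the fallback order are assumptions):

```python
import ROOT

def count_events(url):
    """Return the number of events, trying the layouts described above."""
    f = ROOT.TFile.Open(url)
    keys = f.GetListOfKeys()
    if keys.Contains("Events"):
        # NanoAODRun1-like files: a single "Events" tree (upper case E)
        tree = f.Events
    elif keys.Contains("events"):
        # Flattened POET files: a single "events" tree (lower case e)
        tree = f.events
    else:
        # Original POET output: per-object trees, e.g. "myelectrons/Events"
        tree = f.Get("myelectrons/Events")
    # int() in case an older version returns a "long"
    return int(tree.GetEntries())
```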
Further details for the three types of derived data:
Files under /eos/opendata/cms/derived-data/POET/23-Jul-22/
These are the files used in the 2022 workshop lesson https://cms-opendata-workshop.github.io/workshop2022-lesson-ttbarljetsanalysis/
For each dataset, we have:

- `<dataset>`: directory with root files as a direct output of POET (separate trees for each object)
- `<dataset>_flat`: directory with root files "flattened" to a single tree, as required when used as input to coffea with the nanoevents schema
- `<dataset>_flat.root`: the separate files in the `<dataset>_flat` directory merged into one file

e.g.

- RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8
- RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat
- RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat.root
Finally, there is no reason to leave out the merged file, we can as well have it in the record. So all files go in a single derived `<dataset>` record:

- `<dataset>` dataset in reduced NanoAOD-like format
- `<dataset>` dataset, readable with bare ROOT or other ROOT-compatible software. It was produced for the CMS open data workshop tutorials. It is provided in three different structures (then the list from above)
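As an illustration, the file list for such a record could be collected from the eos directory along these lines (a sketch only, assuming the `eos` command line tool is available and the directory layout described above; the grouping keys are made up for illustration):

```python
import subprocess
from collections import defaultdict

# List the POET production directory on eos (path taken from above)
base = "/eos/opendata/cms/derived-data/POET/23-Jul-22/"
entries = subprocess.run(
    ["eos", "ls", base], capture_output=True, text=True, check=True
).stdout.split()

# Group the entries per dataset: <dataset>, <dataset>_flat, <dataset>_flat.root
records = defaultdict(dict)
for entry in entries:
    if entry.endswith("_flat.root"):
        records[entry[: -len("_flat.root")]]["merged_file"] = entry
    elif entry.endswith("_flat"):
        records[entry[: -len("_flat")]]["flat_dir"] = entry
    else:
        records[entry]["poet_dir"] = entry

for dataset, files in sorted(records.items()):
    print(dataset, files)
```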
Files under /eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/
These are the files used in the 2022 workshop https://cms-opendata-workshop.github.io/workshop2022-lesson-run1example/
For each dataset we have:

- `<dataset>`: directory with root files as produced by https://github.com/cms-opendata-analyses/NanoAODRun1ProducerTool
- `<dataset>_merged.root`: the separate files in the `<dataset>` directory merged into one file

So all files go in a single derived `<dataset>` record:

- `<dataset>` dataset in Run1 NanoAOD-like format
- `<dataset>` dataset in a NanoAOD-like research-level Ntuple format for CMS Run1 data, readable with bare ROOT or other ROOT-compatible software, and containing the per-event information that is needed in most generic analyses. In contrast to the CMS NanoAOD format, which is derived from MiniAOD, it is generated directly from the AOD format with completely independent code provided by the CMS open data group. Nevertheless, there is a large overlap in functionality and content between NanoAODRun1 and NanoAOD such that common analyses are possible. It is provided as a collection of root files under the `<dataset>` directory, and in `<dataset>_merged.root` with the separate files in the `<dataset>` directory merged into one file.

For titles and format, see https://github.com/cernopendata/opendata.cern.ch/issues/3349#issuecomment-1812439120
Files to be moved
For each dataset, files are under /<dataset>/Run2016G-UL2016_MiniAODv2-v2_PFNanoAODv1/
The derived `<dataset>` records:

- `<dataset>` dataset in NanoAOD format enhanced with Particle Flow candidates

The file types of the "normal" collision and simulated data will be `nanoaod` and `nanoaodsim`, respectively. We should reflect that in the types of the derived datasets, so that collision data (all derived datasets starting with `RunYYYYN_`) will be `nanoaod-<type>` and simulated data `nanoaodsim-<type>`.
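One possible way to derive the type string in the script is sketched below; the function name, the `pfnano` label used for `<type>`, the exact prefix check and the dataset names in the usage lines are assumptions for illustration:

```python
import re

def derived_type(dataset_name, derived_format="pfnano"):
    """Return the file type string for a derived dataset."""
    # Collision data: derived dataset names starting with RunYYYYN_, e.g. Run2016G_
    if re.match(r"Run\d{4}[A-Z]_", dataset_name):
        return f"nanoaod-{derived_format}"
    # Otherwise assume simulated data
    return f"nanoaodsim-{derived_format}"

print(derived_type("Run2016G_SingleMuon"))   # -> nanoaod-pfnano
print(derived_type("TT_TuneCP5_13TeV"))      # -> nanoaodsim-pfnano
```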
The recids for the production code are:
Done
The CMS 2016 release will include several "derived data" records, structurally similar to e.g. https://opendata.cern.ch/record/12341. They will be:
We should have a script template to create such records, which can be run in a similar way as those for collision or MC records.
For the provenance, they will link to the parent dataset and the SW that was used to produce them (e.g. Run1 Nano: https://github.com/cernopendata/opendata.cern.ch/issues/3281). Both will be available as CODP records. So there is no need for an extended provenance listing, as it is already available in the parent dataset record.
For the variable description, these records can link to listings of this type. These html files (one per type of production) should be hosted on the OD portal.
In the scripts, all metadata variables should be collected at the start of the script, for ease of reuse.
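For illustration, a script skeleton following these notes might look roughly like this; the variable values, file names and record-type labels are assumptions, and the Jinja conditional for the methodology section is only indicated in the comments:

```python
from jinja2 import Environment, FileSystemLoader

# --- all metadata variables collected at the start of the script, for ease of reuse ---
RECORD_TYPE = "nanoaodrun1"          # e.g. "poet", "nanoaodrun1", "pfnano" (assumed labels)
AUTHORS = "CMS Open Data Group"      # the same for all derived datasets
COLLISION_ENERGY = "13TeV"
KEYWORDS = ["education", "outreach"]
TEMPLATE_FILE = "derived_record.json.j2"   # hypothetical Jinja template

env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template(TEMPLATE_FILE)

record_json = template.render(
    record_type=RECORD_TYPE,
    authors=AUTHORS,
    collision_energy=COLLISION_ENERGY,
    keywords=KEYWORDS,
)
print(record_json)

# Inside the template, the "How were these data selected" (methodology) text can be
# chosen per production type with conditional logic such as:
#   {% if record_type == "poet" %} ... {% elif record_type == "nanoaodrun1" %} ... {% endif %}
```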