Can refer to this script from cms-2012-event-display-files as a starting point for a new script for the derived datasets, taking into account the following notes:
- `eos ls /eos/opendata/cms/derived-data/`, e.g. `eos ls /eos/opendata/cms/derived-data/POET/23-Jul-22/`
- "merged" files should be excluded
- `authors` field will be set as "CMS Open Data Group" for all datasets
- `number_files` can be extracted from the derived-data directory in eos
- `methodology`, corresponding to the "How were these data selected" section, will differ from one type to another; it can be changed in the Jinja html template with some conditional logic. The file linked in this section will be a link to the corresponding software record
- `title` of each type of dataset
- `description` under `abstract` can be extracted from a csv file that has the datasets' descriptions. We could have a description template for each type of derived dataset
- `dataset_semantics`, corresponding to the "Dataset characteristics" section, can also be extracted as input from a csv file listing them for each dataset

Instead of getting `description` and `dataset_semantics` from csv files, and to easily be able to configure the script with any other necessary hard-coded values, we could have a yaml file that has these values and the script could take this file as input. The yaml file could have a structure similar to this:
```yaml
common values:
  collision_energy: "0.2TeV"
  keywords:
    - education
    - outreach
  description: >
    This is a very long description saying this and that. It can even have
    many lines. So it could be quite comfortable to enter desired dataset
    descriptions even if they are really long.
datasets:
  - name: "/BTag/Run2012C-22Jan2013-V1/AOD"
    number_of_events: 123456789
  - name: "/CTag/Run2012C-22Jan2013-V1/AOD"
    number_of_events: 456
  - name: "/DTag/Run2012C-22Jan2013-V1/AOD"
    number_of_events: 456324234
```
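A minimal sketch of how the script could read such a yaml file (assuming the structure above and PyYAML; the file name and the mapping of the yaml fields into a record are only examples):

```python
import yaml  # PyYAML

# Load the configuration file sketched above (the file name is an example)
with open("derived_datasets_config.yaml") as f:
    config = yaml.safe_load(f)

common = config["common values"]      # values shared by all records
for dataset in config["datasets"]:    # per-dataset values
    record = dict(common)             # start from the common values ...
    record["title"] = dataset["name"]                         # ... and add per-dataset ones
    record["number_of_events"] = dataset["number_of_events"]
    print(record["title"], record["number_of_events"])
```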
For the number of events, you can use the following (if running where ROOT is available):

```python
import ROOT

myfile = ROOT.TFile.Open("http://opendata.cern.ch/eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/Run2012C_SingleMu/FF6C9C2D-3B37-43D7-A9B0-043CB2AC8202.root")
myfile.Events.GetEntries()
```
In older versions, the number of events may appear as a "long" integer, e.g. 563709L; in that case, use `int(myfile.Events.GetEntries())`.
The POET output has a different structure, and there are two versions of it:

- `events` tree: `myfile.events.GetEntries()` (NB: lower case `e`)
- `myfile.myelectrons.Events.GetEntries()` (NB: upper case `E`)

The number of events is the same in both cases.
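A small helper could cover the different layouts; this is a sketch only, assuming PyROOT and the tree and directory names mentioned above (the function name and the fallback order are assumptions):

```python
import ROOT

def count_events(url):
    """Return the number of events, trying the layouts described above."""
    f = ROOT.TFile.Open(url)
    keys = f.GetListOfKeys()
    if keys.Contains("Events"):
        # NanoAODRun1-like files: a single "Events" tree (upper case E)
        tree = f.Events
    elif keys.Contains("events"):
        # Flattened POET files: a single "events" tree (lower case e)
        tree = f.events
    else:
        # Original POET output: per-object trees, e.g. "myelectrons/Events"
        tree = f.Get("myelectrons/Events")
    # int() in case an older version returns a "long"
    return int(tree.GetEntries())
```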
Further details for the three types of derived data:
Files under /eos/opendata/cms/derived-data/POET/23-Jul-22/
These are the files used in the 2022 workshop lesson https://cms-opendata-workshop.github.io/workshop2022-lesson-ttbarljetsanalysis/
For each dataset, we have:

- `<dataset>`: directory with root files as a direct output of POET (separate trees for each object)
- `<dataset>_flat`: directory with root files "flattened" to a single tree, as required when used as input to coffea with the nanoevents schema
- `<dataset>_flat.root`: the separate files in the `<dataset>_flat` directory merged into one file

e.g.

- RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8
- RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat
- RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat.root
Finally, there is no reason to leave out the merged file, we can as well have it in the record. So all files go in a single derived `<dataset>` record:

- `<dataset>` dataset in reduced NanoAOD-like format
- `<dataset>` dataset, readable with bare ROOT or other ROOT-compatible software. It was produced for the CMS open data workshop tutorials. It is provided in three different structures (then the list from above)
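As an illustration, the file list for such a record could be collected from the eos directory along these lines (a sketch only, assuming the `eos` command line tool is available and the directory layout described above; the grouping keys are made up for illustration):

```python
import subprocess
from collections import defaultdict

# List the POET production directory on eos (path taken from above)
base = "/eos/opendata/cms/derived-data/POET/23-Jul-22/"
entries = subprocess.run(
    ["eos", "ls", base], capture_output=True, text=True, check=True
).stdout.split()

# Group the entries per dataset: <dataset>, <dataset>_flat, <dataset>_flat.root
records = defaultdict(dict)
for entry in entries:
    if entry.endswith("_flat.root"):
        records[entry[: -len("_flat.root")]]["merged_file"] = entry
    elif entry.endswith("_flat"):
        records[entry[: -len("_flat")]]["flat_dir"] = entry
    else:
        records[entry]["poet_dir"] = entry

for dataset, files in sorted(records.items()):
    print(dataset, files)
```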
Files under /eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/
These are the files used in the 2022 workshop https://cms-opendata-workshop.github.io/workshop2022-lesson-run1example/
For each dataset we have:

- `<dataset>`: directory with root files as produced by https://github.com/cms-opendata-analyses/NanoAODRun1ProducerTool
- `<dataset>_merged.root`: the separate files in the `<dataset>` directory merged into one file

So all files go in a single derived `<dataset>` record:

- `<dataset>` dataset in Run1 NanoAOD-like format
- `<dataset>` dataset in a NanoAOD-like research-level Ntuple format for CMS Run1 data, readable with bare ROOT or other ROOT-compatible software, and containing the per-event information that is needed in most generic analyses. In contrast to the CMS NanoAOD format, which is derived from MiniAOD, it is generated directly from the AOD format with completely independent code provided by the CMS open data group. Nevertheless, there is a large overlap in functionality and content between NanoAODRun1 and NanoAOD such that common analyses are possible. It is provided as a collection of root files under the `<dataset>` directory, and in `<dataset>_merged.root` with the separate files in the `<dataset>` directory merged into one file.

For titles and format, see https://github.com/cernopendata/opendata.cern.ch/issues/3349#issuecomment-1812439120
Files to be moved
For each dataset, files are under /<dataset>/Run2016G-UL2016_MiniAODv2-v2_PFNanoAODv1/
The derived `<dataset>` records:

- `<dataset>` dataset in NanoAOD format enhanced with Particle Flow candidates

The file types of the "normal" collision and simulated data will be `nanoaod` and `nanoaodsim`, respectively. We should reflect that in the types of the derived datasets, so that collision data (all derived datasets starting with `RunYYYYN_`) will be `nanoaod-<type>` and simulated data `nanoaodsim-<type>`.
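One possible way to derive the type string in the script is sketched below; the function name, the `pfnano` label used for `<type>`, the exact prefix check and the dataset names in the usage lines are assumptions for illustration:

```python
import re

def derived_type(dataset_name, derived_format="pfnano"):
    """Return the file type string for a derived dataset."""
    # Collision data: derived dataset names starting with RunYYYYN_, e.g. Run2016G_
    if re.match(r"Run\d{4}[A-Z]_", dataset_name):
        return f"nanoaod-{derived_format}"
    # Otherwise assume simulated data
    return f"nanoaodsim-{derived_format}"

print(derived_type("Run2016G_SingleMuon"))   # -> nanoaod-pfnano
print(derived_type("TT_TuneCP5_13TeV"))      # -> nanoaodsim-pfnano
```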
The recids for the production code are:
Done
The CMS 2016 release will include several "derived data" records, structurally similar to e.g. https://opendata.cern.ch/record/12341. They will be:
We should have a script template to create such records, which can be run in a similar way as those for collision or MC records.
For the provenance, they will link to the parent dataset and the SW that was used to produce them (e.g. Run1 Nano: https://github.com/cernopendata/opendata.cern.ch/issues/3281). Both will be available as CODP records. So there is no need for an extended provenance listing, as it is already available in the parent dataset record.
For the variable description, these records can link to listings of this type. These html files (one per type of production) should be hosted on the OD portal.
In the scripts, all metadata variables should be collected at the start of the script, for ease of reuse.
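For illustration, a script skeleton following these notes might look roughly like this; the variable values, file names and record-type labels are assumptions, and the Jinja conditional for the methodology section is only indicated in the comments:

```python
from jinja2 import Environment, FileSystemLoader

# --- all metadata variables collected at the start of the script, for ease of reuse ---
RECORD_TYPE = "nanoaodrun1"          # e.g. "poet", "nanoaodrun1", "pfnano" (assumed labels)
AUTHORS = "CMS Open Data Group"      # the same for all derived datasets
COLLISION_ENERGY = "13TeV"
KEYWORDS = ["education", "outreach"]
TEMPLATE_FILE = "derived_record.json.j2"   # hypothetical Jinja template

env = Environment(loader=FileSystemLoader("templates"))
template = env.get_template(TEMPLATE_FILE)

record_json = template.render(
    record_type=RECORD_TYPE,
    authors=AUTHORS,
    collision_energy=COLLISION_ENERGY,
    keywords=KEYWORDS,
)
print(record_json)

# Inside the template, the "How were these data selected" (methodology) text can be
# chosen per production type with conditional logic such as:
#   {% if record_type == "poet" %} ... {% elif record_type == "nanoaodrun1" %} ... {% endif %}
```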