cernopendata / data-curation

Data ingestion and curation tools
GNU General Public License v2.0

CMS - create a script for derived data records #212

Closed katilp closed 1 month ago

katilp commented 1 year ago

The CMS 2016 release will include several "derived data" records structurally similar to, e.g., https://opendata.cern.ch/record/12341. They will be:

We should have a script template to create such records that can be run in a similar way to those for collision or MC records.

For the provenance, they will link to the parent dataset and the SW that was used to produce them (e.g. Run1 Nano: https://github.com/cernopendata/opendata.cern.ch/issues/3281). Both will be available as CODP records, so there is no need for an extended provenance listing, as it is already available in the parent dataset record.

For the variable description, these records can link to listings of this type. These HTML files (one per type of production) should be hosted on the OD portal.

In the scripts, all metadata variables should be collected at the start of the script for ease of reuse.
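As an illustration of that layout, here is a minimal sketch of a script with every metadata variable at the top. All names and values in it (`DERIVED_TYPE`, `PARENT_RECID`, `SOFTWARE_RECID`, `VARIABLES_URL`, `make_record`, and the output field names) are assumptions chosen for the example; they are not the portal's record schema and not taken from an existing script.

```python
import json

# --- Metadata variables: edit these, leave the rest of the script alone ---
# Every value here is a placeholder (assumption), not real record metadata.
DERIVED_TYPE = "NanoAODRun1"
PARENT_RECID = "<parent-recid>"
SOFTWARE_RECID = "<software-recid>"
VARIABLES_URL = "<url-of-variable-listing>"
DATASETS = [
    {"name": "/BTag/Run2012C-22Jan2013-V1/AOD", "number_of_events": 123456789},
]


def make_record(dataset):
    """Build one record dict from the variables above (field names are illustrative)."""
    return {
        "title": f"{DERIVED_TYPE} derived from {dataset['name']}",
        "number_of_events": dataset["number_of_events"],
        "parent_recid": PARENT_RECID,
        "software_recid": SOFTWARE_RECID,
        "variables_url": VARIABLES_URL,
    }


def main():
    # Emit all records as one JSON list on standard output.
    print(json.dumps([make_record(d) for d in DATASETS], indent=2))


if __name__ == "__main__":
    main()
```

With this layout, reusing the script for another production type should only require editing the block of variables at the top.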

nancyhamdan commented 1 year ago

You can refer to this script from cms-2012-event-display-files as a starting point for a new script for derived datasets, taking into account the following notes:

dataset:
  name: "/BTag/Run2012C-22Jan2013-V1/AOD"
  number_of_events: 123456789
dataset:
  name: "/CTag/Run2012C-22Jan2013-V1/AOD"
  number_of_events: 456
dataset:
  name: "/DTag/Run2012C-22Jan2013-V1/AOD"
  number_of_events: 456324234

katilp commented 1 year ago

For the number of events, you can use the following (if running where ROOT is available):

NanoAODRun1 and PFNano:

import ROOT

# Open one file over HTTP and read the number of entries in its Events tree
myfile = ROOT.TFile.Open("http://opendata.cern.ch/eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/Run2012C_SingleMu/FF6C9C2D-3B37-43D7-A9B0-043CB2AC8202.root")
myfile.Events.GetEntries()

In older versions, the number of events might appear as a "long" integer, e.g. 563709L; in that case, use int(myfile.Events.GetEntries()).
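To get the total for a dataset made of several files, the same calls can be wrapped in a loop. The following is a minimal sketch, not an existing tool: the helper name `count_events` and its `open_file` parameter are assumptions introduced here (the parameter only exists so the helper can be exercised without ROOT), and the tree is assumed to be called `Events` as in the snippet above.

```python
def count_events(urls, open_file=None, tree_name="Events"):
    """Sum the entries of the tree `tree_name` over all files in `urls`."""
    if open_file is None:
        # Imported lazily so the helper can be imported where ROOT is missing.
        import ROOT
        open_file = ROOT.TFile.Open
    total = 0
    for url in urls:
        f = open_file(url)
        # A failed open yields a null or "zombie" file object.
        if not f or f.IsZombie():
            raise OSError(f"could not open {url}")
        # int() also covers the case where a "long" integer is returned.
        total += int(f.Get(tree_name).GetEntries())
        f.Close()
    return total
```

Where ROOT is available, `count_events([url1, url2])` with the file URLs of one dataset would give the value to put in the record.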

POET:

The POET output has a different structure, and there are two versions of it:

The number of events is the same in both cases.

katilp commented 11 months ago

Further details for the three types of derived data:

POET

Files under /eos/opendata/cms/derived-data/POET/23-Jul-22/

These are the files used in the 2022 workshop lesson https://cms-opendata-workshop.github.io/workshop2022-lesson-ttbarljetsanalysis/

For each dataset, we have:

e.g.

RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8
RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat
RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat.root

In the end, there is no reason to leave out the merged file; we may as well include it in the record. So all files go in a single derived <dataset> record:
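As a rough sketch of how a script might collect the entries that belong to one record, the naming pattern in the example above (a base name, the same name with `_flat`, and with `_flat.root`) can be used to group names by dataset. The helper name `group_by_dataset` is hypothetical, and the code assumes the only suffixes in use are the two seen in the example.

```python
from collections import defaultdict


def group_by_dataset(names):
    """Group directory and file names by their dataset base name."""
    groups = defaultdict(list)
    for name in names:
        stem = name
        # Check the longer suffix first so "_flat.root" is stripped whole.
        for suffix in ("_flat.root", "_flat"):
            if stem.endswith(suffix):
                stem = stem[: -len(suffix)]
                break
        groups[stem].append(name)
    return dict(groups)
```

Each key of the returned dictionary would then correspond to one derived <dataset> record.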

NanoAODRun1

Files under /eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/

These are the files used in the 2022 workshop https://cms-opendata-workshop.github.io/workshop2022-lesson-run1example/

For each dataset, we have:

So all files go in a single derived <dataset> record:

For titles and format, see https://github.com/cernopendata/opendata.cern.ch/issues/3349#issuecomment-1812439120

PFNano

Files to be moved

For each dataset, files are under /<dataset>/Run2016G-UL2016_MiniAODv2-v2_PFNanoAODv1/

The derived <dataset> records:

katilp commented 11 months ago

The file types of the "normal" collision and simulated data will be nanoaod and nanoaodsim, respectively. We should reflect that in the types of the derived datasets, so that collision data (all derived datasets starting with RunYYYYN_) will be nanoaod-<type> and simulated data will be nanoaodsim-<type>.
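In a script, this rule could look roughly like the sketch below. It assumes that YYYY is a four-digit year and N a single uppercase era letter (as in Run2012C_SingleMu); the function name `derived_file_type` and the `mytype` value used in the usage note are placeholders, not agreed type names.

```python
import re

# Collision data: names starting with "Run", a four-digit year, an era letter
# and an underscore (assumed reading of "RunYYYYN_").
COLLISION_PREFIX = re.compile(r"^Run\d{4}[A-Z]_")


def derived_file_type(dataset_name, derived_type):
    """Return nanoaod-<type> for collision data, nanoaodsim-<type> otherwise."""
    base = "nanoaod" if COLLISION_PREFIX.match(dataset_name) else "nanoaodsim"
    return f"{base}-{derived_type}"
```

For example, `derived_file_type("Run2012C_SingleMu", "mytype")` is treated as collision data, while a simulated dataset name such as RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8 does not match the prefix, because "II" is not a year.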

katilp commented 10 months ago

The recids for the production code are

tpmccauley commented 1 month ago

Done