Closed tiborsimko closed 5 years ago
The current version of the scripts generates one .json record for the datasets and another with the configuration files produced by cmsDriver.
The complete records for the full release of 2012 MC datasets (~3000 entries) are here:
They are a bit big:
$ du -h *json
113M cms-simulated-datasets-2012-conffiles.json
22M cms-simulated-datasets-2012.json
A shorter example:
https://hpascoal.web.cern.ch/hpascoal/tmp/cms-2012-1dataset.json
"generation": {
"steps": [
{
"type": "RECO-HLT",
"release": "CMSSW_5_3_11_patch2",
"global_tag": "START53_V19E::All",
"configuration_files": [
{
"type": "cmsDriver script",
"script": "#!/bin/bash\nsource /cvmfs/cms.cern.ch/cmsset_default.sh\nexport SCRAM_ARCH=None\nif [ -r CMSSW_5_3_11_patch2/src ] ; then \n echo release CMSSW_5_3_11_patch2 already exists\nelse\nscram p CMSSW CMSSW_5_3_11_patch2\nfi\ncd CMSSW_5_3_11_patch2/src\neval `scram runtime -sh`\n\n\nscram b\ncd ../../\ncmsDriver.py step1 --filein \"dbs:/ADDdiLepton_LambdaT-1600_Tune4C_8TeV-pythia8/Summer12-START50_V13-v3/GEN-SIM\" --fileout file:EXO-Summer12DR53X-02487_step1.root --pileup_input \"dbs:/MinBias_TuneZ2star_8TeV-pythia6/Summer12-START50_V13-v3/GEN-SIM\" --mc --eventcontent RAWSIM --pileup 2012_Summer_50ns_PoissonOOTPU --datatier GEN-SIM-RAW --conditions START53_V19E::All --step DIGI,L1,DIGI2RAW,HLT:7E33v2 --python_filename EXO-Summer12DR53X-02487_1_cfg.py --no_exec --customise Configuration/DataProcessing/Utils.addMonitoring -n 360 || exit $? ; \n\ncmsDriver.py step2 --filein file:EXO-Summer12DR53X-02487_step1.root --fileout file:EXO-Summer12DR53X-02487.root --mc --eventcontent AODSIM,DQM --datatier AODSIM,DQM --conditions START53_V19E::All --step RAW2DIGI,L1Reco,RECO,VALIDATION:validation_prod,DQM:DQMOfflinePOGMC --python_filename EXO-Summer12DR53X-02487_2_cfg.py --no_exec --customise Configuration/DataProcessing/Utils.addMonitoring -n 360 || exit $? ; \n\n"
},
{
"title": "conffile",
"process": "HLT",
"conffileID": "1937ebea238cd2fc28f3c019b0eb54ae"
},
{
"title": "conffile",
"process": "RECO",
"conffileID": "1937ebea238cd2fc28f3c019b0f1dd0b"
}
]
},
{
"type": "GEN-SIM",
"release": "CMSSW_5_1_3",
"global_tag": "START50_V13::All",
"configuration_files": [
{
"title": "cmsDriver script",
"script": "#!/bin/bash\nsource /cvmfs/cms.cern.ch/cmsset_default.sh\nexport SCRAM_ARCH=None\nif [ -r CMSSW_5_1_3/src ] ; then \n echo release CMSSW_5_1_3 already exists\nelse\nscram p CMSSW CMSSW_5_1_3\nfi\ncd CMSSW_5_1_3/src\neval `scram runtime -sh`\ncurl -s https://raw.githubusercontent.com/cms-sw/genproductions/V02-01-22/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py --retry 2 --create-dirs -o Configuration/GenProduction/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py \n[ -s Configuration/GenProduction/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py ] || exit $?;\n\n\nscram b\ncd ../../\ncmsDriver.py Configuration/GenProduction/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py --fileout file:EXO-Summer12-01139.root --mc --eventcontent RAWSIM --pileup NoPileUp --datatier GEN-SIM --conditions START50_V13::All --beamspot Realistic8TeVCollision --step GEN,SIM --datamix NODATAMIXER --python_filename EXO-Summer12-01139_1_cfg.py --no_exec --customise Configuration/DataProcessing/Utils.addMonitoring -n 61 || exit $? ; \n\n"
},
{
"url": "https://raw.githubusercontent.com/cms-sw/genproductions/V02-01-22/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py",
"title": "Genfragment",
"script": "import FWCore.ParameterSet.Config as cms\n\ngenerator = cms.EDFilter(\"Pythia8GeneratorFilter\",\n comEnergy = cms.double(8000.0),\n crossSection = cms.untracked.double(1.435),\n filterEfficiency = cms.untracked.double(1),\n maxEventsToPrint = cms.untracked.int32(0),\n pythiaHepMCVerbosity = cms.untracked.bool(False),\n pythiaPylistVerbosity = cms.untracked.int32(0),\n\n PythiaParameters = cms.PSet(\n processParameters = cms.vstring(\n 'Main:timesAllowErrors = 10000',\n 'ParticleDecays:limitTau0 = on',\n 'ParticleDecays:tauMax = 10',\n 'Tune:pp 5',\n 'Tune:ee 3',\n 'PDF:pSet = 5',\n 'ExtraDimensionsLED:ffbar2llbar = on', \n 'ExtraDimensionsLED:gg2llbar = on', \n 'PhaseSpace:mHatMin = 1050',\n 'ExtraDimensionsLED:CutOffmode = 0',\n 'ExtraDimensionsLED:LambdaT = 1600'\n ),\n parameterSets = cms.vstring('processParameters')\n )\n)\n\nconfigurationMetadata = cms.untracked.PSet(\n version = cms.untracked.string('\\$Revision: 1.0 $'),\n name = cms.untracked.string('\\$Source: /cvs_server/repositories/CMSSW/CMSSW/Configuration/GenProduction/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py,v $'),\n annotation = cms.untracked.string('2012 sample with PYTHIA8 at 8 TeV: ADD Dilepton samples with LambdaT = 1600 GeV, Tune4C, pdf: MSTW 2008 LO')\n)\n"
},
{
"title": "Configuration file",
"process": "SIM",
"conffileID": "294fcd8902949eb73ba3813549dc621a"
}
]
}
],
"description": "<p>These data were processed in several steps:</p>"
},
"generator": {
"names": [
"pythia8"
],
"global_tag": "START50_V13::All"
}
The global_tag and release under system_details are the ones recommended for analysis. The global_tag and release under each step are the ones used for that particular step.
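To make the distinction concrete, here is a minimal sketch that lists the release and global tag recorded for each step. It assumes a record shaped like the excerpt above; the system_details values are placeholders (they are not shown in the excerpt), and summarize_steps is a hypothetical helper, not part of the existing scripts.

```python
# Trimmed-down record following the excerpt above.
# ASSUMPTION: the system_details values are placeholders for illustration.
record = {
    "system_details": {
        "release": "<recommended release>",
        "global_tag": "<recommended global tag>",
    },
    "generation": {
        "steps": [
            {
                "type": "RECO-HLT",
                "release": "CMSSW_5_3_11_patch2",
                "global_tag": "START53_V19E::All",
            },
            {
                "type": "GEN-SIM",
                "release": "CMSSW_5_1_3",
                "global_tag": "START50_V13::All",
            },
        ]
    },
}


def summarize_steps(rec):
    """Return (type, release, global_tag) for every generation step."""
    steps = rec.get("generation", {}).get("steps", [])
    return [(s.get("type"), s.get("release"), s.get("global_tag")) for s in steps]


# Per-step values describe how the data were produced, while
# system_details holds what is recommended for analysis.
for step_type, release, tag in summarize_steps(record):
    print(f"{step_type}: {release} / {tag}")
```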
Should we keep the global_tag under generator?
cc @ArtemisLav @katilp
If it is indeed not needed then I don't see the point in keeping it.
Currently, we have the generation and generator fields. Should we have generation.generators (list of strings) instead of generator.names?
What about the structure of each generation.step?
> Currently, we have the generation and generator fields. Should we have generation.generators (list of strings) instead of generator.names?
You mean merge the fields? This should be fine; we just have to make sure that generator is not also used in other record types where we wouldn't have a generation field, so that it doesn't simply disappear from there.
> What about the structure of each generation.step?
It looks good. The only thing I'm not that sure about is the conffileID.
> It looks good. The only thing I'm not that sure about is the conffileID.
We already use the name cms_confdb_id, so we should keep the same name here.
Note also that we'll need to have record IDs for configuration files for proper linking and searching. Hence we may want to store recid here. There are basically three options:
(a) store only cms_confdb_id here and make one search query to look up the referenced record ID for proper linking in the output template part;
(b) store only recid here and rely on the data-curation script to properly generate configuration file records with proper record IDs and ConfDB IDs (as was done for 2011 and 2012 open data releases);
(c) store both cms_confdb_id and recid here for extra safety (but also risk being open to inconsistencies should one of these change in the future -- which should be "never", since both are persistent IDs, but "never say never".)
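The three options can be compared with a small sketch. The in-memory dictionary stands in for the search query of option (a); build_confdb_index and resolve_recid are hypothetical helpers, and the recid value 1001 is made up for illustration. Only the ConfDB ID is taken from the excerpt above.

```python
def build_confdb_index(conffile_records):
    """Map cms_confdb_id -> recid from configuration file records."""
    return {rec["cms_confdb_id"]: rec["recid"] for rec in conffile_records}


def resolve_recid(entry, index):
    """Resolve the record ID for one configuration_files entry.

    Option (a): only cms_confdb_id is stored, so recid is looked up.
    Option (b): only recid is stored, so it is returned as-is.
    Option (c): both are stored, so they are cross-checked.
    """
    confdb_id = entry.get("cms_confdb_id")
    recid = entry.get("recid")
    looked_up = index.get(confdb_id) if confdb_id is not None else None
    if recid is not None and looked_up is not None and recid != looked_up:
        raise ValueError(
            f"recid {recid} does not match {looked_up} for {confdb_id}"
        )
    return recid if recid is not None else looked_up


# ASSUMPTION: recid 1001 is an invented value, not a real record ID.
conffile_records = [
    {"recid": 1001, "cms_confdb_id": "1937ebea238cd2fc28f3c019b0eb54ae"},
]
index = build_confdb_index(conffile_records)

option_a = {"cms_confdb_id": "1937ebea238cd2fc28f3c019b0eb54ae"}
option_b = {"recid": 1001}
option_c = {"cms_confdb_id": "1937ebea238cd2fc28f3c019b0eb54ae", "recid": 1001}

for entry in (option_a, option_b, option_c):
    print(resolve_recid(entry, index))
```

With option (c) the cross-check is what turns a silent inconsistency into an explicit error, at the cost of keeping two IDs in sync.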
> Should we have instead generation.generators (list of strings) instead of generator.names?
It sounds good to merge them; however, one could imagine having "generator" details stored under the proper "step", such as powheg and pythia in SIM. In other words, let us store details about each piece of software used in each concrete generation step. (Step LHE, software S1, environment E1, database tag T1, parameters P1 and P2; Step SIM, software S2, environment E2, database tag T2, parameters P3, etc.) That comes closest to storing reproducible information about each step.
This also requires an update to the schema.
Schema decided, cms_confdb_id implemented, recid still to be done, generators moved under step.
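As a rough illustration of these changes, the sketch below converts a record from the old layout to the new one. It is not the actual curation script: migrate_record is a hypothetical helper, the rule for picking which steps receive the generator names is an assumption, and recid is left out because it is still to be done.

```python
import copy


def migrate_record(record):
    """Return a copy of record with the field changes discussed above."""
    new = copy.deepcopy(record)
    steps = new.get("generation", {}).get("steps", [])

    # Rename conffileID to cms_confdb_id in every configuration file entry.
    for step in steps:
        for conf in step.get("configuration_files", []):
            if "conffileID" in conf:
                conf["cms_confdb_id"] = conf.pop("conffileID")

    # Move the top-level generator names under the generating steps.
    # ASSUMPTION: steps whose type contains "GEN" are the ones that ran
    # the generators. Dropping the top-level field also drops
    # generator.global_tag, which the thread suggests is not needed.
    generator = new.pop("generator", None)
    names = generator.get("names", []) if generator else []
    for step in steps:
        if names and "GEN" in step.get("type", ""):
            step["generators"] = list(names)
    return new


old = {
    "generation": {
        "steps": [
            {
                "type": "RECO-HLT",
                "configuration_files": [
                    {
                        "title": "conffile",
                        "process": "HLT",
                        "conffileID": "1937ebea238cd2fc28f3c019b0eb54ae",
                    }
                ],
            },
            {"type": "GEN-SIM", "configuration_files": []},
        ]
    },
    "generator": {"names": ["pythia8"], "global_tag": "START50_V13::All"},
}
new = migrate_record(old)
```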
The steps will be included under the methodology field.
We should agree on the new generation field schema. See one of the old proposals:
@heitorPB Can you please post an excerpt of the JSON you produced for 2012/2015?