RFC data model generation field schema

tiborsimko commented 5 years ago

We should agree on the new generation field schema. See one of the old proposals:

https://github.com/cernopendata/data-curation/blob/master/cod2-to-cod3/docs/MARCtoJSON_fixes_567-Generation.json

@heitorPB Can you please post an excerpt of the JSON you produced for 2012/2015?

heitorPB commented 5 years ago

The current version of the scripts generates one .json record for the datasets and one for the configuration files produced by cmsDriver.

The complete records for the full release of 2012 MC datasets (~3000 entries) are here:

They are a bit big:

$ du -h *json
113M    cms-simulated-datasets-2012-conffiles.json
22M     cms-simulated-datasets-2012.json

heitorPB commented 5 years ago

A shorter example:

https://hpascoal.web.cern.ch/hpascoal/tmp/cms-2012-1dataset.json

  "generation": {
    "steps": [
      {
        "type": "RECO-HLT",
        "release": "CMSSW_5_3_11_patch2",
        "global_tag": "START53_V19E::All",
        "configuration_files": [
          {
            "type": "cmsDriver script",
            "script": "#!/bin/bash\nsource /cvmfs/cms.cern.ch/cmsset_default.sh\nexport SCRAM_ARCH=None\nif [ -r CMSSW_5_3_11_patch2/src ] ; then \n echo release CMSSW_5_3_11_patch2 already exists\nelse\nscram p CMSSW CMSSW_5_3_11_patch2\nfi\ncd CMSSW_5_3_11_patch2/src\neval `scram runtime -sh`\n\n\nscram b\ncd ../../\ncmsDriver.py step1 --filein \"dbs:/ADDdiLepton_LambdaT-1600_Tune4C_8TeV-pythia8/Summer12-START50_V13-v3/GEN-SIM\" --fileout file:EXO-Summer12DR53X-02487_step1.root --pileup_input \"dbs:/MinBias_TuneZ2star_8TeV-pythia6/Summer12-START50_V13-v3/GEN-SIM\" --mc --eventcontent RAWSIM --pileup 2012_Summer_50ns_PoissonOOTPU --datatier GEN-SIM-RAW --conditions START53_V19E::All --step DIGI,L1,DIGI2RAW,HLT:7E33v2 --python_filename EXO-Summer12DR53X-02487_1_cfg.py --no_exec --customise Configuration/DataProcessing/Utils.addMonitoring -n 360 || exit $? ; \n\ncmsDriver.py step2 --filein file:EXO-Summer12DR53X-02487_step1.root --fileout file:EXO-Summer12DR53X-02487.root --mc --eventcontent AODSIM,DQM --datatier AODSIM,DQM --conditions START53_V19E::All --step RAW2DIGI,L1Reco,RECO,VALIDATION:validation_prod,DQM:DQMOfflinePOGMC --python_filename EXO-Summer12DR53X-02487_2_cfg.py --no_exec --customise Configuration/DataProcessing/Utils.addMonitoring -n 360 || exit $? ; \n\n"
          },
          {
            "title": "conffile",
            "process": "HLT",
            "conffileID": "1937ebea238cd2fc28f3c019b0eb54ae"
          },
          {
            "title": "conffile",
            "process": "RECO",
            "conffileID": "1937ebea238cd2fc28f3c019b0f1dd0b"
          }
        ]
      },
      {
        "type": "GEN-SIM",
        "release": "CMSSW_5_1_3",
        "global_tag": "START50_V13::All",
        "configuration_files": [
          {
            "title": "cmsDriver script",
            "script": "#!/bin/bash\nsource /cvmfs/cms.cern.ch/cmsset_default.sh\nexport SCRAM_ARCH=None\nif [ -r CMSSW_5_1_3/src ] ; then \n echo release CMSSW_5_1_3 already exists\nelse\nscram p CMSSW CMSSW_5_1_3\nfi\ncd CMSSW_5_1_3/src\neval `scram runtime -sh`\ncurl  -s https://raw.githubusercontent.com/cms-sw/genproductions/V02-01-22/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py --retry 2 --create-dirs -o  Configuration/GenProduction/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py \n[ -s Configuration/GenProduction/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py ] || exit $?;\n\n\nscram b\ncd ../../\ncmsDriver.py Configuration/GenProduction/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py --fileout file:EXO-Summer12-01139.root --mc --eventcontent RAWSIM --pileup NoPileUp --datatier GEN-SIM --conditions START50_V13::All --beamspot Realistic8TeVCollision --step GEN,SIM --datamix NODATAMIXER --python_filename EXO-Summer12-01139_1_cfg.py --no_exec --customise Configuration/DataProcessing/Utils.addMonitoring -n 61 || exit $? ; \n\n"
          },
          {
            "url": "https://raw.githubusercontent.com/cms-sw/genproductions/V02-01-22/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py",
            "title": "Genfragment",
            "script": "import FWCore.ParameterSet.Config as cms\n\ngenerator = cms.EDFilter(\"Pythia8GeneratorFilter\",\n   comEnergy = cms.double(8000.0),\n   crossSection = cms.untracked.double(1.435),\n   filterEfficiency = cms.untracked.double(1),\n   maxEventsToPrint = cms.untracked.int32(0),\n   pythiaHepMCVerbosity = cms.untracked.bool(False),\n   pythiaPylistVerbosity = cms.untracked.int32(0),\n\n   PythiaParameters = cms.PSet(\n      processParameters = cms.vstring(\n         'Main:timesAllowErrors    = 10000',\n         'ParticleDecays:limitTau0 = on',\n         'ParticleDecays:tauMax = 10',\n         'Tune:pp 5',\n         'Tune:ee 3',\n         'PDF:pSet = 5',\n         'ExtraDimensionsLED:ffbar2llbar = on', \n         'ExtraDimensionsLED:gg2llbar = on', \n         'PhaseSpace:mHatMin = 1050',\n         'ExtraDimensionsLED:CutOffmode = 0',\n         'ExtraDimensionsLED:LambdaT = 1600'\n      ),\n      parameterSets = cms.vstring('processParameters')\n   )\n)\n\nconfigurationMetadata = cms.untracked.PSet(\n   version = cms.untracked.string('\\$Revision: 1.0 $'),\n   name = cms.untracked.string('\\$Source: /cvs_server/repositories/CMSSW/CMSSW/Configuration/GenProduction/python/EightTeV/ADD_Dilepton_LambdaT_1600_8TeV_pythia8_cff.py,v $'),\n   annotation = cms.untracked.string('2012 sample with PYTHIA8 at 8 TeV: ADD Dilepton samples with LambdaT = 1600 GeV, Tune4C, pdf: MSTW 2008 LO')\n)\n"
          },
          {
            "title": "Configuration file",
            "process": "SIM",
            "conffileID": "294fcd8902949eb73ba3813549dc621a"
          }
        ]
      }
    ],
    "description": "<p>These data were processed in several steps:</p>"
  },
  "generator": {
    "names": [
      "pythia8"
    ],
    "global_tag": "START50_V13::All"
  }

heitorPB commented 5 years ago

The global_tag and release under system_details are the ones recommended for analysis. The global_tag and release under each step are the ones used for that particular step.
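To make the per-step reading concrete, here is a minimal sketch that lists the release and global_tag recorded for each step. It uses a trimmed copy of the excerpt above (scripts and configuration files omitted); the helper name `step_environments` is only an illustration and is not part of the curation scripts.

```python
import json

# Trimmed copy of the excerpt above; scripts and configuration files omitted.
RECORD = json.loads("""
{
  "generation": {
    "steps": [
      {"type": "RECO-HLT", "release": "CMSSW_5_3_11_patch2",
       "global_tag": "START53_V19E::All"},
      {"type": "GEN-SIM", "release": "CMSSW_5_1_3",
       "global_tag": "START50_V13::All"}
    ]
  }
}
""")


def step_environments(record):
    """Return (type, release, global_tag) for every generation step."""
    steps = record.get("generation", {}).get("steps", [])
    return [
        (step.get("type"), step.get("release"), step.get("global_tag"))
        for step in steps
    ]


for step_type, release, global_tag in step_environments(RECORD):
    print(f"{step_type}: {release} with {global_tag}")
```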

Should we keep the global_tag under generator? cc @ArtemisLav @katilp

ArtemisLav commented 5 years ago

If it is indeed not needed then I don't see the point in keeping it.

heitorPB commented 5 years ago

Currently, we have the generation and generator fields. Should we have generation.generators (list of strings) instead of generator.names?

What about the structure of each generation.step?

ArtemisLav commented 5 years ago

> Currently, we have the generation and generator fields. Should we have generation.generators (list of strings) instead of generator.names?

You mean merge the fields? This should be fine; we just have to make sure that generator is not also used in other record types where we wouldn't have a generation field, so that it doesn't disappear from there.

> What about the structure of each generation.step?

It looks good. The only thing I'm not that sure about is the conffileID.

tiborsimko commented 5 years ago

> It looks good. The only thing I'm not that sure about is the conffileID.

We already use the name cms_confdb_id elsewhere, so we should keep the same name here.

Note also that we'll need to have record IDs for configuration files for proper linking and searching. Hence we may want to store recid here. There are basically three options:

(a) store only cms_confdb_id here and make one search query to look up the referenced record ID for proper linking in the output template part;
(b) store only recid here and rely on the data-curation script to properly generate configuration file records with proper record IDs and ConfDB IDs (as was done for 2011 and 2012 open data releases);
(c) store both cms_confdb_id and recid here for extra safety (but also risk being open to inconsistencies should one of these change in the future -- which should be "never", since both are persistent IDs, but "never say never".)
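The difference between options (a) and (c) can be sketched as follows. This is only an illustration: the in-memory dictionary stands in for the search query, the recid values are made up, and the function names are hypothetical.

```python
# Hypothetical index of configuration file records: cms_confdb_id -> recid.
# For option (a) this mapping would come from a search query; the recid
# values below are invented for the example.
CONFFILE_INDEX = {
    "1937ebea238cd2fc28f3c019b0eb54ae": 9001,
    "1937ebea238cd2fc28f3c019b0f1dd0b": 9002,
}


def resolve_recid(entry, index):
    """Option (a): only cms_confdb_id is stored; look up the recid."""
    return index.get(entry["cms_confdb_id"])


def ids_are_consistent(entry, index):
    """Option (c): both IDs are stored; check that they still agree."""
    return index.get(entry["cms_confdb_id"]) == entry["recid"]


entry_a = {"process": "HLT",
           "cms_confdb_id": "1937ebea238cd2fc28f3c019b0eb54ae"}
entry_c = {"process": "RECO",
           "cms_confdb_id": "1937ebea238cd2fc28f3c019b0f1dd0b",
           "recid": 9002}

print(resolve_recid(entry_a, CONFFILE_INDEX))
print(ids_are_consistent(entry_c, CONFFILE_INDEX))
```

Option (b) needs no lookup at display time, since the recid is stored directly, but it shifts the responsibility for correctness to the data-curation script.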

> Should we have generation.generators (list of strings) instead of generator.names?

Merging them sounds good. However, one could imagine storing the "generator" details under the proper step, such as powheg and pythia in SIM. In other words, let us store details about each piece of software used in each concrete generation step. (Step LHE: software S1, environment E1, database tag T1, parameters P1 and P2; step SIM: software S2, environment E2, database tag T2, parameters P3; etc.) This seems closest to storing reproducible information about each step.
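A rough sketch of what such a reshaping could look like for an existing record is given below. It assumes a per-step key called `generators` and a caller-supplied mapping from step type to generator names; both are assumptions made for illustration, as is dropping the top-level generator field, which would only be safe once other record types no longer rely on it.

```python
import copy


def move_generators_under_steps(record, generators_by_step):
    """Return a copy of the record with generator names stored per step.

    `generators_by_step` maps a step type to a list of generator names.
    The per-step key `generators` is an assumed name, not a decided one.
    """
    new = copy.deepcopy(record)
    # Assumption: the top-level field is dropped once its content is moved.
    new.pop("generator", None)
    for step in new.get("generation", {}).get("steps", []):
        names = generators_by_step.get(step.get("type"))
        if names:
            step["generators"] = list(names)
        # Rename conffileID to the name already in use, cms_confdb_id.
        for conf in step.get("configuration_files", []):
            if "conffileID" in conf:
                conf["cms_confdb_id"] = conf.pop("conffileID")
    return new


old = {
    "generation": {"steps": [
        {"type": "RECO-HLT", "configuration_files": [
            {"title": "conffile", "process": "HLT",
             "conffileID": "1937ebea238cd2fc28f3c019b0eb54ae"}]},
        {"type": "GEN-SIM", "configuration_files": [
            {"title": "Configuration file", "process": "SIM",
             "conffileID": "294fcd8902949eb73ba3813549dc621a"}]},
    ]},
    "generator": {"names": ["pythia8"], "global_tag": "START50_V13::All"},
}

new = move_generators_under_steps(old, {"GEN-SIM": ["pythia8"]})
print(new["generation"]["steps"][1]["generators"])
```

The input record is left untouched, so the old and new shapes can be compared side by side while the schema is still under discussion.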

ArtemisLav commented 5 years ago

This also requires an update to the schema.

tiborsimko commented 5 years ago

Schema decided, cms_confdb_id implemented, recid still to be done, generators moved under step. The steps will be included under the methodology field.

cernopendata / opendata.cern.ch

RFC data model generation field schema #2465