cernopendata / data-curation

Data ingestion and curation tools
GNU General Public License v2.0

CMS: updates for the MC provenance query 2016 #182

Open katilp opened 1 year ago

katilp commented 1 year ago

The current script gets the provenance information as follows

Starting from the UL processing, the processing scheme has changed (there are no input datasets before AODSIM, as they were transient), so this no longer works.

The query flow should be changed to go directly to the chain:

For an example dataset /ADDmonoPhoton_MD-1_d-3_TuneCP5_13TeV-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM:

On the web GUI:

Query by the output file name:

https://cms-pdmv.cern.ch/mcm/requests?produce=%2FADDmonoPhoton_MD-1_d-3_TuneCP5_13TeV-pythia8%2FRunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2%2FNANOAODSIM&page=0&shown=140737488355327

https://cms-pdmv.cern.ch/mcm/chained_requests?contains=EXO-RunIISummer20UL16NanoAODv9-00205&page=0

Then query each request in the chain and get the dicts from the respective pages.
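
The two web GUI queries above can also be assembled programmatically. This is a minimal sketch that assumes only the URL patterns quoted above; the helper names `requests_url` and `chained_requests_url` are made up for illustration, and nothing is fetched here (access to McM may require CERN authentication):

```python
from urllib.parse import urlencode

MCM_BASE = "https://cms-pdmv.cern.ch/mcm"  # base of the URLs quoted above


def requests_url(dataset):
    """Build the McM 'requests' query for a given output dataset name."""
    # The "shown" value is copied verbatim from the URL quoted above.
    query = {"produce": dataset, "page": 0, "shown": 140737488355327}
    return f"{MCM_BASE}/requests?{urlencode(query)}"


def chained_requests_url(prep_id):
    """Build the McM 'chained_requests' query for a given prep ID."""
    query = {"contains": prep_id, "page": 0}
    return f"{MCM_BASE}/chained_requests?{urlencode(query)}"


dataset = (
    "/ADDmonoPhoton_MD-1_d-3_TuneCP5_13TeV-pythia8"
    "/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM"
)
print(requests_url(dataset))
print(chained_requests_url("EXO-RunIISummer20UL16NanoAODv9-00205"))
```

`urlencode` percent-encodes the slashes in the dataset name, so the printed URLs should have the same form as the ones pasted above.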

On the command line

Using prep_id from DAS:

$ dasgoclient -query="dataset=/ADDmonoPhoton_MD-1_d-3_TuneCP5_13TeV-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM" -json | jq '.[].dataset[].prep_id'
"EXO-RunIISummer20UL16NanoAODv9-00205"
"EXO-RunIISummer20UL16NanoAODv9-00205"
null
"EXO-RunIISummer20UL16NanoAODv9-00205"

katilp commented 1 year ago

Start with example datasets:

/ADDmonoPhoton_MD-1_d-3_TuneCP5_13TeV-pythia8/RunIISummer20UL16NanoAODv9-106X_mcRun2_asymptotic_v17-v2/NANOAODSIM
/BBH_HToJPsiG_JPsiToMuMu_TuneCP5_13TeV-madgraph-pythia8/RunIISummer20UL16MiniAODv2-106X_mcRun2_asymptotic_v17-v1/MINIAODSIM

Expected changes in the scripts:

Called from interface.py:

lhe_generators.py is called separately (see e.g. 2015 readme):

katilp commented 11 months ago

For the record, the DIGIPremix step has a 22 MB config file containing the list of files in the pile-up Premix datasets. For the two test datasets that I use, these files differ only in the request names and the number of events:

$ ls -l inputs/config-store/
total 44823
-rw-r--r--. 1 kati zh     5917 Oct 19 14:29 086c69c1b826c78c43be2aa70d80e01e.configFile
-rw-r--r--. 1 kati zh     8671 Oct 19 14:29 160526781ab6242177672ffc68eb5568.configFile
-rw-r--r--. 1 kati zh     4319 Oct 19 14:29 481ced9502ea985a73dc7bca8c9ea7a9.configFile
-rw-r--r--. 1 kati zh     4349 Oct 19 14:29 528bf7046404f48fa330df88a6a92123.configFile
-rw-r--r--. 1 kati zh 22907660 Oct 19 14:29 528bf7046404f48fa330df88a6a9594b.configFile
-rw-r--r--. 1 kati zh     4521 Oct 19 14:29 528bf7046404f48fa330df88a6a99098.configFile
-rw-r--r--. 1 kati zh     4850 Oct 19 14:29 528bf7046404f48fa330df88a6a9a53b.configFile
-rw-r--r--. 1 kati zh     9324 Oct 19 14:29 70368b76504c9adbeb8bd6f29a1b6dee.configFile
-rw-r--r--. 1 kati zh    11520 Oct 19 14:29 80266517fa91333a47ed2d1cc3eeddf0.configFile
-rw-r--r--. 1 kati zh    12957 Oct 19 14:29 c8dc83abb237e289eae3cfefea871409.configFile
-rw-r--r--. 1 kati zh     4349 Oct 19 14:29 edf4aef02c2af29980365f11a8f78f77.configFile
-rw-r--r--. 1 kati zh 22907660 Oct 19 14:29 edf4aef02c2af29980365f11a8faa478.configFile
-rw-r--r--. 1 kati zh     4521 Oct 19 14:29 edf4aef02c2af29980365f11a8fade0c.configFile
-rw-r--r--. 1 kati zh     4850 Oct 19 14:29 edf4aef02c2af29980365f11a8fbd0b0.configFile

with

-bash-4.2$ diff inputs/config-store/528bf7046404f48fa330df88a6a9594b.configFile inputs/config-store/edf4aef02c2af29980365f11a8faa478.configFile
5c5
< # with command line options: --python_filename TOP-RunIISummer20UL16DIGIPremix-00281_1_cfg.py --eventcontent PREMIXRAW --customise Configuration/DataProcessing/Utils.addMonitoring --datatier GEN-SIM-DIGI --fileout file:TOP-RunIISummer20UL16DIGIPremix-00281.root --pileup_input dbs:/Neutrino_E-10_gun/RunIISummer20ULPrePremix-UL16_106X_mcRun2_asymptotic_v13-v1/PREMIX --conditions 106X_mcRun2_asymptotic_v13 --step DIGI,DATAMIX,L1,DIGI2RAW --procModifiers premix_stage2 --nThreads 4 --geometry DB:Extended --filein file:TOP-RunIISummer20UL16SIM-00281.root --datamix PreMix --era Run2_2016 --runUnscheduled --no_exec --mc -n 5807
---
> # with command line options: --python_filename TOP-RunIISummer20UL16DIGIPremix-00291_1_cfg.py --eventcontent PREMIXRAW --customise Configuration/DataProcessing/Utils.addMonitoring --datatier GEN-SIM-DIGI --fileout file:TOP-RunIISummer20UL16DIGIPremix-00291.root --pileup_input dbs:/Neutrino_E-10_gun/RunIISummer20ULPrePremix-UL16_106X_mcRun2_asymptotic_v13-v1/PREMIX --conditions 106X_mcRun2_asymptotic_v13 --step DIGI,DATAMIX,L1,DIGI2RAW --procModifiers premix_stage2 --nThreads 4 --geometry DB:Extended --filein file:TOP-RunIISummer20UL16SIM-00291.root --datamix PreMix --era Run2_2016 --runUnscheduled --no_exec --mc -n 5081
29c29
<     input = cms.untracked.int32(5807)
---
>     input = cms.untracked.int32(5081)
35c35
<     fileNames = cms.untracked.vstring('file:TOP-RunIISummer20UL16SIM-00281.root'),
---
>     fileNames = cms.untracked.vstring('file:TOP-RunIISummer20UL16SIM-00291.root'),
64c64
<     annotation = cms.untracked.string('--python_filename nevts:5807'),
---
>     annotation = cms.untracked.string('--python_filename nevts:5081'),
76c76
<     fileName = cms.untracked.string('file:TOP-RunIISummer20UL16DIGIPremix-00281.root'),
---
>     fileName = cms.untracked.string('file:TOP-RunIISummer20UL16DIGIPremix-00291.root'),

This is a 22 MB file; if it is stored for each of the 40k MC datasets, it will take about 880 GB of disk space, so we should handle it differently.
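
The estimate above can be reproduced with the exact file size from the directory listing. This is a rough sketch that takes the 40k dataset count from the comment and assumes one copy of the large config per dataset:

```python
config_size_bytes = 22907660  # size of the large DIGIPremix config in the listing above
n_datasets = 40_000           # "40k MC datasets", as quoted above

total_bytes = config_size_bytes * n_datasets
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"{total_gb:.0f} GB")   # prints "916 GB"
```

The exact size gives about 916 GB; rounding the file down to 22 MB gives the 880 GB quoted above. Either way it is close to a terabyte of nearly identical content.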

katilp commented 11 months ago

To do:

katilp commented 8 months ago

Updates to LHE generator search

Check which inputs are passed to the job in runcmsgrid.sh

Reminder:

tar -tvf <gridpack name>.tgz: lists the contents of the archive
tar -xf <gridpack name>.tgz: extracts all files
tar -xf <gridpack name>.tgz <file name>: extracts one file only
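
The same operations are available from Python through the standard `tarfile` module, which may be handy if the check is done inside a script. This is a minimal sketch: the helper names `list_members` and `read_member` are made up for illustration, and the archive is a tiny stand-in built on the spot, not a real gridpack:

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_members(archive):
    """Like `tar -tf`: return the names of all members in the archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


def read_member(archive, name):
    """Read one member into memory without unpacking the whole archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.extractfile(name).read()


# Build a small stand-in archive so the example is self-contained
workdir = Path(tempfile.mkdtemp())
archive = workdir / "example_gridpack.tgz"
script = b"#!/bin/bash\necho placeholder\n"
with tarfile.open(archive, "w:gz") as tar:
    info = tarfile.TarInfo(name="runcmsgrid.sh")
    info.size = len(script)
    tar.addfile(info, io.BytesIO(script))

print(list_members(archive))
print(read_member(archive, "runcmsgrid.sh").decode())
```

Member names in a real gridpack may carry a leading `./`, so it is worth checking the listing before reading a specific file.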

Note: