UFLX2MuMu / Ntupliser


Prod-v18.2.X #114

Open bortigno opened 4 years ago

bortigno commented 4 years ago

Issue tracking the v18.2.X production/releases. Continues from #68 and #66

Working on branch prod-v18.2.X

bortigno commented 4 years ago
git tag -m "Test production 18.2.X" prod-v18.2.0
python crab/make_crab_script.py -s SingleMu_2018A H2Mu_gg ZJets_MG_1
# created the dir "/afs/cern.ch/user/b/bortigno/workspace/x2mm18_10211p1/src/Ntupliser/DiMuons/crab_2019_11_29_17_54-prod-v18.2.0"

Now testing

crab submit -c ./crab_2019_11_29_17_54-prod-v18.2.0/configs/H2Mu_gg.py --dryrun

Successful.

crab proceed -d logs/crab_H2Mu_gg_2019_11_29_17_54_prod2018_prod-v18p2p0
./crab_2019_11_29_17_54-prod-v18.2.0/submit_all.sh 
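
For reference, the per-sample files under configs/ written by make_crab_script.py are standard CRAB3 Python configurations. A minimal sketch of what such a file looks like is below; the parameter values (dataset, splitting, output path, site) are illustrative and not the literal output of the script.

# Minimal CRAB3 config sketch (illustrative values, not the exact output of make_crab_script.py)
from CRABClient.UserUtilities import config
config = config()

config.General.requestName     = 'H2Mu_gg_2019_11_29_17_54_prod2018_prod-v18p2p0'
config.General.workArea        = 'logs'
config.General.transferOutputs = True

config.JobType.pluginName = 'Analysis'
config.JobType.psetName   = 'crab_2019_11_29_17_54-prod-v18.2.0/analyzers/H2Mu_gg.py'

config.Data.inputDataset  = '/GluGluHToMuMu_M125_TuneCP5_PSweights_13TeV_amcatnloFXFX_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM'  # illustrative
config.Data.inputDBS      = 'global'
config.Data.splitting     = 'FileBased'   # the splitting mode actually used is chosen by the script
config.Data.unitsPerJob   = 1
config.Data.outLFNDirBase = '/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.2.0'

config.Site.storageSite = 'T2_CH_CERN'    # illustrative; must match the EOS user area used above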
bortigno commented 4 years ago

Checking the status of the jobs, I get a strange error:

crab status -d logs/crab_SingleMu_2018A_2019_11_29_17_54_prod2018_prod-v18p2p0
CRAB project directory:     /afs/cern.ch/work/b/bortigno/x2mm18_10211p1/src/Ntupliser/DiMuons/logs/crab_SingleMu_2018A_2019_11_29_17_54_prod2018_prod-v18p2p0
Task name:          191129_174623:bortigno_crab_SingleMu_2018A_2019_11_29_17_54_prod2018_prod-v18p2p0
Grid scheduler - Task Worker:   N/A yet - crab-prod-tw01
Status on the CRAB server:  SUBMITFAILED
Task URL to use for HELP:   https://cmsweb.cern.ch/crabserver/ui/task/191129_174623%3Abortigno_crab_SingleMu_2018A_2019_11_29_17_54_prod2018_prod-v18p2p0
Dashboard monitoring URL:   https://monit-grafana.cern.ch/d/cmsTMDetail/cms-task-monitoring-task-view?orgId=11&var-user=bortigno&var-task=191129_174623%3Abortigno_crab_SingleMu_2018A_2019_11_29_17_54_prod2018_prod-v18p2p0
In case of issues with the dashboard, please provide feedback to hn-cms-computing-tools@cern.ch
Failure message from server:    Problem handling 191129_174623:bortigno_crab_SingleMu_2018A_2019_11_29_17_54_prod2018_prod-v18p2p0 because of (6, 'Could not resolve: cms-cric.cern.ch (Successful completion)') failure, traceback follows
                Traceback (most recent call last):
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/TaskWorker/Actions/Handler.py", line 77, in executeAction
                    output = work.execute(nextinput, task=self._task, tempDir=self.tempDir)
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 99, in execute
                    result = self.executeInternal(*args, **kwargs)
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 259, in executeInternal
                    tempDir = kwargs['tempDir'])
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/TaskWorker/Actions/DataDiscovery.py", line 72, in formatOutput
                    wmfile['locations'] = resourceCatalog.PNNstoPSNs(locations[wmfile['block']])
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/WMCore/Services/CRIC/CRIC.py", line 154, in PNNstoPSNs
                    mapping = self._CRICSiteQuery(callname='data-processing')
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/WMCore/Services/CRIC/CRIC.py", line 86, in _CRICSiteQuery
                    sitenames = self._getResult(uri, callname=callname, args=extraArgs)
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/WMCore/Services/CRIC/CRIC.py", line 59, in _getResult
                    data = self.refreshCache(cachedApi, apiUrl)
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/WMCore/Services/Service.py", line 205, in refreshCache
                    self.getData(cachefile, url, inputdata, incoming_headers, encoder, decoder, verb, contentType)
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/WMCore/Services/Service.py", line 282, in getData
                    contentType=contentType)
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/WMCore/Services/Requests.py", line 150, in makeRequest
                    encoder, decoder, contentType)
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/WMCore/Services/Requests.py", line 175, in makeRequest_pycurl
                    ckey=ckey, cert=cert, capath=capath, decode=decoder)
                  File "/data/srv/TaskManager/3.3.1911.rc3.patch1/slc7_amd64_gcc630/cms/crabtaskworker/3.3.1911.rc3.patch1/lib/python2.7/site-packages/WMCore/Services/pycurl_manager.py", line 235, in request
                    curl.perform()
                error: (6, 'Could not resolve: cms-cric.cern.ch (Successful completion)')

Log file is /afs/cern.ch/work/b/bortigno/x2mm18_10211p1/src/Ntupliser/DiMuons/logs/crab_SingleMu_2018A_2019_11_29_17_54_prod2018_prod-v18p2p0/crab.log

Will continue tomorrow.

bortigno commented 4 years ago

The GluGluH sample partially succeeded, while the others didn't.

Looking at the job exit codes of the ZJets jobs in Grafana, most of them come out with 50664 (Application terminated by wrapper because it used too much wall clock time), so I need to reduce the number of events per job (and maybe later check whether anything is taking too much time).

git commit -m "lower the number of files per jobs after 'too much wall-time' error occurred in test run" crab/make_crab_script.py
  [prod-v18.2.X 0750fe7] lower the number of files per jobs after 'too much wall-time' error occurred in test run
   1 file changed, 1 insertion(+), 1 deletion(-)
python crab/make_crab_script.py -s SingleMu_2018A H2Mu_gg ZJets_MG_1
./crab_2019_12_03_14_54-prod-v18.2.0-1-g0750fe7/submit_all.sh 

Submitting the jobs was successful. Checking the status now returns "disk quota exceeded"! OK, in the meantime I am fixing the check-all part of the script (to be ported to 2016 and 2017).

git commit -m "BUGFIX: fix check-all part of the script" crab/make_crab_script.py
[prod-v18.2.X 445c468] BUGFIX: fix check-all part of the script

The problem is the number of files, not the space:

for d in /eos/cms/store/user/bortigno/*; do echo ${d}; find ${d} -type f | wc -l; done
/eos/cms/store/user/bortigno/copythis.txt
1
/eos/cms/store/user/bortigno/crab3checkwrite_20171023_180825
0
/eos/cms/store/user/bortigno/forRegression
20
/eos/cms/store/user/bortigno/fromJakobs
6
/eos/cms/store/user/bortigno/genproduction
2
/eos/cms/store/user/bortigno/h2mm
8934
/eos/cms/store/user/bortigno/l1dpg
12
/eos/cms/store/user/bortigno/MadGraph
90
/eos/cms/store/user/bortigno/mc_genproduction
0
/eos/cms/store/user/bortigno/x2mumu_histos
369
/eos/cms/store/user/bortigno/Zd150Ntuples
566
for d in /eos/cms/store/user/bortigno/h2mm/ntuples/2016/94X_v3/*; do echo ${d}; find ${d} -type f | wc -l; done
/eos/cms/store/user/bortigno/h2mm/ntuples/2016/94X_v3/prod-v16.0.7.ext.skim3l
0
/eos/cms/store/user/bortigno/h2mm/ntuples/2016/94X_v3/prod-v16.0.7.ext.skim3l-1-g9b59adb
157
/eos/cms/store/user/bortigno/h2mm/ntuples/2016/94X_v3/prod-v16.0.7.skim3l
0
/eos/cms/store/user/bortigno/h2mm/ntuples/2016/94X_v3/prod-v16.2.0
315
/eos/cms/store/user/bortigno/h2mm/ntuples/2016/94X_v3/STR
4233
for d in /eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.*; do echo ${d}; find ${d} -type f | wc -l; done
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l
3557
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l-1-g01cc34c
650
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l-2-g069120f
0
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.2.0
22
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.2.0-1-g0750fe7
0
for d in /eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/*; do echo ${d}; find ${d} -type f | wc -l; done
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/DYJetsToLL_M-105To160_TuneCP5_PSweights_13TeV-amcatnloFXFX-pythia8
125
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/DYJetsToLL_M-105To160_TuneCP5_PSweights_13TeV-madgraphMLM-pythia8
162
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/DYJetsToLL_M-50_TuneCP5_13TeV-madgraphMLM-pythia8
163
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/GluGluHToMuMu_M125_TuneCP5_PSweights_13TeV_amcatnloFXFX_pythia8
16
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/GluGluToContinToZZTo2e2mu_13TeV_MCFM701_pythia8
4
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/GluGluToContinToZZTo2e2tau_13TeV_MCFM701_pythia8
4
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/GluGluToContinToZZTo2mu2nu_13TeV_MCFM701_pythia8
1
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/GluGluToContinToZZTo2mu2tau_13TeV_MCFM701_pythia8
5
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/GluGluToContinToZZTo4mu_13TeV_MCFM701_pythia8
10
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/GluGluToContinToZZTo4tau_13TeV_MCFM701_pythia8
11
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/SingleMuon
2277
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ST_tW_antitop_5f_NoFullyHadronicDecays_TuneCP5_13TeV-powheg-pythia8
6
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ST_tW_top_5f_NoFullyHadronicDecays_TuneCP5_13TeV-powheg-pythia8
7
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/THQ_4f_Hincl_13TeV_madgraph_pythia8
22
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/THW_5f_Hincl_13TeV_madgraph_pythia8
21
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/TT_DiLept_TuneCP5_13TeV-amcatnlo-pythia8
16
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ttHToMuMu_M120_TuneCP5_PSweights_13TeV-powheg-pythia8
5
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ttHToMuMu_M125_TuneCP5_PSweights_13TeV-powheg-pythia8
6
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ttHToMuMu_M130_TuneCP5_PSweights_13TeV-powheg-pythia8
6
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ttHToNonbb_M125_TuneCP5_13TeV-powheg-pythia8
15
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/TTJets_DiLept_TuneCP5_13TeV-madgraphMLM-pythia8
169
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8
191
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/TTWJetsToLNu_TuneCP5_13TeV-amcatnloFXFX-madspin-pythia8
2
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/TTWW_TuneCP5_13TeV-madgraph-pythia8
3
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/TTZToLL_M-1to10_TuneCP5_13TeV-amcatnlo-pythia8
4
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/TTZToLLNuNu_M-10_TuneCP5_13TeV-amcatnlo-pythia8
4
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/VBFHToMuMu_M120_TuneCP5_PSweights_13TeV_amcatnlo_pythia8
4
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/VBFHToMuMu_M125_TuneCP5_PSweights_13TeV_amcatnlo_pythia8
5
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/VBFHToMuMu_M130_TuneCP5_PSweights_13TeV_amcatnlo_pythia8
6
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/WminusH_HToMuMu_WToAll_M120_TuneCP5_PSweights_13TeV_powheg_pythia8
5
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/WminusH_HToMuMu_WToAll_M125_TuneCP5_PSweights_13TeV_powheg_pythia8
5
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/WminusH_HToMuMu_WToAll_M130_TuneCP5_PSweights_13TeV_powheg_pythia8
5
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/WplusH_HToMuMu_WToAll_M125_TuneCP5_PSweights_13TeV_powheg_pythia8
5
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/WWTo2L2Nu_NNPDF31_TuneCP5_13TeV-powheg-pythia8
36
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/WWW_4F_TuneCP5_13TeV-amcatnlo-pythia8
3
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/WWZ_TuneCP5_13TeV-amcatnlo-pythia8
3
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/WZTo2L2Q_13TeV_amcatnloFXFX_madspin_pythia8
101
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/WZTo3LNu_TuneCP5_13TeV-amcatnloFXFX-pythia8
56
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/WZZ_TuneCP5_13TeV-amcatnlo-pythia8
3
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ZH_HToMuMu_ZToAll_M120_TuneCP5_PSweights_13TeV_powheg_pythia8
4
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ZH_HToMuMu_ZToAll_M125_TuneCP5_PSweights_13TeV_powheg_pythia8
5
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ZH_HToMuMu_ZToAll_M130_TuneCP5_PSweights_13TeV_powheg_pythia8
5
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ZZTo2L2Q_13TeV_amcatnloFXFX_madspin_pythia8
20
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ZZTo4L_TuneCP5_13TeV_powheg_pythia8
28
/eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/ZZZ_TuneCP5_13TeV-amcatnlo-pythia8
3

So the data is the largest sample in terms of number of files. I need to update the config to integrate more lumi sections per job.
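
Concretely, for the lumi-based splitting used for data this means raising Data.unitsPerJob (the number of lumi sections per job). A sketch of the change, on a CRAB config object like the one sketched earlier; the value is illustrative and is tuned in the following comments:

# Sketch: pack more lumi sections into each data job to reduce the number of output files
config.Data.splitting   = 'LumiBased'
config.Data.unitsPerJob = 300   # lumi sections per job; illustrative, see the dry-run estimate below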

bortigno commented 4 years ago

I checked with Xunwu: he is using the 2018 ntuples in the phys_higgs area, but the 2016 ntuples from my user area, so I removed some of the 2018 samples to free up some space:

rm -rf /eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/DYJetsToLL_M-50_TuneCP5_13TeV-madgraphMLM-pythia8
rm -rf /eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/DYJetsToLL_M-105To160_TuneCP5_PSweights_13TeV-madgraphMLM-pythia8
rm -rf /eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l/TTJets_DiLept_TuneCP5_13TeV-madgraphMLM-pythia8

Now, after clearing some space in my directory, I am launching the new test production:

python crab/make_crab_script.py -s SingleMu_2018A H2Mu_gg ZJets_MG_1
./crab_2019_12_03_17_51-prod-v18.2.0-2-g445c468/submit_all.sh 
./crab_2019_12_03_17_51-prod-v18.2.0-2-g445c468/check_all.sh 

Success! Also the check_all.sh script is now working.

bortigno commented 4 years ago

Removing the 2018 production that is replicated in the group space

rm -rf /eos/cms/store/user/bortigno/h2mm/ntuples/2018/102X/prod-v18.1.6.skim3l*
bortigno commented 4 years ago

I now want to reduce the number of files for data. With "100" I get 2200 files for 2018; I think I can reduce that by a factor of 3. I am testing it now.

python crab/make_crab_script.py -s SingleMu_2018A H2Mu_gg ZJets_MG_1
crab submit -c crab_2019_12_05_10_52-prod-v18.2.0-2-g445c468/configs/SingleMu_2018A.py --dryrun

Will use CRAB configuration file crab_2019_12_05_10_52-prod-v18.2.0-2-g445c468/configs/SingleMu_2018A.py
Importing CMSSW configuration crab_2019_12_05_10_52-prod-v18.2.0-2-g445c468/analyzers/SingleMu_2018A.py
Finished importing CMSSW configuration crab_2019_12_05_10_52-prod-v18.2.0-2-g445c468/analyzers/SingleMu_2018A.py
Sending the request to the server at cmsweb.cern.ch
Success: Your task has been delivered to the prod CRAB3 server.
Waiting for task to be processed
Checking task status
Task status: NEW
Please wait...
Task status: HOLDING
Please wait...
Task status: QUEUED
Please wait...
Task status: QUEUED
Please wait...
Task status: UPLOADED

Creating temporary directory for dry run sandbox in /tmp/bortigno/tmpPg7uL2
Executing test, please wait...

Using LumiBased splitting
Task consists of 185 jobs to process 55394 lumis
The longest job will process 300 lumis, with an estimated processing time of 45499 minutes
The average job will process 299 lumis, with an estimated processing time of 25733 minutes
The shortest job will process 194 lumis, with an estimated processing time of 11168 minutes
The estimated memory requirement is 977 MB

Timing quantities given below are ESTIMATES. Keep in mind that external factors such as transient file-access delays can reduce estimate reliability.

For ~480 minute jobs, use: Data.unitsPerJob = 5
You will need to submit a new task

Dry run requested: task paused
To continue processing, use 'crab proceed'

Log file is /afs/cern.ch/work/b/bortigno/x2mm18_10211p1/src/Ntupliser/DiMuons/logs/crab_SingleMu_2018A_2019_12_05_10_52_prod2018_prod-v18p2p0-2-g445c468/crab.log

So 300 lumis per job seems excessive in terms of time: 45500 minutes are way too long.

bortigno commented 4 years ago

The job splitting seems to be suboptimal. Trying now with "Automatic" splitting.

git commit -m "Trying automated job splitting" crab/make_crab_script.py

[prod-v18.2.X 8584a94] Trying automated job splitting
 1 file changed, 3 insertions(+), 3 deletions(-)

crab submit -c ./crab_2019_12_05_16_19-prod-v18.2.0-3-g8584a94/configs/SingleMu_2018A.py --dryrun

The 'dryrun' option is not compatible with the 'Automatic' splitting mode (default).

OK, then I'll submit them to test it.

./crab_2019_12_05_16_19-prod-v18.2.0-3-g8584a94/submit_all.sh 

Invalid CRAB configuration: In case of Automatic splitting, the Data.unitsPerJob parameter must be in the [180, 2700] minutes range. You asked for 5 minutes.

So I am updating to 180 minutes for the test and 270 for the production, which I think could later be increased to something larger, e.g. one day (1440 minutes).
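
With "Automatic" splitting, Data.unitsPerJob changes meaning: it is the target job runtime in minutes and, as the error above says, must lie in the [180, 2700] range. A sketch of the updated setting (illustrative, same config object as in the earlier sketch):

# Sketch: Automatic splitting; unitsPerJob is a target runtime in minutes, not a number of lumis/files
config.Data.splitting   = 'Automatic'
config.Data.unitsPerJob = 180   # 180 min for the test; 270 min foreseen for production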

git commit --amend --no-edit

[prod-v18.2.X 0694a55] Trying automated job splitting
 Date: Thu Dec 5 16:18:49 2019 +0100
 1 file changed, 6 insertions(+), 6 deletions(-)

./crab_2019_12_05_17_27-prod-v18.2.0-3-g0694a55/submit_all.sh 
./crab_2019_12_05_17_27-prod-v18.2.0-3-g0694a55/check_all.sh 

Monitoring links: SingleMu_2018A H2Mu_gg ZJets_MG

They are submitted, but when I check the status I see something unfamiliar: when the splitting is "Automatic" the status summary is indeed a bit different. From this twiki:

The Data.splitting parameter has now a default value: 'Automatic'. With such a setting the task processing is split into three stages:

  • A "probe" stage, where some probe jobs are submitted to estimate the event throughput of the CMSSW parameter-set configuration provided by the user in the JobType.psetName parameter and possible further arguments. Probe jobs have a job id of the form 0-[1,2,3,...], they can not be resubmitted and the task will fail if none of the probe jobs complete successfully. The output files transfer is disabled for probe jobs.
  • A "main" stage, very similar to the conventional stage for other splitting modes, in which a number of main jobs (automatically determined by the probe stage) will process the dataset. These jobs can not be manually resubmitted and have a fixed maximum runtime (specified in the Data.unitsPerJob parameter), after which they gracefully stop processing input data. The remaining data will be processed in the next stage (tail) and their jobs labelled as "rescheduled" in the main stage (in the dashboard they will always appear as "failed").
  • Some possible "tail" stages. If some main job does not finish successfully ("rescheduled" in the previous stage) or does not completely process the amount of data assigned to it due to the automatically configured maximum job run time, tail jobs are created and submitted in order to fully process the dataset. Tail jobs have a job id of the form n-[1,2,3,...], where n=1,2,... represents the tail stage number. For small tasks, less than 100 jobs, one tail stage is started when all jobs have completed (successfully or failed). For larger tasks, a first tail stage collects all remaining input data from the first 50% of completed jobs, followed by a stage that processes data when 80% of jobs have completed, and finally a stage collecting leftover input data at 100% job completion. Failed tail jobs can be manually resubmitted by the users. Once the probe stage is completed, the plain crab status command shows only the main and tail jobs. For the list of all jobs add the --long option.

Given the above, I really think we need to move to "Automatic" splitting everywhere. The "tail" treatment also seems quite convenient.

This procedure created 938 jobs for data, 12 for the signal, and 555 for ZJets_MG.

bortigno commented 4 years ago

I updated the splitting to somewhat larger values: 1250 minutes for MC and 2700 for data. This will reduce the total number of files produced, and the "tail" stages should then take care of the jobs that run too long.

git commit --amend --no-edit

[prod-v18.2.X 40c3a84] Trying automated job splitting
 Date: Thu Dec 5 16:18:49 2019 +0100
 1 file changed, 6 insertions(+), 6 deletions(-)

Testing again:

python crab/make_crab_script.py -s SingleMu_2018A H2Mu_gg ZJets_MG_1
./crab_2019_12_05_18_49-prod-v18.2.0-3-g40c3a84/submit_all.sh 
bortigno commented 4 years ago

Adding some extra options to make_crab_script.py:

git commit -m "Adding testing options and username fetching for output dir." crab/make_crab_script.py

[prod-v18.2.X 3b89b5f] Adding testing options and username fetching for output dir.
 1 file changed, 18 insertions(+), 14 deletions(-)

bortigno commented 4 years ago

Preparing for full production of 3l skim using tag prod-v18.2.0.skim3l.

git tag -m "Main skim 3l production targeting Moriond 2020" prod-v18.2.0.skim3l
git push --tags
python crab/make_crab_script.py 
./crab_2019_12_09_17_58-prod-v18.2.0.skim3l/submit_all.sh 

https://github.com/UFLX2MuMu/Ntupliser/releases/tag/prod-v18.2.0.skim3l

bortigno commented 4 years ago

Global monitoring link: https://monit-grafana.cern.ch/d/cmsTMGlobal/cms-tasks-monitoring-globalview?orgId=11&var-user=bortigno&var-site=All&var-current_url=%2Fd%2FcmsTMDetail%2Fcms_task_monitoring&var-task=All&var-Filters=data.CRAB_Workflow%7C%3D~%7C.*.prod-v18p2p0.*

Overall status

[Screenshot: overall task status in Grafana, 2019-12-10 17:04]

For the data I have a fair amount of jobs failing with "too much RAM (50660)" or "too much wall clock time (50664)".

SingleMu_2018D graph

SingleMu_2018A graph

SingleMu_2018C graph

For SingleMu_2018B there are no failed jobs with 50660/50664, but there are some with 8021 and 8002.

SingleMu_2018B graph

For the moment all failed data jobs have been resubmitted as tail jobs. If the tail jobs also fail, we can resubmit by hand. More info here: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3FAQ#What_is_the_Automatic_splitting

bortigno commented 4 years ago

I also have some SUBMITFAILED for

191209_172531:bortigno_crab_ggZZ_2e2mu_2019_12_09_17_58_prod2018_prod-v18p2p0pskim3l
191209_172613:bortigno_crab_ggZZ_2mu2tau_2019_12_09_17_58_prod2018_prod-v18p2p0pskim3l
191209_172652:bortigno_crab_ggZZ_2e2tau_2019_12_09_17_58_prod2018_prod-v18p2p0pskim3l

For the following reason:

Failure message from server: CRAB refuses to proceed in getting the details of the dataset

/GluGluToContinToZZTo2e2mu_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM
/GluGluToContinToZZTo2mu2tau_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM
/GluGluToContinToZZTo2e2tau_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM

from DBS, because the dataset is not 'VALID' but 'INVALID'. To allow CRAB to consider a dataset that is not 'VALID', set Data.allowNonValidInputDataset = True in the CRAB configuration. Notice that this will not force CRAB to run over all files in the dataset; CRAB will still check if there are any valid files in the dataset and run only over those files.

and for

191209_171644:bortigno_crab_ZJets_hiM_MG_2019_12_09_17_58_prod2018_prod-v18p2p0pskim3l
191209_171443:bortigno_crab_ZJets_AMC_1_2019_12_09_17_58_prod2018_prod-v18p2p0pskim3l

because:

Failure message from server: Block

/DYJetsToLL_M-105To160_TuneCP5_PSweights_13TeV-madgraphMLM-pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM#895fce45-3e72-411b-ab4d-36c366b1bb2d
/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15_ext2-v1/MINIAODSIM#29e36df7-ad7d-4ecb-82f4-3f54bc0e80bd

contains more than 100000 lumis.
This blows up CRAB server memory.
CRAB can only split this by ignoring lumi information. You can do this
using FileBased split algorithm and avoiding any additional request
which may cause lumi information to be looked up. See CRAB FAQ for more info:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3FAQ
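
For the record, the two fixes these server messages point to would look like the sketch below in the CRAB config (illustrative, same config object as in the earlier sketch). In the end the ggZZ datasets were simply replaced by newer valid versions, see the next comment.

# Sketch of the options the server messages suggest
# 1) Accept a dataset whose DBS status is not 'VALID' (CRAB still runs only over its VALID files):
config.Data.allowNonValidInputDataset = True
# 2) For blocks with more than 100000 lumis, split by files and avoid anything that triggers
#    a lumi lookup (e.g. a lumi mask or run range):
config.Data.splitting   = 'FileBased'
config.Data.unitsPerJob = 1    # files per job; illustrative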
bortigno commented 4 years ago

ggZZ samples "ggZZ_2e2mu" "ggZZ_2mu2tau" "ggZZ_2e2tau" that SUBMITFAILED had been invalidated and they are now using a new version in the "processed dataset" name.

python checkSampleOnDAS.py 

/GluGluToContinToZZTo2e2mu_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM does not correspond to any VALID DAS sample.
Possible alternatives: 
/GluGluToContinToZZTo2e2mu_*/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v*-v*/MINIAODSIM
[{'dataset': '/GluGluToContinToZZTo2e2mu_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM'}, {'dataset': '/GluGluToContinToZZTo2e2mu_13TeV_TuneCP5_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM'}, {'dataset': '/GluGluToContinToZZTo2e2mu_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v3/MINIAODSIM'}]
/GluGluToContinToZZTo2mu2tau_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM does not correspond to any VALID DAS sample.
Possible alternatives: 
/GluGluToContinToZZTo2mu2tau_*/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v*-v*/MINIAODSIM
[{'dataset': '/GluGluToContinToZZTo2mu2tau_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM'}, {'dataset': '/GluGluToContinToZZTo2mu2tau_13TeV_TuneCP5_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM'}, {'dataset': '/GluGluToContinToZZTo2mu2tau_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v4/MINIAODSIM'}]
/GluGluToContinToZZTo2e2tau_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM does not correspond to any VALID DAS sample.
Possible alternatives: 
/GluGluToContinToZZTo2e2tau_*/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v*-v*/MINIAODSIM
[{'dataset': '/GluGluToContinToZZTo2e2tau_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM'}, {'dataset': '/GluGluToContinToZZTo2e2tau_13TeV_TuneCP5_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v1/MINIAODSIM'}, {'dataset': '/GluGluToContinToZZTo2e2tau_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v3/MINIAODSIM'}]

So I updated the process name in python/Samples.py and re-submitted:

git commit -m "Update version of ggZZ samples" python/Samples.py

[prod-v18.2.X 67a5bb1] Update version of ggZZ samples 1 file changed, 6 insertions(+), 4 deletions(-)

python crab/make_crab_script.py -s ggZZ_2e2mu ggZZ_2mu2tau ggZZ_2e2tau
./crab_2019_12_10_18_57-prod-v18.2.0.skim3l-1-g67a5bb1/submit_all.sh 
bortigno commented 4 years ago

Production going fine.

bortigno commented 4 years ago

For the data production, after all automated resubmissions, the situation was the following:

SingleMu A : 2/295 failed
SingleMu B : 1/111 failed
SingleMu C : 0/123 failed
SingleMu D : 125/625 failed

So I had to create a new task for SingleMu_2018D, as described in the CRAB twikis:

crab report -d /afs/cern.ch/work/b/bortigno/x2mm18_10211p1/src/Ntupliser/DiMuons/logs/crab_SingleMu_2018D_2019_12_09_17_58_prod2018_prod-v18p2p0pskim3l

Running crab status first to fetch necessary information.
Will save lumi files into output directory /afs/cern.ch/work/b/bortigno/x2mm18_10211p1/src/Ntupliser/DiMuons/logs/crab_SingleMu_2018D_2019_12_09_17_58_prod2018_prod-v18p2p0pskim3l/results
Summary from jobs in status 'finished':
 Number of files processed: 4568
 Number of events read: 412454710
 Number of events written in EDM files: 0
 Number of events written in TFileService files: 0
 Number of events written in other type of files: 0
Processed lumis written to processedLumis.json
Warning: 'notFinished' lumis written to notFinishedLumis.json
The 'notFinished' lumis were calculated as: the lumis to process minus the processed lumis.
Additional report lumi files:
 Input dataset lumis (from DBS, at task submission time) written to inputDatasetLumis.json
 Lumis to process written to lumisToProcess.json
Log file is /afs/cern.ch/work/b/bortigno/x2mm18_10211p1/src/Ntupliser/DiMuons/logs/crab_SingleMu_2018D_2019_12_09_17_58_prod2018_prod-v18p2p0pskim3l/crab.log
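
As the report says, the 'notFinished' lumis are simply the lumis to process minus the processed lumis. A quick way to reproduce or inspect that difference locally is the CMSSW LumiList utility; this is a sketch, assuming a CMSSW environment and the JSON files written by crab report above:

# Sketch: recompute the missing lumis from the crab report JSON files
from FWCore.PythonUtilities.LumiList import LumiList

results    = 'logs/crab_SingleMu_2018D_2019_12_09_17_58_prod2018_prod-v18p2p0pskim3l/results'
to_process = LumiList(filename=results + '/lumisToProcess.json')
processed  = LumiList(filename=results + '/processedLumis.json')

missing = to_process - processed                            # lumi-section set difference
missing.writeJSON(results + '/myNotFinishedLumis.json')     # should match notFinishedLumis.json
print(missing)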

Then I copied the config file to a new one ("_missinglumi") and edited the new lumi_mask:

cp crab_2019_12_09_17_58-prod-v18.2.0.skim3l/configs/SingleMu_2018D.py crab_2019_12_09_17_58-prod-v18.2.0.skim3l/configs/SingleMu_2018D_missinglumi.py
vi crab_2019_12_09_17_58-prod-v18.2.0.skim3l/configs/SingleMu_2018D_missinglumi.py

and submitted

bortigno@lxplus601:~/workspace/x2mm18_10211p1/src/Ntupliser/DiMuons$ crab submit -c crab_2019_12_09_17_58-prod-v18.2.0.skim3l/configs/SingleMu_2018D_missinglumi.py

This created 95 jobs that are now fully successful.
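
For reference, the only intended difference in the _missinglumi config is the lumi mask (plus a new request name so CRAB treats it as a separate task). A sketch of the edit, assuming the standard Data.lumiMask parameter; the request name and path are illustrative:

# Sketch of the edit in SingleMu_2018D_missinglumi.py (on top of the original generated config)
config.General.requestName = 'SingleMu_2018D_missinglumi_prod2018_prod-v18p2p0pskim3l'  # illustrative
config.Data.lumiMask = 'logs/crab_SingleMu_2018D_2019_12_09_17_58_prod2018_prod-v18p2p0pskim3l/results/notFinishedLumis.json'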


When I tried to do the same with SingleMuA and SingleMuB, the tasks had already expired and I cannot retrieve the lumi_mask.

eyigitba commented 4 years ago

When I tried to do the same with SingleMuA and SingleMuB, the tasks had already expired and I cannot retrieve the lumi_mask.

Did you get these to work @bortigno ?