cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Crash when running ALCA:DQM step on ALCARECO in PilotBeam collision run #36014

Closed francescobrivio closed 2 years ago

francescobrivio commented 3 years ago

While setting up new relval workflows to use the recent pp collisions from the 2021 PilotBeam, I found that this crash happens when I add the ALCA:DQM step to the prompt workflow:

----- Begin Fatal Exception 05-Nov-2021 17:57:30 CET-----------------------
An exception of category 'NoProxyException' occurred while
   [0] Processing  stream begin Run run: 346511 stream: 2
   [1] Calling method for module SiStripMonitorCluster/'SiStripCalZeroBiasMonitorCluster'
Exception Message:
No data of type "TkDetMap" with label "" in record "TrackerTopologyRcd"
 Please add an ESSource or ESProducer to your job which can deliver this data.
----- End Fatal Exception -------------------------------------------------

The same workflow and step (ALCA:DQM) running on cosmic data (138.1) do not crash.
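When comparing logs from several workflows (for example the collision and cosmic ones mentioned above), it can help to pull the key fields out of each fatal-exception block automatically. The following is a minimal sketch that assumes the log uses exactly the block layout shown above; the function name `summarize_fatal_exceptions` and the field names in the returned dictionaries are hypothetical, and the sample text is copied from the exception quoted in this issue.

```python
import re

# Matches one "Begin Fatal Exception ... End Fatal Exception" block.
FATAL_BLOCK = re.compile(
    r"-+ Begin Fatal Exception (?P<when>.*?)-+\n(?P<body>.*?)-+ End Fatal Exception -+",
    re.DOTALL,
)
CATEGORY = re.compile(r"An exception of category '(?P<category>[^']+)'")
MODULE = re.compile(r"Calling method for module (?P<type>\w+)/'(?P<label>[^']+)'")
MISSING = re.compile(
    r'No data of type "(?P<type>[^"]*)" with label "(?P<label>[^"]*)" '
    r'in record "(?P<record>[^"]*)"'
)


def summarize_fatal_exceptions(log_text):
    """Return one dict per fatal-exception block found in log_text."""
    summaries = []
    for block in FATAL_BLOCK.finditer(log_text):
        body = block.group("body")
        category = CATEGORY.search(body)
        module = MODULE.search(body)
        missing = MISSING.search(body)
        summaries.append({
            "when": block.group("when").strip(),
            "category": category.group("category") if category else None,
            "module_type": module.group("type") if module else None,
            "module_label": module.group("label") if module else None,
            "missing_type": missing.group("type") if missing else None,
            "record": missing.group("record") if missing else None,
        })
    return summaries


# Sample copied from the exception reported in this issue.
SAMPLE_LOG = """----- Begin Fatal Exception 05-Nov-2021 17:57:30 CET-----------------------
An exception of category 'NoProxyException' occurred while
   [0] Processing  stream begin Run run: 346511 stream: 2
   [1] Calling method for module SiStripMonitorCluster/'SiStripCalZeroBiasMonitorCluster'
Exception Message:
No data of type "TkDetMap" with label "" in record "TrackerTopologyRcd"
 Please add an ESSource or ESProducer to your job which can deliver this data.
----- End Fatal Exception -------------------------------------------------
"""

for summary in summarize_fatal_exceptions(SAMPLE_LOG):
    print(summary)
```

Running the function over the step3 log file contents instead of `SAMPLE_LOG` should give the module label and the missing EventSetup type at a glance, provided the log keeps this layout.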

The recipe for reproducing the crash is:

echo '{
"346511" : [[1, 2]]
}' > step1_lumiRanges.log  2>&1

(dasgoclient --limit 0 --query 'lumi,file dataset=/ZeroBias/Commissioning2021-v1/RAW run=346511' --format json | das-selected-lumis.py 1,2 ) | ibeos-lfn-sort > step1_dasquery.log 2>&1

cmsDriver.py step2  --conditions auto:run3_data_prompt -s RAW2DIGI,L1Reco,RECO,EI,PAT,ALCA:SiStripCalZeroBias+SiStripCalMinBias+TkAlMinBias+EcalESAlign,DQM:@standardDQMFakeHLT+@miniAODDQM --datatier RECO,MINIAOD,DQMIO --eventcontent RECO,MINIAOD,DQM --data  --process reRECO --scenario pp --era Run3 --customise Configuration/DataProcessing/RecoTLR.customisePrompt -n 100  --filein filelist:step1_dasquery.log --lumiToProcess step1_lumiRanges.log --fileout file:step2.root  --nThreads 4 > step2_PromptCollisions+RunZeroBias2021+RECODPROMPTRUN3+ALCAPROMPRRUN3+HARVESTDPROMPTR3.log 2>&1

cmsDriver.py step3  -s ALCA:SiStripCalZeroBias+SiStripCalMinBias+TkAlMinBias+HcalCalHO+HcalCalIterativePhiSym+HcalCalHBHEMuonFilter+HcalCalIsoTrkFilter+DQM --conditions auto:run3_data_prompt --scenario pp --era Run3 --datatier ALCARECO --eventcontent ALCARECO --triggerResultsProcess RECO -n 100  --filein  file:step2.root  --fileout file:step3.root  --nThreads 4 > step3_PromptCollisions+RunZeroBias2021+RECODPROMPTRUN3+ALCAPROMPRRUN3+HARVESTDPROMPTR3.log  2>&1

The crash is observed both in 12_2_X and 12_0_X (I didn't test 12_1_X, but I guess it's safe to assume it will crash as well). This could possibly lead to crashing production when re-recoing the recent PilotBeam data.
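The lumi-range file written by the first command of the recipe is a small JSON object mapping a run number to a list of lumi-section ranges. As a rough sketch, the same file could be produced and checked from Python; the helper names `write_lumi_mask` and `lumi_selected` are hypothetical, and the ranges are assumed to be inclusive on both ends.

```python
import json


def write_lumi_mask(path, mask):
    """Write a {run: [[first_lumi, last_lumi], ...]} mask as JSON."""
    with open(path, "w") as handle:
        json.dump({str(run): ranges for run, ranges in mask.items()}, handle, indent=2)


def lumi_selected(mask, run, lumi):
    """Return True if (run, lumi) lies inside one of the ranges.

    Assumption: each [first, last] range is inclusive on both ends.
    """
    return any(first <= lumi <= last for first, last in mask.get(str(run), []))


# Same content as the step1_lumiRanges.log file in the recipe above.
mask = {"346511": [[1, 2]]}
```

Calling `write_lumi_mask("step1_lumiRanges.log", mask)` would write the same run and range as the `echo` command, and `lumi_selected(mask, 346511, 2)` can be used to sanity-check a selection before launching the job.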

cmsbuild commented 3 years ago

A new Issue was created by @francescobrivio .

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

francescobrivio commented 3 years ago

FYI @cms-sw/tracking-pog-l2 @cms-sw/trk-dpg-l2

makortel commented 3 years ago

assign dqm

cmsbuild commented 3 years ago

New categories assigned: dqm

@jfernan2,@ahmad3213,@rvenditti,@emanueleusai,@pbo0,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented 3 years ago

@francescobrivio I am not sure I understand why the whole

ALCA:SiStripCalZeroBias+SiStripCalMinBias+TkAlMinBias

part is duplicated in step2 and step3.... Anyways this:

diff --git a/DQM/SiStripMonitorCluster/python/SiStripMonitorClusterAlca_cfi.py b/DQM/SiStripMonitorCluster/python/SiStripMonitorClusterAlca_cfi.py
index f5a8025a803..6a94c9d898f 100644
--- a/DQM/SiStripMonitorCluster/python/SiStripMonitorClusterAlca_cfi.py
+++ b/DQM/SiStripMonitorCluster/python/SiStripMonitorClusterAlca_cfi.py
@@ -2,6 +2,9 @@ import FWCore.ParameterSet.Config as cms

 from DQM.SiStripMonitorCluster.SiStripMonitorCluster_cfi import *

+# needed to run ALCA:DQM without RECO+DQM in the same step
+from CalibTracker.SiStripCommon.TkDetMapESProducer_cfi import *
+
 # SiStripMonitorCluster
 SiStripCalZeroBiasMonitorCluster = SiStripMonitorCluster.clone(
     ClusterProducerStrip = "calZeroBiasClusters",

makes step 3 work (I haven't tested whether it breaks anything else), though I have the impression that ALCA:SiStripCalZeroBias shouldn't be run in a step that does not also run RECO+DQM.
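For a quick local test without patching the cfi file, the same ESProducer could in principle be loaded through a customise function passed to cmsDriver.py with `--customise`. This is only a sketch and has not been run on this workflow: the function name `customiseTkDetMap` and the file it would live in are hypothetical, and only the module path comes from the diff above.

```python
import FWCore.ParameterSet.Config as cms


def customiseTkDetMap(process):
    # Hypothetical customise function: loads the TkDetMap ESProducer so that
    # SiStripCalZeroBiasMonitorCluster can find "TkDetMap" in TrackerTopologyRcd
    # even when RECO+DQM are not part of the same step.
    process.load("CalibTracker.SiStripCommon.TkDetMapESProducer_cfi")
    return process
```

The diff above remains the more direct fix, since it makes the ALCA DQM configuration self-sufficient instead of relying on each job to add the customisation.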

Finally, let me note that the step2 log contains many warnings like:

%MSG-w BeamSpotFromDB:  BeamSpotOnlineProducer:scalerBeamSpot  05-Nov-2021 23:58:04 CET Run: 346511 Event: 639592
Online Beam Spot producer falls back to DB value because the ESProducer returned a fake beamspot 
%MSG

francescobrivio commented 3 years ago

> @francescobrivio I am not sure I understand why the whole
>
> ALCA:SiStripCalZeroBias+SiStripCalMinBias+TkAlMinBias
>
> part is duplicated in step2 and step3.... Anyways this:

That's an error in how I configured the job, I think; I'm trying to fix it.

OK, thanks for the quick solution! I have a follow-up question: do we need the DQM step for the prompt/express relval workflows?

mmusich commented 3 years ago

> I have a follow-up question: do we need the DQM step for the prompt/express relval workflows?

Do you mean the standard DQM or the ALCA DQM? In either case I'd say yes.

francescobrivio commented 2 years ago

The discussion has been moved to https://github.com/cms-sw/cmssw/pull/36133 where, after properly reconfiguring the workflow, this is no longer an issue.