cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.3k forks source link

Consumes failure in `BPHMonitor` #39017

Closed mmusich closed 2 years ago

mmusich commented 2 years ago

Starting from CMSSW_12_5_X_2022-08-09-1100, some workflows are failing with:

----- Begin Fatal Exception 09-Aug-2022 18:35:43 CEST-----------------------
An exception of category 'BadToken' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 9 stream: 0
   [1] Running path 'dqmoffline_1_step'
   [2] Calling method for module BPHMonitor/'Dimuon0_L3TnP_Upsilon'
Exception Message:
A get using a EDGetToken with the C++ type 'BXVector<GlobalAlgBlk>' was made using an uninitialized token.
 Please check that the variable is being initialized from a 'consumes' call.
----- End Fatal Exception -------------------------------------------------

e.g. 11634.911. First spotted in https://github.com/cms-sw/cmssw/pull/38760#issuecomment-1209651811.

mmusich commented 2 years ago

assign dqm

cmsbuild commented 2 years ago

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 years ago

A new Issue was created by @mmusich Marco Musich.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 2 years ago

@cms-sw/alca-l2 @cms-sw/hlt-l2 I am copying you as the more substantive recent changes are on your area.

mmusich commented 2 years ago

Actually also https://github.com/cms-sw/cmssw/pull/38917 could be the culprit: @Dr15Jones

makortel commented 2 years ago

The log of 11634.911 step 2 has many

%MSG-w L1GtUtils:  L1TRawToDigi:gtStage2Digis@ctor 09-Aug-2022 14:22:40 CEST  pre-events
Found preferred tag InputTag:  label = gtStage2Digis, instance = , process = RECO
 after having set unpreferred tag (InputTag:  label = valGtStage2Digis, instance = , process = RECO) for GlobalAlgBlkBxCollection.
 Please change configuration to explicitly use the tag given above.
 This will avoid unnecessary prefetching of data not used.
%MSG

and

%MSG-e L1GtUtils:  AfterSourceConstruction 09-Aug-2022 14:23:13 CEST pre-events
Found multiple preferred input tags for GlobalAlgBlkBxCollection product, 
with different instaces or processes.
Tag already found: InputTag:  label = gtStage2Digis, instance = , process = RECO
Other tag: InputTag:  label = hltGtStage2Digis, instance = , process = HLT
Token set to invalid.
%MSG

(GlobalAlgBlkBxCollection is a type alias for BXVector<GlobalAlgBlk>). Unfortunately these messages don't tell anything about the modules that are using L1TGlobalUtilsHelper (we are thinking ways to improve the message decorations or something for that).

It seems to me that there are many modules that have been configured incorrect for a long time, and now https://github.com/cms-sw/cmssw/pull/38917 (and https://github.com/cms-sw/cmssw/pull/38900) reveal those, BPHMonitor likely being one of them.

makortel commented 2 years ago

Adding also @cms-sw/l1-l2

Dr15Jones commented 2 years ago

The warning about Found multiple preferred input tags is most likely the origin of the problem. I took another look at the code before my change. Previously, when multiple preferred input tags were found the code would issue a LogDebug message, unset the edm::InputTag it has previously set but NOT reset the EDGetTokenT it had set. So the code would get which ever data it happened to be handed first, which is NOT a good behavior.

Dr15Jones commented 2 years ago

So I changed the Error message to an exception and added context about the module and now I see

----- Begin Fatal Exception 09-Aug-2022 14:42:17 CDT-----------------------
An exception of category 'L1GtUtils::TooManyChoices' occurred while
   [0] Constructing the EventProcessor
   [1] Running 'callWhenNewProductRegistered' for module Dimuon25_Jpsi_tnp
Exception Message:
Found multiple preferred input tags for GlobalAlgBlkBxCollection product, 
with different instaces or processes.
Tag already found: InputTag:  label = gtStage2Digis, instance = , process = RECO
Other tag: InputTag:  label = hltGtStage2Digis, instance = , process = HLT
----- End Fatal Exception -------------------------------------------------
Dr15Jones commented 2 years ago

Looking at the configuration I find

>>> print(process.Dimuon25_Jpsi_tnp.dumpPython())
cms.EDProducer("BPHMonitor",
    DMSelection_ref = cms.string('Pt>4 & abs(eta)'),
    FolderName = cms.string('HLT/BPH/DiMu25_Jpsi_noCorr/'),
    Jpsi = cms.int32(0),
    L3 = cms.int32(0),
    Upsilon = cms.int32(0),
    beamSpot = cms.InputTag("offlineBeamSpot"),
    denGenericTriggerEventPSet = cms.PSet(
        ReadPrescalesFromFile = cms.bool(True),
        andOr = cms.bool(False),
        andOrDcs = cms.bool(False),
        andOrHlt = cms.bool(True),
        andOrL1 = cms.bool(False),
        dbLabel = cms.string(''),
        dcsInputTag = cms.InputTag("scalersRawToDigi"),
        dcsPartitions = cms.vint32(),
        dcsRecordInputTag = cms.InputTag("onlineMetaDataDigis"),
        errorReplyDcs = cms.bool(True),
        errorReplyHlt = cms.bool(False),
        errorReplyL1 = cms.bool(False),
        hltDBKey = cms.string(''),
        hltInputTag = cms.InputTag("TriggerResults","","HLT"),
        hltPaths = cms.vstring('HLT_Mu7p5_Track2_Jpsi_v*'),
        l1Algorithms = cms.vstring(),
        l1BeforeMask = cms.bool(False),
        l1tAlgBlkInputTag = cms.InputTag("gtStage2Digis"),
        l1tExtBlkInputTag = cms.InputTag("gtStage2Digis"),
        stage2 = cms.bool(True),
        verbosityLevel = cms.uint32(0)
    ),
    displaced = cms.int32(0),
    enum = cms.int32(1),
    histoPSet = cms.PSet(
        BmassPSet = cms.PSet(
            nbins = cms.uint32(20),
            xmax = cms.double(5.5),
            xmin = cms.double(5.1)
        ),
        cosPSet = cms.PSet(
            nbins = cms.uint32(10),
            xmax = cms.double(1),
            xmin = cms.double(0.9)
        ),
        d0PSet = cms.PSet(
            nbins = cms.uint32(50),
            xmax = cms.double(5),
            xmin = cms.double(-5.0)
        ),
        dMuPtBinning = cms.vdouble(
            6, 8, 12, 16, 20,
            25, 30, 35, 40, 50,
            70
        ),
        dRPSet = cms.PSet(
            nbins = cms.uint32(26),
            xmax = cms.double(1.3),
            xmin = cms.double(0)
        ),
        dcaPSet = cms.PSet(
            nbins = cms.uint32(10),
            xmax = cms.double(0.5),
            xmin = cms.double(0)
        ),
        dsPSet = cms.PSet(
            nbins = cms.uint32(15),
            xmax = cms.double(60),
            xmin = cms.double(0)
        ),
        etaPSet = cms.PSet(
            nbins = cms.uint32(12),
            xmax = cms.double(2.4),
            xmin = cms.double(-2.4)
        ),
        massPSet = cms.PSet(
            nbins = cms.uint32(140),
            xmax = cms.double(7.0),
            xmin = cms.double(0)
        ),
        phiPSet = cms.PSet(
            nbins = cms.uint32(8),
            xmax = cms.double(3.2),
            xmin = cms.double(-3.2)
        ),
        probBinning = cms.vdouble(
            0.01, 0.02, 0.04, 0.06, 0.08,
            0.1, 0.2, 0.3, 0.4, 0.5,
            0.6, 0.7, 0.8, 0.9, 1.0
        ),
        ptBinning = cms.vdouble(
            -0.5, 0, 2, 4, 8,
            10, 12, 16, 20, 25,
            30, 35, 40, 50
        ),
        z0PSet = cms.PSet(
            nbins = cms.uint32(60),
            xmax = cms.double(15),
            xmin = cms.double(-15)
        )
    ),
    hltTriggerSummaryAOD = cms.InputTag("hltTriggerSummaryAOD","","HLT"),
    kaon_mass = cms.double(0.493677),
    max_dR = cms.double(1.4),
    maxmass = cms.double(3.596),
    maxmassJpsi = cms.double(3.2),
    maxmassJpsiTk = cms.double(5.46),
    maxmassTkTk = cms.double(10),
    maxmassUpsilon = cms.double(10),
    mightGet = cms.optional.untracked.vstring,
    minDS = cms.double(3),
    min_dR = cms.double(0.001),
    mincos = cms.double(0.95),
    minmass = cms.double(2.596),
    minmassJpsi = cms.double(3),
    minmassJpsiTk = cms.double(5.1),
    minmassTkTk = cms.double(0),
    minmassUpsilon = cms.double(8.8),
    minprob = cms.double(0.005),
    mu_mass = cms.double(0.1056583745),
    muoSelection = cms.string('abs(eta)<2.4 & isPFMuon & isGlobalMuon  & innerTrack.hitPattern.trackerLayersWithMeasurement>5 & innerTrack.hitPattern.numberOfValidPixelHits>0 '),
    muoSelection_probe = cms.string('isPFMuon & isGlobalMuon  & innerTrack.hitPattern.trackerLayersWithMeasurement>5 & innerTrack.hitPattern.numberOfValidPixelHits> 0'),
    muoSelection_ref = cms.string('pt>7.5 && abs(eta)<2.4 & isPFMuon & isGlobalMuon  & innerTrack.hitPattern.trackerLayersWithMeasurement>5 & innerTrack.hitPattern.numberOfValidPixelHits>0 '),
    muoSelection_tag = cms.string('isGlobalMuon && isPFMuon && isTrackerMuon && abs(eta) < 2.4 && innerTrack.hitPattern.numberOfValidPixelHits > 0 && innerTrack.hitPattern.trackerLayersWithMeasurement > 5 && globalTrack.hitPattern.numberOfValidMuonHits > 0 && globalTrack.normalizedChi2 < 10'),
    muons = cms.InputTag("muons"),
    nmuons = cms.int32(1),
    numGenericTriggerEventPSet = cms.PSet(
        ReadPrescalesFromFile = cms.bool(True),
        andOr = cms.bool(False),
        andOrDcs = cms.bool(False),
        andOrHlt = cms.bool(True),
        andOrL1 = cms.bool(True),
        dbLabel = cms.string(''),
        dcsInputTag = cms.InputTag("scalersRawToDigi"),
        dcsPartitions = cms.vint32(),
        dcsRecordInputTag = cms.InputTag("onlineMetaDataDigis"),
        errorReplyDcs = cms.bool(False),
        errorReplyHlt = cms.bool(False),
        errorReplyL1 = cms.bool(True),
        hltDBKey = cms.string(''),
        hltInputTag = cms.InputTag("TriggerResults","","HLT"),
        hltPaths = cms.vstring('HLT_Dimuon25_Jpsi_noCorrL1_v*'),
        l1Algorithms = cms.vstring(),
        l1BeforeMask = cms.bool(True),
        l1tAlgBlkInputTag = cms.InputTag("gtStage2Digis"),
        l1tExtBlkInputTag = cms.InputTag("gtStage2Digis"),
        stage2 = cms.bool(True),
        verbosityLevel = cms.uint32(0)
    ),
    offlinePVs = cms.InputTag("offlinePrimaryVertices"),
    photons = cms.InputTag("photons"),
    ptCut = cms.int32(0),
    requireValidHLTPaths = cms.bool(True),
    seagull = cms.int32(0),
    stageL1Trigger = cms.uint32(2),
    tnp = cms.bool(True),
    trOrMu = cms.int32(0),
    trSelection_ref = cms.string(''),
    tracks = cms.InputTag("generalTracks")
)

which seems to show the inputs ARE specified. So something is up.

missirol commented 2 years ago

I don't fully follow yet, but I leave a quick guess: I suspect an instance of HLTPrescaleProvider (created by BPHMonitor) is creating a L1TGlobalUtil which in turn makes a L1TGlobalUtilHelper with empty InputTags, and that could lead to the routine searching for the preferred input collection, then the warning/error (links below).

Just for my education: in IB CMSSW_12_5_X_2022-08-09-1100, I see 11634.911 failing for el8_amd64_gcc1*, but not for slc7_amd64_gcc10. Why is that?


https://github.com/cms-sw/cmssw/blob/e52f99a59bb77ceec0102769446b26c321662ae8/DQMOffline/Trigger/plugins/BPHMonitor.cc#L237 https://github.com/cms-sw/cmssw/blob/e52f99a59bb77ceec0102769446b26c321662ae8/HLTrigger/HLTcore/interface/HLTPrescaleProvider.h#L126 https://github.com/cms-sw/cmssw/blob/e52f99a59bb77ceec0102769446b26c321662ae8/L1Trigger/L1TGlobal/interface/L1TGlobalUtil.h#L255 https://github.com/cms-sw/cmssw/blob/e52f99a59bb77ceec0102769446b26c321662ae8/L1Trigger/L1TGlobal/interface/L1TGlobalUtilHelper.h#L95

Dr15Jones commented 2 years ago

So there appears to be a 'hidden' use of L1TGlobalUtilHelper in BPHMonitor and it is passed the "wrong" configuration.

https://github.com/cms-sw/cmssw/blob/e52f99a59bb77ceec0102769446b26c321662ae8/DQMOffline/Trigger/plugins/BPHMonitor.cc#L237

where the module's full configuration is passed in which goes to https://github.com/cms-sw/cmssw/blob/e52f99a59bb77ceec0102769446b26c321662ae8/HLTrigger/HLTcore/interface/HLTPrescaleProvider.h#L126

which goes to https://github.com/cms-sw/cmssw/blob/e52f99a59bb77ceec0102769446b26c321662ae8/L1Trigger/L1TGlobal/interface/L1TGlobalUtil.h#L255

and finally to

https://github.com/cms-sw/cmssw/blob/e52f99a59bb77ceec0102769446b26c321662ae8/L1Trigger/L1TGlobal/interface/L1TGlobalUtilHelper.h#L107-L117

and the top level configuration for the module (which is what was passed) is of no use here and therefore we get the bad behavior.

Dr15Jones commented 2 years ago

So the question is, what is the purpose of the HLTPrescaleProvider in BPHMonitor and how was is SUPPOSED to be configured?

makortel commented 2 years ago

Just for my education: in IB CMSSW_12_5_X_2022-08-09-1100, I see 11634.911 failing for el8_amd64_gcc1*, but not for slc7_amd64_gcc10. Why is that?

No clue, but I'm also wondering why we see the failures only in the .911 (DD4Hep) and .914 (DDD) workflow variations rather than everywhere.

Dr15Jones commented 2 years ago

I feel that the underlying failure in the design of this code is the use of existAs to test if a parameter is available.

missirol commented 2 years ago

So the question is, what is the purpose of the HLTPrescaleProvider in BPHMonitor

It is used in various ways inside BPHMonitor, for one to get the value of the prescale(s) of certain triggers, and this info is used to fill certain DQM outputs.

and how was is SUPPOSED to be configured?

I'd say using gtStage2Digis.

Taking wf 11634.911 as an example, hltGtStage2Digis (L1T digis of step-2) and gtStage2Digis (L1T digis of step-3) are not literally the same, but I think they are effectively the same [*]. Other parts of the DQM module of type BPHMonitor (and this likely applies to other DQM modules) use gtStage2Digis, so that would likely be the one to pass to HLTPrescaleProvider.

I plan to have a deeper look and, since this involves HLTPrescaleProvider, I will assign HLT.

[*]

hltGtStage2Digis = cms.EDProducer("L1TRawToDigi",
    CTP7 = cms.untracked.bool(False),
    DmxFWId = cms.uint32(0),
    FWId = cms.uint32(0),
    FWOverride = cms.bool(False),
    FedIds = cms.vint32(1404),
    InputLabel = cms.InputTag("rawDataCollector"),
    MTF7 = cms.untracked.bool(False),
    MinFeds = cms.uint32(0),
    Setup = cms.string('stage2::GTSetup'),
    TMTCheck = cms.bool(True),
    debug = cms.untracked.bool(False),
    lenAMC13Header = cms.untracked.int32(8),
    lenAMC13Trailer = cms.untracked.int32(8),
    lenAMCHeader = cms.untracked.int32(8),
    lenAMCTrailer = cms.untracked.int32(0),
    lenSlinkHeader = cms.untracked.int32(8),
    lenSlinkTrailer = cms.untracked.int32(8)
)

gtStage2Digis = cms.EDProducer("L1TRawToDigi",
    FedIds = cms.vint32(1404),
    InputLabel = cms.InputTag("rawDataCollector"),
    MinFeds = cms.uint32(1),
    Setup = cms.string('stage2::GTSetup')
)

In my understanding, the parameters not specified in gtStage2Digis use the default values from fillDescriptions. Comparing to those, the only difference between the two modules is MinFeds, which is 0 at HLT and 1 in step-3. On the other hand, this parameter only controls the appearance of a LogWarning, so the actual numbers inside hltGtStage2Digis and gtStage2Digis should be the same (unless I'm missing something..).

missirol commented 2 years ago

assign hlt

cmsbuild commented 2 years ago

New categories assigned: hlt

@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 2 years ago

In my understanding, the parameters not specified in gtStage2Digis use the default values from fillDescriptions.

Unfortunately, the BPHMonitor::fillDescriptions does not contain entries needed to configure the InputTags used by the HLTPrescaleProvider. I'm presently working on a PR that does add the needed parameters and sets them to gtStage2Digis by default.

fwyzard commented 2 years ago

I'd say using gtStage2Digis.

I agree. In general (for example, during production jobs), hltGtStage2Digis should not be available during step 3.

missirol commented 2 years ago

+hlt