missirol commented 1 year ago

The reproducer in [1] (CMSSW_13_0_5_patch1, input file on lxplus) tries to use options.skipEvent (documented, for example, in SWGuideEdmExceptionUse#Framework_Exception_Handling) in order skip an event which is known to throw an exception of type "InvalidGlobalAlgBlkBxCollection" from the module hltStage2GtDigis.

Naively, I was expecting the job to skip the event and succeed. Instead, I see that the job fails because a different module on one EndPath throws a different exception ("ProductNotFound") while attempting to access the products of hltStage2GtDigis (which are likely not produced because hltStage2GtDigis fails due to "InvalidGlobalAlgBlkBxCollection"). The error message of the reproducer is in [2]. @fwyzard spotted that the message quotes "Begin IgnoreCompletely", and does not quote "Begin SkipEvent". A simple search leads me to this:

      // If we are processing an endpath and the module was scheduled, treat SkipEvent or FailPath
      // as IgnoreCompletely, so any subsequent OutputModules are still run.
      // For unscheduled modules only treat FailPath as IgnoreCompletely but still allow SkipEvent to throw

One workaround is to include ProductNotFound in options.skipEvent.

Question: are there "better" ways ?

Context : this issue is related to https://github.com/cms-sw/cmssw/issues/41489#issuecomment-1532696126, as we look into the feasibility of using options.skipEvent to avoid the frequent HLT crashes seen online these days due to the L1T unpacker (CMSLITOPS-411).

FYI: @silviodonato @cms-sw/hlt-l2

[1]

#!/bin/bash

# cmsrel CMSSW_13_0_5_patch1
# cd CMSSW_13_0_5_patch1/src
# cmsenv

INPUTF=/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run366497/run366497_ls0196_index000095_fu-c2b01-26-01_pid1955211.raw
[ $# -eq 0 ] || INPUTF="${1}"

rm -rf run000000
mkdir run000000

hltConfigFromDB --runNumber 366469 > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')
process.MessageLogger.cerr.FwkReport.reportEvery = 1
process.MessageLogger.cerr.enableStatistics = False
process.MessageLogger.cerr.threshold = 'INFO'

process.source.fileListMode = True
process.source.fileNames = [ "${INPUTF}" ]

process.options.SkipEvent = cms.untracked.vstring(
  'InvalidGlobalAlgBlkBxCollection',
#  'ProductNotFound',
)

#del process.DQMHistograms
@EOF

cmsRun hlt.py #&> hlt.log

[2]

----- Begin IgnoreCompletely Exception 03-May-2023 18:48:06 CEST-----------------------
An exception of category 'InvalidGlobalAlgBlkBxCollection' occurred while
   [0] Processing  Event run: 366497 lumi: 196 event: 251484500 stream: 0
   [1] Running path 'RatesMonitoring'
   [2] Calling method for module L1TRawToDigi/'hltGtStage2Digis'
Exception Message:
The GlobalAlgBlk unpacker result vector is empty, but is not receiving the first expected header ID! This may be due to corrupted, or poorly formatted events.
uGTBoard: 0
BX: -2
First expected block: 33
Received block: 37
----- End IgnoreCompletely Exception -------------------------------------------------
%MSG-e L1TriggerJSONMonitoring:   L1TriggerJSONMonitoring:hltL1TriggerJSONMonitoring 03-May-2023 18:48:06 CEST  Run: 366497 Event: 251484500
L1 trigger results with label [hltGtStage2Digis] not present or invalid
%MSG
----- Begin Fatal Exception 03-May-2023 18:48:06 CEST-----------------------
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 366497 lumi: 196 event: 251484500 stream: 0
   [1] Running path 'DQMHistograms'
   [2] Calling method for module TriggerBxMonitor/'hltTriggerBxMonitor'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: BXVector<GlobalAlgBlk>
Looking for module label: hltGtStage2Digis
Looking for productInstanceName: 

   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

----- End Fatal Exception -------------------------------------------------
%MSG-w FastMonitoringService:  PostProcessPath 03-May-2023 18:48:06 CEST  Run: 366497 Event: 251484500
 STREAM 0 earlyTermination -: ID:run: 366497 lumi: 196 event: 251484500 LS:196  FromThisContext 
%MSG

cmsbuild commented 1 year ago

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

assign core

cmsbuild commented 1 year ago

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

Is there a specific reason for these consuming modules (whose ProductNotFound exception you'd like to ignore) to be in an EndPath?

missirol commented 1 year ago

I think so.

The "consuming" modules that can throw a "ProductNotFound" in this case are instances of TriggerRatesMonitor and TriggerBxMonitor. I think they are on a EndPath in the current HLT menu in order to consume the TriggerResults of the process they are in (meaning, cms.Process("HLT")).

I'm thinking that for these DQM-oriented modules a LogError might be more appropriate than an Exception, following what we already do in a similar case, i.e. L1TriggerJSONMonitoring.cc#L244.

missirol commented 1 year ago

Just to add something which is not obvious: the EndPaths that contain instances of TriggerRatesMonitor, TriggerBxMonitor and *TriggerJSONMonitoring are not part of the "offline" HLT menus that are included in CMSSW under HLTrigger/Configuration. Those EndPaths are only part of the HLT menus that we actually run online (for example, they produce outputs used by the online DQM).

makortel commented 1 year ago

It seems like you have explored the limits of the framework's exception handling mechanisms. I should note that the behavior of SkipEvent has not been stress-tested or used in production (for a long time at least), so some unexpected behavior might occur.

Another caveat is that are the relevant modules exception safe? I mean, we typically treat (most) exceptions as fatal and shut down the framework, in which case module e.g. managing resources in non-RAII way is not too bad in practice (as the OS cleans up the memory and file handles anyway). But continuing data processing with an exception thrown from a non-exception-safe can result undesirable behavior later in the process. (I didn't check either of the modules in question here).

In the short term I'd look into making the TriggerRatestMonitor, possibly also the L1TRawToDigi, to not throw exceptions (in these cases, at least; I though HLT had a policy or desire to avoid exceptions when processing can continue?) in the first place.

An alternative could be to extend TriggerResultsFilter to handle the case where some Path is in an error state (by quick look it currently just checks if a Path was accepted), like being able to reject the Event if any Path is in an error state. Then you could use such TriggerResultsFilter in those EndPaths that you'd want to not run in case of SkipEvent exception.

fwyzard commented 1 year ago

Another caveat is that are the relevant modules exception safe?

Just to clarify: the module is not destroyed, it "just" exits from produce() when the exception is thrown. Correct ?

In the short term I'd look into making [...] L1TRawToDigi, to not throw exceptions

Do you mean to change L1TRawToDigi to - for example - issue a LogError and not produce any collection, instead of throwing an exception ? That could work, but in turn it would require to change all downstream consumers to check if the collections are actually available, and so on.

makortel commented 1 year ago

Another caveat is that are the relevant modules exception safe?

Just to clarify: the module is not destroyed, it "just" exits from produce() when the exception is thrown. Correct ?

Correct.

In the short term I'd look into making [...] L1TRawToDigi, to not throw exceptions

Do you mean to change L1TRawToDigi to - for example - issue a LogError and not produce any collection, instead of throwing an exception ? That could work, but in turn it would require to change all downstream consumers to check if the collections are actually available, and so on.

Yes, or produce an empty data product.

missirol commented 1 year ago

In the short term I'd look into making the TriggerRatestMonitor, possibly also the L1TRawToDigi, to not throw exceptions (in these cases, at least; I though HLT had a policy or desire to avoid exceptions when processing can continue?) in the first place.

For TriggerRatestMonitor, I agree (this goes along the lines of https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1533523721), and it is easy enough for me to implement.

For L1TRawToDigi, I'm less certain. Implementing changes to it likely requires more L1T-sw expertise than I have. In addition, I think that one concern is that not throwing an exception when 'something' fundamentally wrong happens in L1TRawToDigi (e.g. corrupted data) could make it more difficult to (1) realise that there is a problem online, and (2) push experts to fix the root cause of the issue (the possibility of actually using skipEvent online is only being contemplated now, after many days with hundreds of HLT crashes, and obviously it would be a temporary measure).

An alternative could be to extend TriggerResultsFilter to handle the case where some Path is in an error state

Thanks, that's an interesting suggestion (I hadn't thought about it).

missirol commented 1 year ago

Aside from the complications related to EndPaths, I think I still need to understand better what happens in cms.Paths only, when there is a module that throws a cms::Exception and one tries to handle it with options.SkipEvent.

I'm trying to understand what happens in the reproducer in [1].

There are 3 events, the first module in both Paths throws on the 2nd event, and options.SkipEvents is configured to skip events with that Exception.
I added printouts to the ::filter method of HLTPrescaler and HLTBool [2] to see if that is accessed in the 'bad' event.
The output of the reproducer is in [3]: TrigReport suggests that modules like filterAlwaysTrue1, intProducer1 and intProducer2 "Passed" for all 3 events, and printouts confirm that ::filter ran in event-2 for modules like filterAlwaysTrue1 and filterAlwaysTrue2 (while I would have expected Visited=Passed=2 for those modules).
The output ROOT file contains only 2 events, so it looks as expected (the event not present in the output file is event-2, the one with the Exception).

I tried a 2nd reproducer [4], which is like [1] except for the fact that the 2nd Path is removed (so, there is only 1 Path, and 1 EndPath). The output of the 2nd reproducer is in [5], and it is stranger.

The TrigReport info of path1 changes, although all I did was to remove path2 without touching path1: TrigReport says that the filter right after throwException, i.e. filterAlwaysTrue1, ran for all 3 events, while the ensuing modules in path1 only ran for 2.
The output ROOT file again contains only the 2 events where throwException did not throw.

I don't understand most of what happens in path1. What am I missing ?

[1] cmssw41512_testException1_cfg.py

import FWCore.ParameterSet.Config as cms

# Configuration file based on FWCore/Integration/test/testFrameworkExceptionHandling_cfg.py

badEventID = cms.untracked.EventID(1, 1, 2)

nStreams = 1
nRuns = 1
nLumisPerRun = 1
nEventsPerLumi = 3

nEventsPerRun = nLumisPerRun*nEventsPerLumi
nLumis = nRuns*nLumisPerRun
nEvents = nRuns*nEventsPerRun

process = cms.Process('TEST')

process.source = cms.Source('EmptySource',
    firstRun = cms.untracked.uint32(1),
    firstLuminosityBlock = cms.untracked.uint32(1),
    firstEvent = cms.untracked.uint32(1),
    numberEventsInLuminosityBlock = cms.untracked.uint32(nEventsPerLumi),
    numberEventsInRun = cms.untracked.uint32(nEventsPerRun),
)

process.maxEvents = cms.untracked.PSet(
    input = cms.untracked.int32(nEvents)
)

process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(nStreams),
    numberOfStreams = cms.untracked.uint32(nStreams),
    numberOfConcurrentRuns = cms.untracked.uint32(1),
    numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2)
)

process.load('FWCore.PrescaleService.PrescaleService_cfi')
process.PrescaleService.lvl1Labels = ['Col0', 'Col1']
process.PrescaleService.lvl1DefaultLabel = 'Col0'
process.PrescaleService.prescaleTable = cms.VPSet(
    cms.PSet(
        pathName = cms.string('path1'),
        prescales = cms.vuint32(1, 1)
    )
)

process.throwException = cms.EDProducer('ExceptionThrowingProducer',
    eventIDThrowOnEvent = badEventID
)

process.hltPrePath1 = cms.EDFilter( 'HLTPrescaler',
    offset = cms.uint32( 0 ),
    L1GtReadoutRecordTag = cms.InputTag( 'hltGtStage2Digis' )
)

process.filterAlwaysTrue1 = cms.EDFilter( 'HLTBool', result = cms.bool( True ))
process.filterAlwaysTrue2 = process.filterAlwaysTrue1.clone()
process.filterAlwaysTrue3 = process.filterAlwaysTrue1.clone()

process.intProducer1 = cms.EDProducer('ManyIntProducer', ivalue = cms.int32( 1 ))
process.intProducer2 = process.intProducer1.clone( ivalue = 2 )

process.path1 = cms.Path(
    process.throwException
  + process.filterAlwaysTrue1
  + process.hltPrePath1
  + process.filterAlwaysTrue2
  + process.intProducer1
  + process.filterAlwaysTrue3
)

process.path2 = cms.Path(
    process.throwException
  + process.intProducer2
)

process.theOutputModule = cms.OutputModule('PoolOutputModule',
    fileName = cms.untracked.string( 'cmssw41512_testException1.root' ),
    SelectEvents = cms.untracked.PSet( SelectEvents = cms.vstring(
        'path1',
        'path2',
    )),
    outputCommands = cms.untracked.vstring( 'keep *' )
)

process.endpath1 = cms.EndPath(
    process.theOutputModule
)

process.options.wantSummary = True
process.options.SkipEvent = cms.untracked.vstring( 'IntentionalTestException' )

[2]

diff --git a/HLTrigger/HLTcore/plugins/HLTPrescaler.cc b/HLTrigger/HLTcore/plugins/HLTPrescaler.cc
index 8841b0c3024..4c7e63e6646 100644
--- a/HLTrigger/HLTcore/plugins/HLTPrescaler.cc
+++ b/HLTrigger/HLTcore/plugins/HLTPrescaler.cc
@@ -68,6 +68,9 @@ void HLTPrescaler::beginLuminosityBlock(edm::LuminosityBlock const& lb, edm::Eve
 bool HLTPrescaler::filter(edm::Event& iEvent, const edm::EventSetup&) {
   // during the first event of a LumiSection, read from the GT the prescale index for this
   // LumiSection and get the corresponding prescale factor from the PrescaleService
+
+  edm::LogPrint("HLTPrescaler") << "HLTPrescaler -- " << iEvent.id() << ", module = " << moduleDescription().moduleLabel() << ", path = " << iEvent.moduleCallingContext()->placeInPathContext()->pathContext()->pathName();
+
   if (newLumi_) {
     newLumi_ = false;

diff --git a/HLTrigger/HLTfilters/plugins/BuildFile.xml b/HLTrigger/HLTfilters/plugins/BuildFile.xml
index cb29742c3a0..12f2cba6bc7 100644
--- a/HLTrigger/HLTfilters/plugins/BuildFile.xml
+++ b/HLTrigger/HLTfilters/plugins/BuildFile.xml
@@ -18,6 +18,7 @@
 <use name="FWCore/Framework"/>
 <use name="FWCore/MessageLogger"/>
 <use name="FWCore/ParameterSet"/>
+<use name="FWCore/ServiceRegistry"/>
 <use name="FWCore/Utilities"/>
 <use name="HLTrigger/HLTcore"/>
 <flags EDM_PLUGIN="1"/>
diff --git a/HLTrigger/HLTfilters/plugins/HLTBool.cc b/HLTrigger/HLTfilters/plugins/HLTBool.cc
index 1d229804150..dafe18de3f5 100644
--- a/HLTrigger/HLTfilters/plugins/HLTBool.cc
+++ b/HLTrigger/HLTfilters/plugins/HLTBool.cc
@@ -12,6 +12,10 @@
 #include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
 #include "FWCore/ParameterSet/interface/ParameterSetDescription.h"

+#include "FWCore/ServiceRegistry/interface/PathContext.h"
+#include "FWCore/ServiceRegistry/interface/PlaceInPathContext.h"
+#include "FWCore/ServiceRegistry/interface/ModuleCallingContext.h"
+
 //
 // constructors and destructor
 //
@@ -32,4 +36,9 @@ void HLTBool::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
 //

 // ------------ method called to produce the data  ------------
-bool HLTBool::filter(edm::StreamID, edm::Event& event, edm::EventSetup const& setup) const { return result_; }
+bool HLTBool::filter(edm::StreamID, edm::Event& event, edm::EventSetup const& setup) const {
+
+  edm::LogPrint("HLTBool") << "HLTBool -- " << event.id() << ", module = " << moduleDescription().moduleLabel() << ", path = " << event.moduleCallingContext()->placeInPathContext()->pathContext()->pathName();
+
+  return result_;
+}

[3]

Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 07-May-2023 15:13:29.405 CEST
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue1, path = path1
HLTPrescaler -- run: 1 lumi: 1 event: 1, module = hltPrePath1, path = path1
%MSG-w HLT:  HLTPrescaler:hltPrePath1 07-May-2023 15:13:29 CEST Run: 1 Event: 1
Cannot read prescale column index from GT1 data: using default as defined by configuration or DAQ
%MSG
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue2, path = path1
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue3, path = path1
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 07-May-2023 15:13:29.409 CEST
HLTBool -- run: 1 lumi: 1 event: 2, module = filterAlwaysTrue1, path = path1
HLTPrescaler -- run: 1 lumi: 1 event: 2, module = hltPrePath1, path = path1
HLTBool -- run: 1 lumi: 1 event: 2, module = filterAlwaysTrue2, path = path1
HLTBool -- run: 1 lumi: 1 event: 2, module = filterAlwaysTrue3, path = path1
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 0 at 07-May-2023 15:13:29.410 CEST
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue1, path = path1
HLTPrescaler -- run: 1 lumi: 1 event: 3, module = hltPrePath1, path = path1
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue2, path = path1
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue3, path = path1

TrigReport ---------- Event  Summary ------------
TrigReport Events total = 3 passed = 2 failed = 1

TrigReport ---------- Path   Summary ------------
TrigReport  Trig Bit#   Executed     Passed     Failed      Error Name
TrigReport     1    0          3          2          1          0 path1
TrigReport     1    1          3          2          1          0 path2

TrigReport -------End-Path   Summary ------------
TrigReport  Trig Bit#   Executed     Passed     Failed      Error Name
TrigReport     0    0          3          3          0          0 endpath1

TrigReport ---------- Modules in Path: path1 ------------
TrigReport  Trig Bit#    Visited     Passed     Failed      Error Name
TrigReport     1    0          3          2          0          1 throwException
TrigReport     1    1          3          3          0          0 filterAlwaysTrue1
TrigReport     1    2          3          3          0          0 hltPrePath1
TrigReport     1    3          3          3          0          0 filterAlwaysTrue2
TrigReport     1    4          3          3          0          0 intProducer1
TrigReport     1    5          3          3          0          0 filterAlwaysTrue3

TrigReport ---------- Modules in Path: path2 ------------
TrigReport  Trig Bit#    Visited     Passed     Failed      Error Name
TrigReport     1    0          3          2          0          1 throwException
TrigReport     1    1          3          3          0          0 intProducer2

TrigReport ------ Modules in End-Path: endpath1 ------------
TrigReport  Trig Bit#    Visited     Passed     Failed      Error Name
TrigReport     0    0          3          3          0          0 theOutputModule

TrigReport ---------- Module Summary ------------
TrigReport    Visited   Executed     Passed     Failed      Error Name
TrigReport          3          3          3          0          0 TriggerResults
TrigReport          3          3          3          0          0 endpath1
TrigReport          3          3          3          0          0 filterAlwaysTrue1
TrigReport          3          3          3          0          0 filterAlwaysTrue2
TrigReport          3          3          3          0          0 filterAlwaysTrue3
TrigReport          3          3          3          0          0 hltPrePath1
TrigReport          3          3          3          0          0 intProducer1
TrigReport          3          3          3          0          0 intProducer2
TrigReport          3          3          3          0          0 path1
TrigReport          3          3          3          0          0 path2
TrigReport          3          3          3          0          0 theOutputModule
TrigReport          6          3          2          0          1 throwException

[4] cmssw41512_testException2_cfg.py

import FWCore.ParameterSet.Config as cms

# Configuration file based on FWCore/Integration/test/testFrameworkExceptionHandling_cfg.py

badEventID = cms.untracked.EventID(1, 1, 2)

nStreams = 1
nRuns = 1
nLumisPerRun = 1
nEventsPerLumi = 3

nEventsPerRun = nLumisPerRun*nEventsPerLumi
nLumis = nRuns*nLumisPerRun
nEvents = nRuns*nEventsPerRun

process = cms.Process('TEST')

process.source = cms.Source('EmptySource',
    firstRun = cms.untracked.uint32(1),
    firstLuminosityBlock = cms.untracked.uint32(1),
    firstEvent = cms.untracked.uint32(1),
    numberEventsInLuminosityBlock = cms.untracked.uint32(nEventsPerLumi),
    numberEventsInRun = cms.untracked.uint32(nEventsPerRun),
)

process.maxEvents = cms.untracked.PSet(
    input = cms.untracked.int32(nEvents)
)

process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(nStreams),
    numberOfStreams = cms.untracked.uint32(nStreams),
    numberOfConcurrentRuns = cms.untracked.uint32(1),
    numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2)
)

process.load('FWCore.PrescaleService.PrescaleService_cfi')
process.PrescaleService.lvl1Labels = ['Col0', 'Col1']
process.PrescaleService.lvl1DefaultLabel = 'Col0'
process.PrescaleService.prescaleTable = cms.VPSet(
    cms.PSet(
        pathName = cms.string('path1'),
        prescales = cms.vuint32(1, 1)
    )
)

process.throwException = cms.EDProducer('ExceptionThrowingProducer',
    eventIDThrowOnEvent = badEventID
)

process.hltPrePath1 = cms.EDFilter( 'HLTPrescaler',
    offset = cms.uint32( 0 ),
    L1GtReadoutRecordTag = cms.InputTag( 'hltGtStage2Digis' )
)

process.filterAlwaysTrue1 = cms.EDFilter( 'HLTBool', result = cms.bool( True ))
process.filterAlwaysTrue2 = process.filterAlwaysTrue1.clone()
process.filterAlwaysTrue3 = process.filterAlwaysTrue1.clone()

process.intProducer1 = cms.EDProducer('ManyIntProducer', ivalue = cms.int32( 1 ))
process.intProducer2 = process.intProducer1.clone( ivalue = 2 )

process.path1 = cms.Path(
    process.throwException
  + process.filterAlwaysTrue1
  + process.hltPrePath1
  + process.filterAlwaysTrue2
  + process.intProducer1
  + process.filterAlwaysTrue3
)

#process.path2 = cms.Path(
#    process.throwException
#  + process.intProducer2
#)

process.theOutputModule = cms.OutputModule('PoolOutputModule',
    fileName = cms.untracked.string( 'cmssw41512_testException2.root' ),
    SelectEvents = cms.untracked.PSet( SelectEvents = cms.vstring(
        'path1',
#        'path2',
    )),
    outputCommands = cms.untracked.vstring( 'keep *' )
)

process.endpath1 = cms.EndPath(
    process.theOutputModule
)

process.options.wantSummary = True
process.options.SkipEvent = cms.untracked.vstring( 'IntentionalTestException' )

[5]

Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 07-May-2023 15:13:50.329 CEST
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue1, path = path1
HLTPrescaler -- run: 1 lumi: 1 event: 1, module = hltPrePath1, path = path1
%MSG-w HLT:  HLTPrescaler:hltPrePath1 07-May-2023 15:13:50 CEST Run: 1 Event: 1
Cannot read prescale column index from GT1 data: using default as defined by configuration or DAQ
%MSG
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue2, path = path1
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue3, path = path1
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 07-May-2023 15:13:50.331 CEST
HLTBool -- run: 1 lumi: 1 event: 2, module = filterAlwaysTrue1, path = path1
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 0 at 07-May-2023 15:13:50.331 CEST
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue1, path = path1
HLTPrescaler -- run: 1 lumi: 1 event: 3, module = hltPrePath1, path = path1
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue2, path = path1
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue3, path = path1

TrigReport ---------- Event  Summary ------------
TrigReport Events total = 3 passed = 2 failed = 1

TrigReport ---------- Path   Summary ------------
TrigReport  Trig Bit#   Executed     Passed     Failed      Error Name
TrigReport     1    0          3          2          1          0 path1

TrigReport -------End-Path   Summary ------------
TrigReport  Trig Bit#   Executed     Passed     Failed      Error Name
TrigReport     0    0          3          3          0          0 endpath1

TrigReport ---------- Modules in Path: path1 ------------
TrigReport  Trig Bit#    Visited     Passed     Failed      Error Name
TrigReport     1    0          3          2          0          1 throwException
TrigReport     1    1          3          3          0          0 filterAlwaysTrue1
TrigReport     1    2          2          2          0          0 hltPrePath1
TrigReport     1    3          2          2          0          0 filterAlwaysTrue2
TrigReport     1    4          2          2          0          0 intProducer1
TrigReport     1    5          2          2          0          0 filterAlwaysTrue3

TrigReport ------ Modules in End-Path: endpath1 ------------
TrigReport  Trig Bit#    Visited     Passed     Failed      Error Name
TrigReport     0    0          3          3          0          0 theOutputModule

TrigReport ---------- Module Summary ------------
TrigReport    Visited   Executed     Passed     Failed      Error Name
TrigReport          3          3          3          0          0 TriggerResults
TrigReport          3          3          3          0          0 endpath1
TrigReport          3          3          3          0          0 filterAlwaysTrue1
TrigReport          2          2          2          0          0 filterAlwaysTrue2
TrigReport          2          2          2          0          0 filterAlwaysTrue3
TrigReport          2          2          2          0          0 hltPrePath1
TrigReport          2          2          2          0          0 intProducer1
TrigReport          3          3          3          0          0 path1
TrigReport          3          3          3          0          0 theOutputModule
TrigReport          3          3          2          0          1 throwException

makortel commented 1 year ago

What am I missing ?

The framework schedules all the producers and analyzers between two filters (say A and B) to be run concurrently, along with the latter filter (B). If the latter filter (B) accepts the event, the framework schedules the next group of producers and analyzers until (and including) the following filter, etc. The scheduled modules are run in some order that satisfies the data dependencies of the modules.

In the case of SkipEvent being set, the framework checks if it should stop processing only when the Path schedules the next group of modules to be run. For modules that have already been scheduled, but not yet run, only the presence of an exception from an earlier module in the data dependence chain is checked (i.e. the module is not run if an earlier module in the data dependence chain threw an exception).

Therefore in your example [1,3] the following module running order is valid

filterAlwaysTrue1
hltPrePath1
filterAlwaysTrue2
intProducer1
filterAlwaysTrue3
throwException
intProducer2

In the single thread case the module execution order gets dictated by the order the modules are scheduled, as the TBB's per-thread work queue acts like a stack. The paths are processed in the reverse order (i.e. modules in path2 are scheduled before modules in path1), and also for each group of eligible modules the modules are scheduled in reverse order (i.e. in path2 the intProducer2 is scheduled before throwException). From all these it follows that in absence of data dependencies all other modules of path1 get run before the throwException.

In the second example [4,5] the removal of path2 leads to throwException being scheduled as part of path1, and the module running order is

throwException
filterAlwaysTrue1

The filterAlwaysTrue1 is run because it was scheduled before the throwException threw the exception, and because of the exception the remaining modules in path1 are not scheduled.

missirol commented 1 year ago

Thanks for the clarification, @makortel ! (much needed)

makortel commented 1 year ago

An alternative could be to extend TriggerResultsFilter to handle the case where some Path is in an error state (by quick look it currently just checks if a Path was accepted), like being able to reject the Event if any Path is in an error state. Then you could use such TriggerResultsFilter in those EndPaths that you'd want to not run in case of SkipEvent exception.

After playing a bit with a test configuration (in the context of https://github.com/cms-sw/cmssw/issues/41645) it actually seems to that a module throwing an exception with category in the SkipEvent, the corresponding "aborted" Path is not set in error state, but in "failed" (or "rejected") state, so TriggerResultsFilter should work in this case out of the box.

missirol commented 1 year ago

the corresponding "aborted" Path is not set in error state, but in "failed" (or "rejected") state, so TriggerResultsFilter should work in this case out of the box.

I think I noticed that, but somehow I still couldn't get my example with skipEvents to behave as I was expecting; I might have to take a second look.

For the sake of documenting, I wrote in CMSHLT-2793 what I tried in the context of the problem that triggered this issue. In this particular case, I think one possibility would have been to use options.IgnoreCompletely: using that plus a few filters I think I was able to redirect the bad events to a separate output module, and to have them rejected by all other Paths (standard triggers).

Luckily, a recent update of the L1T menu seems to have fixed the root cause of the problem, and we haven't seen HLT crashes online in the last few days.

makortel commented 1 year ago

Following up on the "conceptual design" in https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1554925070 with the next, more concrete iteration of the plan

We will remove SkipEvent and FailPath and replace them with TryToContinue parameter, that behaves as described below
- We think the current behaviors of SkipEvent and FailPath, or behaviors more or less implied by their names, are not the most useful one
- The process.options.TryToContinue configuration parameter would be an untracked.vstring for exception categories
When a module A throws an exception whose category is in TryToContinue from its Event function (i.e. analyze/produce/filter/acquire)
- Any module depending on A's data products (directly or transitively) would not be run, unless explicitly requested (more on this further below)
- Each Path containing the module A, or any module depending on A, would be end up in "error" state, and no further modules in the Path would be scheduled to run. Modules that have already been scheduled and are independent of A would still be run.
- The failure of the module and Paths are recorded in TriggerResults with an "error" state
- Modules and Paths independent of A would be scheduled and run as usual
- The thrown exception would be available for any consuming module (that has been explicitly requested to be run if such exception has been thrown) via edm::Handle::whyFailed().
We add a new event data product (say ExceptionInfo) where we would store the labels of the module(s) that threw TryToContinue exceptions, and the exception category (behaviorally somewhat similar to TriggerResults)
- The framework produces ExceptionInfo after all modules that are not decorated with cms.catch() (see below) have been run (or known not to run because of the event filtering). In case of no exceptions were thrown, an "all ok" data product would be produced.
- The ExceptionInfo can be persisted
- We may later extend ExceptionInfo to contain partial or full exception message if that turns out be useful
Impact to module scheduling
- Modules that should be scheduled also when another module threw an exception would need to be decorated with cms.catch() when they are placed in EndPath or Task
  - It follows that modules that consume ExceptionInfo must be decorated with cms.catch()
  - This will be enforced by the framework. If violated the job terminates with an exception.
  - Placing cms.catch()-decorated modules into Path (either directly or via Sequence) is disallowed. If attempted the job terminates with an exception.
  - In case a cms.catch()-decorated module is put in more than one Sequence/Task/EndPath, and not all of them have the cms.catch() decoration, all such occurrences of the module are treated as if it were decorated with cms.catch() everywhere.
  - All exceptions from the Event function of cms.catch()-decorated module are ignored
  - We could allow a specific exception type to be thrown which the framework treats as Rethrow
  - We could allow specific exceptions (like std::bad_alloc) to be rethrown as well
- We would create a generic EDFilter (say ExceptionFilter) that would consume the ExceptionInfo, and accept the Event only if any TryToContinue exception was thrown
Behavior of OutputModules follows from the description above; some examples below
- If an OutputModule depends on A, it will not run for the exception Event, except if
- The OutputModule is decorated with cms.catch(), in which case it will run for all Events, except if
  - The OutputModule is in an EndPath after the ExceptionFilter, in which case it will run only for the exception Event
- If an OutputModule does not depend on A, it will run for all Events (including exception Event), unless
- The OutputModule is in an EndPath after the ExceptionFilter, in which case it will run only for the exception Event
- All Paths in the OutputModule's SelectEvents are in non-accept state (either fail or error), in which case it will not run for the Event
Every time an exception is ignored (with IgnoreCompletely or with TryToContinue), a log message is printed containing the exception message
We might want to log the ignored exceptions also in the framework job report
If any exception is thrown from Event processing whose category is not listed in TryToContinue, the job is terminated
Exceptions from other parts than module Event functions (e.g. Runs, Lumis, EventSetup, Source, OutputModule) would continue to result job termination

All the new names are preliminary (suggestions for better names are welcome).

Dr15Jones commented 1 year ago

PR #42441 implements the TryToContinue although changes were made to the behavior based on the initial attempt to write the code. The changes are

cms.catch was not used, instead one marks a module by calling the python method shouldTryToContinue(). This was done as we ultimately determined that the property we were trying to create is really a property of a module and NOT the property of how a module appears in a given Path.
We also found that modules marked with shouldTryToContinue() can be placed on Paths without a problem.
We also changed it so that if a module marked with shouldTryToContinue() throws an exception, the exception is only ignored if the exception is in the TryToContinue list. This avoids accidentally hiding infrastructure failures such as a std::bad_alloc exception.
the extra data product ExceptionInfo was not created. Trying to figure out exactly how to schedule the creation of the ExceptionInfo turned out to be extremely difficult (lots of potential circular dependencies could occur). It also turned out that the SelectEvents of an OutputModule already supports filtering on Paths which have seen an exception, via exception@<pathname>. When used in conjunction with TryToContinue one can trigger an OutputModule to be run when an exception happens and still continue Event processing. Such abilities should provide a mechanism for the HLT to create an error stream and keep known problems from stopping a job.

fwyzard commented 1 year ago

@Dr15Jones @makortel I found the initial description of the behaviour of the output modules a bit confusing.

In particular what does it mean that an output module depends on a module that throws an exception ? Does it mean that it has a matching keep statement ? Does it mean that it has a SelectEvents entry for a Path that includes that module ?

Could you summarise how the output modules should behave, also in view of later changes?

Dr15Jones commented 1 year ago

In particular what does it mean that an output module depends on a module that throws an exception ? Does it mean that it has a matching keep statement ?

Yes. It also means any modules needed by those modules which created the data products to 'keep'.

Does it mean that it has a SelectEvents entry for a Path that includes that module ?

Not exactly. A Path which sees an exception (either because a module on the Path throws the exception OR an unscheduled module needed by the module on the Path throws an exception) will be marked as having an 'error' status. The OutputModule will only run if at least one of its Paths doesn't see the exception and the Path succeeds as normal.

To illustrate, I've put together a small program using some dummy test modules. (NOTE: I modified AddIntsProducer so that it would ignore any data products which are missing from the Event).

The following is a trivialized representation of the HLT.

It only have 'Track' and 'Calo' related paths.
There is a 'globalTrigger' data product which records to output of the Filters at the end of the two paths. This is marked as shouldTryToContinue() so that even if one of the filters dependent on a module that fails, the globalTrigger will still run.
The outputs consist of
- one dependent only on trackPath
- one dependent only on caloPath
- one dependent on either trackPath or caloPath
- one that doesn't care about Paths and only cares about globalTrigger (to show how that work with exceptions)
- one that only runs if there was an exception
The 'Track' path has an exception happen early on as part of the 'track hits'.
```
import FWCore.ParameterSet.Config as cms
```

process = cms.Process("TEST")

process.source = cms.Source("EmptySource")

process.maxEvents.input = 3

this is the type thrown by FailingProducer

process.options.TryToContinue = ["NotFound"] process.options.wantSummary = True

process.trackingHits = cms.EDProducer("FailingProducer") process.tracks = cms.EDProducer("AddIntsProducer", labels = cms.VInputTag("trackingHits")) process.trackFilter = cms.EDFilter("IntProductFilter", label = cms.InputTag("tracks"), threshold = cms.int32(0), shouldProduce = cms.bool(True) )

process.caloClusters = cms.EDProducer("IntProducer", ivalue = cms.int32(1)) process.caloFilter = cms.EDFilter("IntProductFilter", label = cms.InputTag("caloClusters"), threshold = cms.int32(0), shouldProduce = cms.bool(True) )

process.globalTrigger = cms.EDProducer("AddIntsProducer", labels = cms.VInputTag("trackFilter","caloFilter")) process.globalTrigger.shouldTryToContinue()

process.trackPath = cms.Path(process.trackingHits+process.tracks+process.trackFilter) process.caloPath = cms.Path(process.caloClusters+process.caloFilter) process.globalTriggerPath = cms.Path(process.globalTrigger)

outputTemplate = cms.OutputModule("AsciiOutputModule", outputCommands = cms.untracked.vstring("drop *", "keep edmTriggerResults__*"), SelectEvents = cms.untracked.PSet(SelectEvents = cms.vstring()))

process.trackOut = outputTemplate.clone(SelectEvents = dict(SelectEvents=["trackPath"])) process.caloOut = outputTemplate.clone(SelectEvents = dict(SelectEvents=["caloPath"])) process.trackAndCaloOut = outputTemplate.clone(SelectEvents=dict(SelectEvents=["trackPath","caloPath"])) process.globalTriggerOut = outputTemplate.clone(outputCommands = ["drop ", "keep edmTriggerResults_*","keep _globalTrigger_TEST"]) process.exceptionOut = outputTemplate.clone(SelectEvents=dict(SelectEvents=["exception@*"]))

process.out = cms.EndPath(process.trackOut+process.caloOut+process.trackAndCaloOut+process.exceptionOut+process.globalTriggerOut)


when run, all the OutputModules write the 3 events except for the 'trackOut' which writes no events as the only path it depends upon never succeeds (i.e. trackPath is set to the error state for each Event).

From the summary we see

TrigReport ---------- Path Summary ------------ TrigReport Trig Bit# Executed Passed Failed Error Name TrigReport 1 0 3 0 0 3 trackPath TrigReport 1 1 3 3 0 0 caloPath TrigReport 1 2 3 3 0 0 globalTriggerPath

TrigReport ---------- Module Summary ------------ TrigReport Visited Executed Passed Failed Error Name TrigReport 3 3 3 0 0 TriggerResults TrigReport 3 3 3 0 0 caloClusters TrigReport 3 3 3 0 0 caloFilter TrigReport 3 3 3 0 0 caloOut TrigReport 3 3 3 0 0 caloPath TrigReport 3 3 3 0 0 exceptionOut TrigReport 3 3 3 0 0 globalTrigger TrigReport 3 3 3 0 0 globalTriggerOut TrigReport 3 3 3 0 0 globalTriggerPath TrigReport 3 3 3 0 0 out TrigReport 3 3 3 0 0 trackAndCaloOut TrigReport 3 0 0 0 3 trackFilter TrigReport 3 3 3 0 0 trackOut TrigReport 3 3 3 0 0 trackPath TrigReport 3 3 0 0 3 trackingHits TrigReport 3 0 0 0 3 tracks

fwyzard commented 1 year ago

hi Chris, thanks, I think the example clarifies better the expected behaviour, at least for me.

Dr15Jones commented 1 year ago

@fwyzard @missirol Matti and I chatted today and we are not convinced that the new shouldTryToContinue() python method makes for the best API. A couple concerns are making it work with dumpPython as well as what would happen if the module is cloned?

Given the shouldTryToContinue is related to how a job responds to an exception (and not to how the algorithm for a module should be run) we thought that specifying the list of modules in the process.options might be better. This matches how one specifies the delete early of data products. The addition to the options block would be a untracked.vstring() named something like modulesThatShouldTryToContinue. Alternatively, we could change the options block parameter TryToContinue to be a untracked.PSet with the parameters exceptions and modulesToContinue. The down side of the latter makes it very different from the other exception handling specifiers.

We were hoping to get your thoughts on either the shouldTryToContinue method to the Modules or the extension of the options block.

fwyzard commented 1 year ago

Let me add explicitly @Martin-Grunewald and @mmusich to the discussion.

Dr15Jones commented 1 year ago

ping @Martin-Grunewald @mmusich @missirol

missirol commented 1 year ago

Hi @Dr15Jones , thanks for these updates. I only have some naive feedback after reading the info above without testing #42441.

Fwiw, I agree with

specifying the list of modules in the process.options might be better.

One reason is that this makes it easy to control this in ConfDB (where we can already define the options PSet as needed, while I would not know how to handle .shouldTryToContinue()).

The addition to the options block would be a untracked.vstring() named something like modulesThatShouldTryToContinue. Alternatively, we could change the options block parameter TryToContinue to be a untracked.PSet with the parameters exceptions and modulesToContinue. The down side of the latter makes it very different from the other exception handling specifiers.

Regarding how to extend the options PSet, I don't see particular constraints from the HLT side, so I would leave this to Core-Sw to decide.

makortel commented 1 year ago

Regarding how to extend the options PSet, I don't see particular constraints from the HLT side, so I would leave this to Core-Sw to decide.

I'd actually be interested to learn how the process.options is configured for the HLT processes at online.

missirol commented 1 year ago

I'd actually be interested to learn how the process.options is configured for the HLT processes at online.

(Here's just what I think I know. Andrea or others can comment better.)

By default, the HLT menus in ConfDB do not have an options PSet. Online, a ConfDB menu is converted into a python config (I assume with hltConfigFromDB, or something very close to it), and DAQ appends to it a "patch" which originates from this template, where some of the parameters of process.options are changed (other than this, the defaults of options are used).

If necessary, it is possible to define a global PSet named options in a ConfDB configuration, where one would just specify the options parameters with non-default values [*] (but again, this is currently not done for any menu used in production).

[*] One example is

hltConfigFromDB --configName --adg /cdaq/test/missirol/test/2023/week18/CMSLITOPS_411/Test02/HLT/V5 > foo.py

which was used to test the use of options.IgnoreCompletely in the context of CMSHLT-2793.

Another example is

hltConfigFromDB --runNumber 355207 > bar.py # menu: /cdaq/physics/firstCollisions22/v4.0/HLT/V1

in which GPU offloading was disabled online by defining the following PSet in ConfDB

options = cms.untracked.PSet(  accelerators = cms.untracked.vstring( 'cpu' ) )

missirol commented 1 year ago

Trying to give more concrete feedback on #42441.

I took that PR, specifically ab11897d..106c6518 (used on top of 13_3_0_pre1), and tried to use it with a realistic HLT menu running on the problematic data that started this discussion (events where a cms::Exception is thrown by the L1T unpacker). This test is in [1]: the input file contains 100 events, one of which throws an exception of type InvalidGlobalAlgBlkBxCollection from the module hltGtStage2Digis (plugin type: L1TRawToDigi), and the script contains four cmsRun jobs ("hlt1", "hlt2", "hlt3" and "hlt4"); all jobs use the same (100) input events. The cmsRun logs of the four jobs are the following: hlt1.log, hlt2.log, hlt3.log and hlt4.log.

"hlt1" reproduces the issue (job stopped by exception).
"hlt2" includes "hlt1", and adds options.TryToContinue in order to skip the event throwing the exception.
- This works as expected (I think): the job completes, and the Paths containing the problematic module go in error state for the one problematic event.
"hlt3" includes "hlt2", and adds a new OutputModule meant to collect the event that throw the exception. "hlt3" tries to do so respecting the rules used to build OutputModules in the current HLT menus, rules which are enforced by the ConfDB GUI [2].
- In "hlt3", the trigger HLT_PathStatusError_v1 is meant to accept all events where at least one of the other triggers went into error state. HLT_PathStatusError_v1 is assigned to a Primary Dataset called HLTError, which is in turn assigned to a "stream" called HLTError. To implement HLT_PathStatusError_v1, I had to introduce a new EDFilter to check for the presence of Paths in error state (HLTPathStatusErrorFilter in https://github.com/missirol/cmssw/commit/b67f107964f8bd7710914ea36c007bd03b65ccca, heavily based on TriggerResultsFilter).
- "hlt3" does not work as intended, because the presence of the failing module (i.e. hltGtStage2Digis) inside the Paths HLT_PathStatusError_v1 and Dataset_HLTError makes those Paths themselves go in error state for the problematic event. As a result, the problematic event is not written out by hltOutputHLTError. This follows the description in [3], and it is a complication specific to the fact that the failing module is hltGtStage2Digis: the latter module must be included in those Paths in order to know the index of the prescale column to be used by the instance of HLTPrescaler in each of those Paths (in principle, hltGtStage2Digis could be removed from HLT_PathStatusError_v1, but the GUI would not allow to remove it from Dataset_HLTError).
- In addition, in "hlt3" some of the usual "TrigReport" messages are missing. This seems to be true for most of the Paths that end up in 'error' state in the problematic event. For example, the log of "hlt3" does not include lines like "Modules in Path: HLT_PathStatusError_v1 (also, the lines around Modules in Path: Dataset_EphemeralZeroBias13 look incorrect). Not sure how to interpret this.
"hlt4" is like "hlt3", except that modulesToCallForTryToContinue is set to an empty list. In this case, the output of cmsRun does include all the "TrigReport" messages (as far as I checked), e.g. lines like "Modules in Path: HLT_PathStatusError_v1".

Tentative summary.

It was checked that options.TryToContinue works as expected in an example based on a realistic HLT menu (except maybe what happens with the "TrigReport" messages, see comments on "hlt3").
The specific case of crashes coming from hltGtStage2Digis is particularly tricky to handle, given where this module needs to be used in the HLT menus. I did not find a way to send the problematic events to a dedicated dataset/stream using TryToContinue while complying with the structure of the HLT menus. I should note that CMSHLT-2793 (comment) showed a way to handle this particular problem using IgnoreCompletely (as opposed to TryToContinue), so CMSSW already allows to solve this at configuration level, and maybe TryToContinue (or the old SkipEvent) is just not the best choice for this specific type of crashes.

A few questions.

Is [3] necessary ? If the Path contains module A throwing exception B, and (1) B is in options.TryToContinue and (2) all the modules in the Path that depend on A are in options.modulesToCallForTryToContinue, is it necessary that the state of the Path is marked as 'error' ?
Are the "TrigReport" messages in "hlt3" and "hlt4" as expected ?
(minor) Do TryToContinue and/or modulesToCallForTryToContinue accept wildcards ? (if not, would that be possible and useful ?)

If you have suggestions/corrections, please let me know.

[1]

#!/bin/bash -ex

# cmsrel CMSSW_13_3_0_pre1
# cd CMSSW_13_3_0_pre1/src
# cmsenv
# git checkout -b test_cmssw42441
# git cms-merge-topic cms-sw:42441
# git cms-remote add missirol
# git cms-addpkg HLTrigger/HLTfilters
# git cp b67f107964f
# scram b -j 16

# run 366497, LS 196
INPUTFILE=root://eoscms.cern.ch//store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run366497/run366497_ls0196_index000095_fu-c2b01-26-01_pid1955211.root

RUNNUM=366497

run_test() {
  rm -rf run"${RUNNUM}"*
  convertToRaw -f 100 -l 100 -r "${RUNNUM}":196 -o . -- "${INPUTFILE}"
  edmConfigDump "${tmpfile}" > "${1}".py
  cmsRun "${1}".py &> "${1}".log || true
}

tmpfile=$(mktemp)
hltConfigFromDB --runNumber "${RUNNUM}" > "${tmpfile}"
cat <<@EOF >> "${tmpfile}"

process.load('run${RUNNUM}_cff')

# customisations to adapt 13_0_X HLT menus to CMSSW_13_3_0_pre1
from HLTrigger.Configuration.customizeHLTforCMSSW import customizeHLTforCMSSW
process = customizeHLTforCMSSW(process)

# remove prescales, and set GlobalTag
del process.PrescaleService
process.GlobalTag.globaltag = '130X_dataRun3_HLT_v2'

# show statistics on decisions of modules and Paths
process.options.wantSummary = True

# number of threads/streams used online by HLT
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF

# hlt1: reproduce online crash seen in run 366497 (LS 196)
jobLabel="hlt1"
run_test "${jobLabel}"

# hlt2: hlt1 + "try to continue" upon exceptions of type 'InvalidGlobalAlgBlkBxCollection'
jobLabel="hlt2"
cat <<@EOF >> "${tmpfile}"

process.options.TryToContinue = cms.untracked.vstring( 'InvalidGlobalAlgBlkBxCollection' )
@EOF
run_test "${jobLabel}"

# hlt3: hlt2 + send events to a separate PrimaryDataset and stream
jobLabel="hlt3"
cat <<@EOF >> "${tmpfile}"

# New Path to select events in which at least one other Path went in Error state
process.HLTBeginSequenceAny = cms.Sequence( process.hltGtStage2Digis )

process.hltPrePathStatusError = cms.EDFilter( "HLTPrescaler",
    offset = cms.uint32( 0 ),
    L1GtReadoutRecordTag = cms.InputTag( "hltGtStage2Digis" )
)

process.hltPathStatusErrorFilter = cms.EDFilter( "HLTPathStatusErrorFilter",
  ignoreInvalidPathNames = cms.bool(False),
  # consider all Paths in the configuration ..
  pathNames = cms.vstring( '*' ),
  # .. except for the Path holding this module, as well as DatasetPaths
  pathNamesToSkip = cms.vstring( 'HLT_PathStatusError_v*', 'Dataset_*' ),
)

process.HLT_PathStatusError_v1 = cms.Path(
    process.HLTBeginSequenceAny
  + process.hltPrePathStatusError
  + process.hltPathStatusErrorFilter
  + process.HLTEndSequence
)

# "DatasetPath": Path using TriggerResultsFilter to select on other Paths
process.hltPreDatasetHLTError = cms.EDFilter( "HLTPrescaler",
    offset = cms.uint32( 0 ),
    L1GtReadoutRecordTag = cms.InputTag( "hltGtStage2Digis" )
)

process.hltDatasetHLTError = cms.EDFilter( "TriggerResultsFilter",
    usePathStatus = cms.bool( True ),
    hltResults = cms.InputTag( "" ),
    l1tResults = cms.InputTag( "" ),
    l1tIgnoreMaskAndPrescale = cms.bool( False ),
    throw = cms.bool( True ),
    triggerConditions = cms.vstring( 'HLT_PathStatusError_v1' )
)

process.Dataset_HLTError = cms.Path(
    process.HLTDatasetPathBeginSequence
  + process.hltDatasetHLTError
  + process.hltPreDatasetHLTError
)

# "StreamPath": FinalPath with OutputModule selecting on DatasetPath
process.hltOutputHLTError = cms.OutputModule("GlobalEvFOutputModule",
    SelectEvents = cms.untracked.PSet(
        SelectEvents = cms.vstring(
            'Dataset_HLTError'
        )
    ),
    compression_algorithm = cms.untracked.string('ZSTD'),
    compression_level = cms.untracked.int32(3),
    lumiSection_interval = cms.untracked.int32(0),
    outputCommands = cms.untracked.vstring(
        'drop *',
        'keep FEDRawDataCollection_rawDataCollector_*_*',
        'keep FEDRawDataCollection_source_*_*',
        'keep GlobalObjectMapRecord_hltGtStage2ObjectMap_*_*',
        'keep edmTriggerResults_*_*_*',
        'keep triggerTriggerEvent_*_*_*'
    ),
    psetMap = cms.untracked.InputTag("hltPSetMap"),
    use_compression = cms.untracked.bool(True)
)

process.HLTErrorOutput = cms.FinalPath( process.hltOutputHLTError )

# update cms.Schedule adding new Path, DatasetPath, and StreamPath
process.schedule.extend([
    process.HLT_PathStatusError_v1,
    process.Dataset_HLTError,
    process.HLTErrorOutput
])

# update the global PSets "datasets" and "streams"
# (just to mimic the db->python converter of ConfDB)
process.datasets.HLTError = cms.vstring( 'HLT_PathStatusError_v1' )
process.streams.HLTError = cms.vstring( 'HLTError' )

# prevent the Paths HLT_PathStatusError_v1 and Dataset_HLTError
# from going into Error state themselves
process.options.modulesToCallForTryToContinue = cms.untracked.vstring(
    'hltPrePathStatusError',
    'hltPreDatasetHLTError'
)
@EOF
run_test "${jobLabel}"

# hlt4: hlt3 but with an empty modulesToCallForTryToContinue
jobLabel="hlt4"
cat <<@EOF >> "${tmpfile}"

process.options.modulesToCallForTryToContinue = cms.untracked.vstring()
@EOF
run_test "${jobLabel}"

[2] In the current HLT menus, every OutputModule is associated to one (and only one) so-called "stream" (a "stream" is a cms.FinalPath). Every "stream" (or rather, its OutputModule) selects the events accepted by the logical OR of a list of so-called "DatasetPaths" (a "DatasetPath" is a cms.Path; "DatasetPaths" represent the Primary Datasets used offline; every "DatasetPath" must be assigned to one, and only one, "stream"). Every "DatasetPath" uses the plugin TriggerResultsFilter to select events passing the logical OR of a set of triggers (a trigger is a cms.Path). This structure is apparent looking, for example, at this HLT config. This means that, in order to have an output file with a certain set of events, one must define (as a minimum) one "stream", a "DatasetPath" assigned to said "stream", and a trigger assigned to that "DatasetPath" and accepting the events of interest. Conversely, configuring an OutputModule to have SelectEvents = ["exception@*"] like in https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1664127254 is not currently possible in HLT menus built with the ConfDB GUI.

[3] (from https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1664127254)

A Path which sees an exception (either because a module on the Path throws the exception OR an unscheduled module needed by the module on the Path throws an exception) will be marked as having an 'error' status.

Dr15Jones commented 1 year ago

@missirol thanks very much for your very thorough report and attempting to use the TryToContinue mechanism.

Evidently we haven't made well known that the SelectEvents specification of OutputModules has supported from the very beginning the ability to trigger on paths that have an error. This is done by adding exception@ followed by the path name or *. Then if the path is in the error state, the OutputModule will be run.

Could you try an additional case where you get rid of your error path and try using @exception instead?

missirol commented 1 year ago

Could you try an additional case where you get rid of your error path and try using @exception instead?

Sure, I tried this in [1] ("hlt5"). The job completes, and it produces (among other things) a streamer file named

run366497/run366497_ls0196_streamHLTError_pid3075644.dat

which contains the one event where the exception is thrown (checked after repacking it). So, it looks to me like it's working. The corresponding cmsRun log is attached, hlt5.log.

[1]

#!/bin/bash -ex

# cmsrel CMSSW_13_3_0_pre1
# cd CMSSW_13_3_0_pre1/src
# cmsenv
# git checkout -b test_cmssw42441
# git cms-merge-topic cms-sw:42441
# git cms-remote add missirol
# git cms-addpkg HLTrigger/HLTfilters
# git cp b67f107964f
# scram b -j 16

# run 366497, LS 196
INPUTFILE=root://eoscms.cern.ch//store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run366497/run366497_ls0196_index000095_fu-c2b01-26-01_pid1955211.root

RUNNUM=366497

run_test() {
  rm -rf run"${RUNNUM}"*
  convertToRaw -f 100 -l 100 -r "${RUNNUM}":196 -o . -- "${INPUTFILE}"
  edmConfigDump "${tmpfile}" > "${1}".py
  cmsRun "${1}".py &> "${1}".log || true
}

tmpfile=$(mktemp)
hltConfigFromDB --runNumber "${RUNNUM}" > "${tmpfile}"
cat <<@EOF >> "${tmpfile}"

process.load('run${RUNNUM}_cff')

# customisations to adapt 13_0_X HLT menus to CMSSW_13_3_0_pre1
from HLTrigger.Configuration.customizeHLTforCMSSW import customizeHLTforCMSSW
process = customizeHLTforCMSSW(process)

# remove prescales, and set GlobalTag
del process.PrescaleService
process.GlobalTag.globaltag = '130X_dataRun3_HLT_v2'

# show statistics on decisions of modules and Paths
process.options.wantSummary = True

# number of threads/streams used online by HLT
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF

# hlt5: send events with at least one Path in error state to a separate output file, using the syntax "exception@*"
jobLabel="hlt5"
cat <<@EOF >> "${tmpfile}"

process.options.TryToContinue = cms.untracked.vstring( 'InvalidGlobalAlgBlkBxCollection' )

# "StreamPath": FinalPath with OutputModule selecting on DatasetPath
process.hltOutputHLTError = cms.OutputModule("GlobalEvFOutputModule",
    SelectEvents = cms.untracked.PSet(
        SelectEvents = cms.vstring(
            'exception@*'
        )
    ),
    compression_algorithm = cms.untracked.string('ZSTD'),
    compression_level = cms.untracked.int32(3),
    lumiSection_interval = cms.untracked.int32(0),
    outputCommands = cms.untracked.vstring(
        'drop *',
        'keep FEDRawDataCollection_rawDataCollector_*_*',
        'keep FEDRawDataCollection_source_*_*',
        'keep GlobalObjectMapRecord_hltGtStage2ObjectMap_*_*',
        'keep edmTriggerResults_*_*_*',
        'keep triggerTriggerEvent_*_*_*'
    ),
    psetMap = cms.untracked.InputTag("hltPSetMap"),
    use_compression = cms.untracked.bool(True)
)

process.HLTErrorOutput = cms.FinalPath( process.hltOutputHLTError )

# update cms.Schedule adding (Final)Path with new OutputModule
process.schedule.append( process.HLTErrorOutput )
@EOF
run_test "${jobLabel}"

Dr15Jones commented 1 year ago

Sure, I tried this in [1] ("hlt5"). The job completes, and it produces (among other things) a streamer file named

@missirol Thanks! [I am trying to build an area based on your recipe from earlier but compilation will take a long time :) ]

Is the TrigReport for this case also not what you expect? If so, could you explain what you did expect?

Dr15Jones commented 1 year ago

@missirol to better understand the TrigReport, does hltDatasetEphemeralZeroBias depend on the data product from hltGtStage2Digis ?

missirol commented 1 year ago

to better understand the TrigReport, does hltDatasetEphemeralZeroBias depend on the data product from hltGtStage2Digis ?

No, the module hltDatasetEphemeralZeroBias does not depend directly on hltGtStage2Digis. (here one can get an idea of the dependencies, those modules did not change significantly in the recent past).

Is the TrigReport for this case also not what you expect? ["hlt5", editor's note]

Yes, with the caveat that I'm not doing any detailed checks, but rather just quickly scanning thousands of lines by eye looking for anything odd.

If so, could you explain what you did expect?

I'm mostly looking at lines that contain TrigReport ---------- Modules in Path. I would expect to have one such line for every Path in the configuration. It looks like this is the case for all tests except for "hlt3.log" in https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1676475766. The only difference between "hlt3" and "hlt4" is that in the former the list modulesToCallForTryToContinue is not empty. "hlt3.log" contains lines like [1], which just seem messed up. I know I have seen this before, but unfortunately I do not remember well when. It might have been when I was playing around with SkipEvent, but I'm not sure at all.

[1]

TrigReport     1   27          0          0          0          0 hltSelectorCentralJets20L1FastJeta
TrigReport     1   43          0          0          0          0 hltFastPVJetVertexChecker
TrigReport     1   58          0          0          0          0 hltSelector8CentralJetsL1FastJet
TrigReport     1   74          0          0          0          0 hltIter1PFlowPixelHitDoubletsForDisplaced
TrigReport     1   89          0          0          0          0 hltIter2PFlowPixelClusterCheckForDisplaced
TrigReport ---------- Modules in Path: HLT_L1Tau_DelayedJet40_SingleDelay3p5nsInclusive_v2 ------------
TrigReport     1   38          2          2          0          0 hltHbhereco
TrigReport     1   11          2          2          0          0 hltAK4CaloJets
TrigReport     1   25          2          0          2          0 hltCaloJetTimingFilterSingleTau3p5ns
TrigReport  Trig Bit#    Visited     Passed     Failed      Error Name
TrigReport     1  132          0          0          0          0 hltHbhereco
TrigReport     1   12          0          0          0          0 hltAK4CaloJetsIDPassed
TrigReport     1   27          0          0          0          0 hltSelectorCentralJets20L1FastJeta
TrigReport     1   40          0          0          0          0 hltFastPVPixelTracksHitQuadruplets
TrigReport     1   56          0          0          0          0 hltSelectorJets30L1FastJet
TrigReport     1   70          0          0          0          0 hltIter1MaskedMeasurementTrackerEventForDisplaced
TrigReport     1   85          0          0          0          0 hltIter2ClustersRefRemovalForDisplaced
TrigReport     1  100          0          0          0          0 hltL4PromptDisplacedDijetFullTracksJetTagProducerFromIPLow30Pt1PtrkShortSig5
TrigReport     1  115          0          0          0          0 hltIter4MergedWithIter012DisplacedJets

missirol commented 1 year ago

If so, could you explain what you did expect?

Another tiny example below. I think all are correct except for "hlt3", which is clearly off.

> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt1.log | wc -l
89

> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt2.log | wc -l
89

> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt3.log | wc -l
2

> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt4.log | wc -l
90

> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt5.log | wc -l
89

Dr15Jones commented 1 year ago

So for the weirdness in hlt3.log, the only way I can even beginning to think how that printout could happen is if multiple threads were simultaneously running endJob. If that is true, it would be very BAD.

I'll definitely look into.

Be that as it may, does the configuration in hlt5 suit your needs?

Dr15Jones commented 1 year ago

On a quick look, another thing different between hlt3.log and hlt5.log is the number of message logger messages that were skipped. For hlt3.log it was around 57,000 while hlt5.log only 150. It is possible that endJob is NOT running concurrently, but instead another task is running concurrently with endJob and that task is output a lot of stuff to the message logger. If that were the case, then the hlt3.log could have some of its messages dropped in the TrigReport and that is what causes the weird looking output.

Maybe adding the Tracer service to the job might uncover the behavior.

missirol commented 1 year ago

Be that as it may, does the configuration in hlt5 suit your needs?

Not fully, in the sense that "hlt5" uses exception@* in SelectEvents and that syntax cannot be used (as of now) in the HLT menus we use in production, as those follow a rather stiff structure when it comes to how triggers, datasets and output modules are defined (this is what I briefly described in note [2] of https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1676475766).

As I was trying to explain in https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1676475766, "hlt3" follows the HLT "rules", it uses TryToContinue, and it is likely a good solution for skipping "excepted" events and sending them to a dedicated stream for most types of exceptions , just not those coming from hltGtStage2Digis. The latter case is more complicated, because hltGtStage2Digis must be inside every so-called "DatasetPath", so an exception in hltGtStage2Digis gets every "DatasetPath" in error state.

Dr15Jones commented 1 year ago

@missirol what are the odds that the ConfDB limitations could be lifted? Seems like those might be more of a hinderance than a help for dealing with special cases.

missirol commented 1 year ago

In general, bad odds (as usual with ConfDB), but I'm not sure this particular case comes down to just a limitation of ConfDB.

The GUI and the db can often be limiting factors when it comes to supporting new features of CMSSW inside HLT menus, but here ConfDB is simply enforcing the long-standing paradigm of having triggers assigned to datasets, and datasets assigned to streams. Having an OutputModule with SelectEvents = ["exception@*"] (a stream of events without specific datasets assigned to it) would be an.. exception to this paradigm. Maybe it's not impossible, but it's something that requires careful consideration (the actual HLT experts are free to comment, of course).

My current take is that what the framework will provide after #42441 is sufficient for HLT.

#42441 basically fixes the old SkipEvent, and TryToContinue is what HLT could use to skip the vast majority of cms::Exceptions (pretty much any exception that does not come from L1TRawToDigi, TriggerResultsFilter or HLTPrescaler; I think exceptions from the last two are truly unlikely, if even possible).
For the case of crashes from L1TRawToDigi, I think they could be circumvented in a different way (e.g. IgnoreCompletely).

Let me add that, based on discussions had in TSG, the likelihood of HLT deciding to bypass exceptions in production is very low. Skipping events at HLT is seen as an extreme measure, because doing so removes pressure from finding the actual solution to the problem (where the problem is usually outside of HLT, like this year in the case of corrupted data from L1T causing crashes when unpacked at HLT).

Dr15Jones commented 1 year ago

Let me add that, based on discussions had in TSG, the likelihood of HLT deciding to bypass exceptions in production is very low. Skipping events at HLT is seen as an extreme measure, because doing so removes pressure from finding the actual solution to the problem (where the problem is usually outside of HLT, like this year in the case of corrupted data from L1T causing crashes when unpacked at HLT).

From internal discussions with the framework team, we also see any use of TryToContinue or IgnoreCompletely as a short-time use until the actually underly problem can be fixed item and not something that should be used for standard workflows long term.

Dr15Jones commented 1 year ago

I think I probably determined why the TrigReport had a problem. It looks like messages are being dropped while the summary is being printed. I found that if I turned on INFO reporting, then FastMonitoringService is periodically printing INFO messages during endJob.

It looks like some times the FastMonitoringService printing might be slowed down (e.g. operating system temporarily halting that thread) and then since that thread becomes temporarily in-charge of printing all MessageLogger messages, the huge flood of messages from the summary overwhelms the queue and it starts to drop messages.

The message appears to be coming from here

https://github.com/cms-sw/cmssw/blob/0c9536a3aa9cc8248b6a95cc89762823879bd8cf/EventFilter/Utilities/src/FastMonitoringService.cc#L821-L838

Dr15Jones commented 1 year ago

So after modifying FastMonitoringService to have an option to not do LogInfo on its own thread and using that option, I did not see any loss of TrigReport messages for 10 runs of the job. Previously I'd see some losses after about 3 runs of the job.

makortel commented 3 months ago

+core

I think this issue can be closed by now

makortel commented 3 months ago

@cmsbuild, please close

cmsbuild commented 3 months ago

cms-bot internal usage

cmsbuild commented 3 months ago

This issue is fully signed and ready to be closed.

cms-sw / cmssw

how to skip events which throw a `cms::Exception` #41512

this is the type thrown by FailingProducer