Closed missirol closed 3 months ago
A new Issue was created by @missirol Marino Missiroli.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign core
New categories assigned: core
@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
Is there a specific reason for these consuming modules (whose ProductNotFound
exception you'd like to ignore) to be in an EndPath?
I think so.
The "consuming" modules that can throw a "ProductNotFound" in this case are instances of TriggerRatesMonitor
and TriggerBxMonitor
. I think they are on a EndPath in the current HLT menu in order to consume the TriggerResults
of the process they are in (meaning, cms.Process("HLT")
).
I'm thinking that for these DQM-oriented modules a LogError
might be more appropriate than an Exception, following what we already do in a similar case, i.e. L1TriggerJSONMonitoring.cc#L244.
Just to add something which is not obvious: the EndPaths that contain instances of TriggerRatesMonitor
, TriggerBxMonitor
and *TriggerJSONMonitoring
are not part of the "offline" HLT menus that are included in CMSSW under HLTrigger/Configuration
. Those EndPaths are only part of the HLT menus that we actually run online (for example, they produce outputs used by the online DQM).
It seems like you have explored the limits of the framework's exception handling mechanisms. I should note that the behavior of SkipEvent
has not been stress-tested or used in production (for a long time at least), so some unexpected behavior might occur.
Another caveat is that are the relevant modules exception safe? I mean, we typically treat (most) exceptions as fatal and shut down the framework, in which case module e.g. managing resources in non-RAII way is not too bad in practice (as the OS cleans up the memory and file handles anyway). But continuing data processing with an exception thrown from a non-exception-safe can result undesirable behavior later in the process. (I didn't check either of the modules in question here).
In the short term I'd look into making the TriggerRatestMonitor
, possibly also the L1TRawToDigi
, to not throw exceptions (in these cases, at least; I though HLT had a policy or desire to avoid exceptions when processing can continue?) in the first place.
An alternative could be to extend TriggerResultsFilter
to handle the case where some Path is in an error state (by quick look it currently just checks if a Path was accepted), like being able to reject the Event if any Path is in an error state. Then you could use such TriggerResultsFilter
in those EndPaths that you'd want to not run in case of SkipEvent
exception.
Another caveat is that are the relevant modules exception safe?
Just to clarify: the module is not destroyed, it "just" exits from produce()
when the exception is thrown. Correct ?
In the short term I'd look into making [...]
L1TRawToDigi
, to not throw exceptions
Do you mean to change L1TRawToDigi
to - for example - issue a LogError
and not produce any collection, instead of throwing an exception ?
That could work, but in turn it would require to change all downstream consumers to check if the collections are actually available, and so on.
Another caveat is that are the relevant modules exception safe?
Just to clarify: the module is not destroyed, it "just" exits from
produce()
when the exception is thrown. Correct ?
Correct.
In the short term I'd look into making [...]
L1TRawToDigi
, to not throw exceptionsDo you mean to change
L1TRawToDigi
to - for example - issue aLogError
and not produce any collection, instead of throwing an exception ? That could work, but in turn it would require to change all downstream consumers to check if the collections are actually available, and so on.
Yes, or produce an empty data product.
In the short term I'd look into making the TriggerRatestMonitor, possibly also the L1TRawToDigi, to not throw exceptions (in these cases, at least; I though HLT had a policy or desire to avoid exceptions when processing can continue?) in the first place.
For TriggerRatestMonitor
, I agree (this goes along the lines of https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1533523721), and it is easy enough for me to implement.
For L1TRawToDigi
, I'm less certain. Implementing changes to it likely requires more L1T-sw expertise than I have. In addition, I think that one concern is that not throwing an exception when 'something' fundamentally wrong happens in L1TRawToDigi
(e.g. corrupted data) could make it more difficult to (1) realise that there is a problem online, and (2) push experts to fix the root cause of the issue (the possibility of actually using skipEvent
online is only being contemplated now, after many days with hundreds of HLT crashes, and obviously it would be a temporary measure).
An alternative could be to extend
TriggerResultsFilter
to handle the case where some Path is in an error state
Thanks, that's an interesting suggestion (I hadn't thought about it).
Aside from the complications related to EndPaths, I think I still need to understand better what happens in cms.Paths
only, when there is a module that throws a cms::Exception
and one tries to handle it with options.SkipEvent
.
I'm trying to understand what happens in the reproducer in [1].
options.SkipEvents
is configured to skip events with that Exception.::filter
method of HLTPrescaler
and HLTBool
[2] to see if that is accessed in the 'bad' event.TrigReport
suggests that modules like filterAlwaysTrue1
, intProducer1
and intProducer2
"Passed" for all 3 events, and printouts confirm that ::filter
ran in event-2 for modules like filterAlwaysTrue1
and filterAlwaysTrue2
(while I would have expected Visited=Passed=2
for those modules).I tried a 2nd reproducer [4], which is like [1] except for the fact that the 2nd Path is removed (so, there is only 1 Path, and 1 EndPath). The output of the 2nd reproducer is in [5], and it is stranger.
TrigReport
info of path1
changes, although all I did was to remove path2
without touching path1
: TrigReport
says that the filter right after throwException
, i.e. filterAlwaysTrue1
, ran for all 3 events, while the ensuing modules in path1
only ran for 2.throwException
did not throw.I don't understand most of what happens in path1
. What am I missing ?
[1] cmssw41512_testException1_cfg.py
import FWCore.ParameterSet.Config as cms
# Configuration file based on FWCore/Integration/test/testFrameworkExceptionHandling_cfg.py
badEventID = cms.untracked.EventID(1, 1, 2)
nStreams = 1
nRuns = 1
nLumisPerRun = 1
nEventsPerLumi = 3
nEventsPerRun = nLumisPerRun*nEventsPerLumi
nLumis = nRuns*nLumisPerRun
nEvents = nRuns*nEventsPerRun
process = cms.Process('TEST')
process.source = cms.Source('EmptySource',
firstRun = cms.untracked.uint32(1),
firstLuminosityBlock = cms.untracked.uint32(1),
firstEvent = cms.untracked.uint32(1),
numberEventsInLuminosityBlock = cms.untracked.uint32(nEventsPerLumi),
numberEventsInRun = cms.untracked.uint32(nEventsPerRun),
)
process.maxEvents = cms.untracked.PSet(
input = cms.untracked.int32(nEvents)
)
process.options = cms.untracked.PSet(
numberOfThreads = cms.untracked.uint32(nStreams),
numberOfStreams = cms.untracked.uint32(nStreams),
numberOfConcurrentRuns = cms.untracked.uint32(1),
numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2)
)
process.load('FWCore.PrescaleService.PrescaleService_cfi')
process.PrescaleService.lvl1Labels = ['Col0', 'Col1']
process.PrescaleService.lvl1DefaultLabel = 'Col0'
process.PrescaleService.prescaleTable = cms.VPSet(
cms.PSet(
pathName = cms.string('path1'),
prescales = cms.vuint32(1, 1)
)
)
process.throwException = cms.EDProducer('ExceptionThrowingProducer',
eventIDThrowOnEvent = badEventID
)
process.hltPrePath1 = cms.EDFilter( 'HLTPrescaler',
offset = cms.uint32( 0 ),
L1GtReadoutRecordTag = cms.InputTag( 'hltGtStage2Digis' )
)
process.filterAlwaysTrue1 = cms.EDFilter( 'HLTBool', result = cms.bool( True ))
process.filterAlwaysTrue2 = process.filterAlwaysTrue1.clone()
process.filterAlwaysTrue3 = process.filterAlwaysTrue1.clone()
process.intProducer1 = cms.EDProducer('ManyIntProducer', ivalue = cms.int32( 1 ))
process.intProducer2 = process.intProducer1.clone( ivalue = 2 )
process.path1 = cms.Path(
process.throwException
+ process.filterAlwaysTrue1
+ process.hltPrePath1
+ process.filterAlwaysTrue2
+ process.intProducer1
+ process.filterAlwaysTrue3
)
process.path2 = cms.Path(
process.throwException
+ process.intProducer2
)
process.theOutputModule = cms.OutputModule('PoolOutputModule',
fileName = cms.untracked.string( 'cmssw41512_testException1.root' ),
SelectEvents = cms.untracked.PSet( SelectEvents = cms.vstring(
'path1',
'path2',
)),
outputCommands = cms.untracked.vstring( 'keep *' )
)
process.endpath1 = cms.EndPath(
process.theOutputModule
)
process.options.wantSummary = True
process.options.SkipEvent = cms.untracked.vstring( 'IntentionalTestException' )
[2]
diff --git a/HLTrigger/HLTcore/plugins/HLTPrescaler.cc b/HLTrigger/HLTcore/plugins/HLTPrescaler.cc
index 8841b0c3024..4c7e63e6646 100644
--- a/HLTrigger/HLTcore/plugins/HLTPrescaler.cc
+++ b/HLTrigger/HLTcore/plugins/HLTPrescaler.cc
@@ -68,6 +68,9 @@ void HLTPrescaler::beginLuminosityBlock(edm::LuminosityBlock const& lb, edm::Eve
bool HLTPrescaler::filter(edm::Event& iEvent, const edm::EventSetup&) {
// during the first event of a LumiSection, read from the GT the prescale index for this
// LumiSection and get the corresponding prescale factor from the PrescaleService
+
+ edm::LogPrint("HLTPrescaler") << "HLTPrescaler -- " << iEvent.id() << ", module = " << moduleDescription().moduleLabel() << ", path = " << iEvent.moduleCallingContext()->placeInPathContext()->pathContext()->pathName();
+
if (newLumi_) {
newLumi_ = false;
diff --git a/HLTrigger/HLTfilters/plugins/BuildFile.xml b/HLTrigger/HLTfilters/plugins/BuildFile.xml
index cb29742c3a0..12f2cba6bc7 100644
--- a/HLTrigger/HLTfilters/plugins/BuildFile.xml
+++ b/HLTrigger/HLTfilters/plugins/BuildFile.xml
@@ -18,6 +18,7 @@
<use name="FWCore/Framework"/>
<use name="FWCore/MessageLogger"/>
<use name="FWCore/ParameterSet"/>
+<use name="FWCore/ServiceRegistry"/>
<use name="FWCore/Utilities"/>
<use name="HLTrigger/HLTcore"/>
<flags EDM_PLUGIN="1"/>
diff --git a/HLTrigger/HLTfilters/plugins/HLTBool.cc b/HLTrigger/HLTfilters/plugins/HLTBool.cc
index 1d229804150..dafe18de3f5 100644
--- a/HLTrigger/HLTfilters/plugins/HLTBool.cc
+++ b/HLTrigger/HLTfilters/plugins/HLTBool.cc
@@ -12,6 +12,10 @@
#include "FWCore/ParameterSet/interface/ConfigurationDescriptions.h"
#include "FWCore/ParameterSet/interface/ParameterSetDescription.h"
+#include "FWCore/ServiceRegistry/interface/PathContext.h"
+#include "FWCore/ServiceRegistry/interface/PlaceInPathContext.h"
+#include "FWCore/ServiceRegistry/interface/ModuleCallingContext.h"
+
//
// constructors and destructor
//
@@ -32,4 +36,9 @@ void HLTBool::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
//
// ------------ method called to produce the data ------------
-bool HLTBool::filter(edm::StreamID, edm::Event& event, edm::EventSetup const& setup) const { return result_; }
+bool HLTBool::filter(edm::StreamID, edm::Event& event, edm::EventSetup const& setup) const {
+
+ edm::LogPrint("HLTBool") << "HLTBool -- " << event.id() << ", module = " << moduleDescription().moduleLabel() << ", path = " << event.moduleCallingContext()->placeInPathContext()->pathContext()->pathName();
+
+ return result_;
+}
[3]
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 07-May-2023 15:13:29.405 CEST
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue1, path = path1
HLTPrescaler -- run: 1 lumi: 1 event: 1, module = hltPrePath1, path = path1
%MSG-w HLT: HLTPrescaler:hltPrePath1 07-May-2023 15:13:29 CEST Run: 1 Event: 1
Cannot read prescale column index from GT1 data: using default as defined by configuration or DAQ
%MSG
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue2, path = path1
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue3, path = path1
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 07-May-2023 15:13:29.409 CEST
HLTBool -- run: 1 lumi: 1 event: 2, module = filterAlwaysTrue1, path = path1
HLTPrescaler -- run: 1 lumi: 1 event: 2, module = hltPrePath1, path = path1
HLTBool -- run: 1 lumi: 1 event: 2, module = filterAlwaysTrue2, path = path1
HLTBool -- run: 1 lumi: 1 event: 2, module = filterAlwaysTrue3, path = path1
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 0 at 07-May-2023 15:13:29.410 CEST
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue1, path = path1
HLTPrescaler -- run: 1 lumi: 1 event: 3, module = hltPrePath1, path = path1
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue2, path = path1
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue3, path = path1
TrigReport ---------- Event Summary ------------
TrigReport Events total = 3 passed = 2 failed = 1
TrigReport ---------- Path Summary ------------
TrigReport Trig Bit# Executed Passed Failed Error Name
TrigReport 1 0 3 2 1 0 path1
TrigReport 1 1 3 2 1 0 path2
TrigReport -------End-Path Summary ------------
TrigReport Trig Bit# Executed Passed Failed Error Name
TrigReport 0 0 3 3 0 0 endpath1
TrigReport ---------- Modules in Path: path1 ------------
TrigReport Trig Bit# Visited Passed Failed Error Name
TrigReport 1 0 3 2 0 1 throwException
TrigReport 1 1 3 3 0 0 filterAlwaysTrue1
TrigReport 1 2 3 3 0 0 hltPrePath1
TrigReport 1 3 3 3 0 0 filterAlwaysTrue2
TrigReport 1 4 3 3 0 0 intProducer1
TrigReport 1 5 3 3 0 0 filterAlwaysTrue3
TrigReport ---------- Modules in Path: path2 ------------
TrigReport Trig Bit# Visited Passed Failed Error Name
TrigReport 1 0 3 2 0 1 throwException
TrigReport 1 1 3 3 0 0 intProducer2
TrigReport ------ Modules in End-Path: endpath1 ------------
TrigReport Trig Bit# Visited Passed Failed Error Name
TrigReport 0 0 3 3 0 0 theOutputModule
TrigReport ---------- Module Summary ------------
TrigReport Visited Executed Passed Failed Error Name
TrigReport 3 3 3 0 0 TriggerResults
TrigReport 3 3 3 0 0 endpath1
TrigReport 3 3 3 0 0 filterAlwaysTrue1
TrigReport 3 3 3 0 0 filterAlwaysTrue2
TrigReport 3 3 3 0 0 filterAlwaysTrue3
TrigReport 3 3 3 0 0 hltPrePath1
TrigReport 3 3 3 0 0 intProducer1
TrigReport 3 3 3 0 0 intProducer2
TrigReport 3 3 3 0 0 path1
TrigReport 3 3 3 0 0 path2
TrigReport 3 3 3 0 0 theOutputModule
TrigReport 6 3 2 0 1 throwException
[4] cmssw41512_testException2_cfg.py
import FWCore.ParameterSet.Config as cms
# Configuration file based on FWCore/Integration/test/testFrameworkExceptionHandling_cfg.py
badEventID = cms.untracked.EventID(1, 1, 2)
nStreams = 1
nRuns = 1
nLumisPerRun = 1
nEventsPerLumi = 3
nEventsPerRun = nLumisPerRun*nEventsPerLumi
nLumis = nRuns*nLumisPerRun
nEvents = nRuns*nEventsPerRun
process = cms.Process('TEST')
process.source = cms.Source('EmptySource',
firstRun = cms.untracked.uint32(1),
firstLuminosityBlock = cms.untracked.uint32(1),
firstEvent = cms.untracked.uint32(1),
numberEventsInLuminosityBlock = cms.untracked.uint32(nEventsPerLumi),
numberEventsInRun = cms.untracked.uint32(nEventsPerRun),
)
process.maxEvents = cms.untracked.PSet(
input = cms.untracked.int32(nEvents)
)
process.options = cms.untracked.PSet(
numberOfThreads = cms.untracked.uint32(nStreams),
numberOfStreams = cms.untracked.uint32(nStreams),
numberOfConcurrentRuns = cms.untracked.uint32(1),
numberOfConcurrentLuminosityBlocks = cms.untracked.uint32(2)
)
process.load('FWCore.PrescaleService.PrescaleService_cfi')
process.PrescaleService.lvl1Labels = ['Col0', 'Col1']
process.PrescaleService.lvl1DefaultLabel = 'Col0'
process.PrescaleService.prescaleTable = cms.VPSet(
cms.PSet(
pathName = cms.string('path1'),
prescales = cms.vuint32(1, 1)
)
)
process.throwException = cms.EDProducer('ExceptionThrowingProducer',
eventIDThrowOnEvent = badEventID
)
process.hltPrePath1 = cms.EDFilter( 'HLTPrescaler',
offset = cms.uint32( 0 ),
L1GtReadoutRecordTag = cms.InputTag( 'hltGtStage2Digis' )
)
process.filterAlwaysTrue1 = cms.EDFilter( 'HLTBool', result = cms.bool( True ))
process.filterAlwaysTrue2 = process.filterAlwaysTrue1.clone()
process.filterAlwaysTrue3 = process.filterAlwaysTrue1.clone()
process.intProducer1 = cms.EDProducer('ManyIntProducer', ivalue = cms.int32( 1 ))
process.intProducer2 = process.intProducer1.clone( ivalue = 2 )
process.path1 = cms.Path(
process.throwException
+ process.filterAlwaysTrue1
+ process.hltPrePath1
+ process.filterAlwaysTrue2
+ process.intProducer1
+ process.filterAlwaysTrue3
)
#process.path2 = cms.Path(
# process.throwException
# + process.intProducer2
#)
process.theOutputModule = cms.OutputModule('PoolOutputModule',
fileName = cms.untracked.string( 'cmssw41512_testException2.root' ),
SelectEvents = cms.untracked.PSet( SelectEvents = cms.vstring(
'path1',
# 'path2',
)),
outputCommands = cms.untracked.vstring( 'keep *' )
)
process.endpath1 = cms.EndPath(
process.theOutputModule
)
process.options.wantSummary = True
process.options.SkipEvent = cms.untracked.vstring( 'IntentionalTestException' )
[5]
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 07-May-2023 15:13:50.329 CEST
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue1, path = path1
HLTPrescaler -- run: 1 lumi: 1 event: 1, module = hltPrePath1, path = path1
%MSG-w HLT: HLTPrescaler:hltPrePath1 07-May-2023 15:13:50 CEST Run: 1 Event: 1
Cannot read prescale column index from GT1 data: using default as defined by configuration or DAQ
%MSG
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue2, path = path1
HLTBool -- run: 1 lumi: 1 event: 1, module = filterAlwaysTrue3, path = path1
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 07-May-2023 15:13:50.331 CEST
HLTBool -- run: 1 lumi: 1 event: 2, module = filterAlwaysTrue1, path = path1
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 0 at 07-May-2023 15:13:50.331 CEST
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue1, path = path1
HLTPrescaler -- run: 1 lumi: 1 event: 3, module = hltPrePath1, path = path1
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue2, path = path1
HLTBool -- run: 1 lumi: 1 event: 3, module = filterAlwaysTrue3, path = path1
TrigReport ---------- Event Summary ------------
TrigReport Events total = 3 passed = 2 failed = 1
TrigReport ---------- Path Summary ------------
TrigReport Trig Bit# Executed Passed Failed Error Name
TrigReport 1 0 3 2 1 0 path1
TrigReport -------End-Path Summary ------------
TrigReport Trig Bit# Executed Passed Failed Error Name
TrigReport 0 0 3 3 0 0 endpath1
TrigReport ---------- Modules in Path: path1 ------------
TrigReport Trig Bit# Visited Passed Failed Error Name
TrigReport 1 0 3 2 0 1 throwException
TrigReport 1 1 3 3 0 0 filterAlwaysTrue1
TrigReport 1 2 2 2 0 0 hltPrePath1
TrigReport 1 3 2 2 0 0 filterAlwaysTrue2
TrigReport 1 4 2 2 0 0 intProducer1
TrigReport 1 5 2 2 0 0 filterAlwaysTrue3
TrigReport ------ Modules in End-Path: endpath1 ------------
TrigReport Trig Bit# Visited Passed Failed Error Name
TrigReport 0 0 3 3 0 0 theOutputModule
TrigReport ---------- Module Summary ------------
TrigReport Visited Executed Passed Failed Error Name
TrigReport 3 3 3 0 0 TriggerResults
TrigReport 3 3 3 0 0 endpath1
TrigReport 3 3 3 0 0 filterAlwaysTrue1
TrigReport 2 2 2 0 0 filterAlwaysTrue2
TrigReport 2 2 2 0 0 filterAlwaysTrue3
TrigReport 2 2 2 0 0 hltPrePath1
TrigReport 2 2 2 0 0 intProducer1
TrigReport 3 3 3 0 0 path1
TrigReport 3 3 3 0 0 theOutputModule
TrigReport 3 3 2 0 1 throwException
What am I missing ?
The framework schedules all the producers and analyzers between two filters (say A and B) to be run concurrently, along with the latter filter (B). If the latter filter (B) accepts the event, the framework schedules the next group of producers and analyzers until (and including) the following filter, etc. The scheduled modules are run in some order that satisfies the data dependencies of the modules.
In the case of SkipEvent
being set, the framework checks if it should stop processing only when the Path schedules the next group of modules to be run. For modules that have already been scheduled, but not yet run, only the presence of an exception from an earlier module in the data dependence chain is checked (i.e. the module is not run if an earlier module in the data dependence chain threw an exception).
Therefore in your example [1,3] the following module running order is valid
filterAlwaysTrue1
hltPrePath1
filterAlwaysTrue2
intProducer1
filterAlwaysTrue3
throwException
intProducer2
In the single thread case the module execution order gets dictated by the order the modules are scheduled, as the TBB's per-thread work queue acts like a stack. The paths are processed in the reverse order (i.e. modules in path2
are scheduled before modules in path1
), and also for each group of eligible modules the modules are scheduled in reverse order (i.e. in path2
the intProducer2
is scheduled before throwException
). From all these it follows that in absence of data dependencies all other modules of path1
get run before the throwException
.
In the second example [4,5] the removal of path2
leads to throwException
being scheduled as part of path1
, and the module running order is
throwException
filterAlwaysTrue1
The filterAlwaysTrue1
is run because it was scheduled before the throwException
threw the exception, and because of the exception the remaining modules in path1
are not scheduled.
Thanks for the clarification, @makortel ! (much needed)
An alternative could be to extend
TriggerResultsFilter
to handle the case where some Path is in an error state (by quick look it currently just checks if a Path was accepted), like being able to reject the Event if any Path is in an error state. Then you could use suchTriggerResultsFilter
in those EndPaths that you'd want to not run in case ofSkipEvent
exception.
After playing a bit with a test configuration (in the context of https://github.com/cms-sw/cmssw/issues/41645) it actually seems to that a module throwing an exception with category in the SkipEvent
, the corresponding "aborted" Path is not set in error state, but in "failed" (or "rejected") state, so TriggerResultsFilter
should work in this case out of the box.
the corresponding "aborted" Path is not set in error state, but in "failed" (or "rejected") state, so TriggerResultsFilter should work in this case out of the box.
I think I noticed that, but somehow I still couldn't get my example with skipEvents
to behave as I was expecting; I might have to take a second look.
For the sake of documenting, I wrote in CMSHLT-2793 what I tried in the context of the problem that triggered this issue. In this particular case, I think one possibility would have been to use options.IgnoreCompletely
: using that plus a few filters I think I was able to redirect the bad events to a separate output module, and to have them rejected by all other Paths (standard triggers).
Luckily, a recent update of the L1T menu seems to have fixed the root cause of the problem, and we haven't seen HLT crashes online in the last few days.
Following up on the "conceptual design" in https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1554925070 with the next, more concrete iteration of the plan
We will remove SkipEvent
and FailPath
and replace them with TryToContinue
parameter, that behaves as described below
SkipEvent
and FailPath
, or behaviors more or less implied by their names, are not the most useful oneprocess.options.TryToContinue
configuration parameter would be an untracked.vstring
for exception categoriesWhen a module A
throws an exception whose category is in TryToContinue
from its Event function (i.e. analyze/produce/filter/acquire)
A
's data products (directly or transitively) would not be run, unless explicitly requested (more on this further below)A
, or any module depending on A
, would be end up in "error" state, and no further modules in the Path would be scheduled to run. Modules that have already been scheduled and are independent of A
would still be run. TriggerResults
with an "error" stateA
would be scheduled and run as usualedm::Handle::whyFailed()
. We add a new event data product (say ExceptionInfo
) where we would store the labels of the module(s) that threw TryToContinue
exceptions, and the exception category (behaviorally somewhat similar to TriggerResults
)
ExceptionInfo
after all modules that are not decorated with cms.catch()
(see below) have been run (or known not to run because of the event filtering). In case of no exceptions were thrown, an "all ok" data product would be produced.ExceptionInfo
can be persistedExceptionInfo
to contain partial or full exception message if that turns out be usefulImpact to module scheduling
cms.catch()
when they are placed in EndPath or Task
ExceptionInfo
must be decorated with cms.catch()
cms.catch()
-decorated modules into Path (either directly or via Sequence) is disallowed. If attempted the job terminates with an exception.cms.catch()
-decorated module is put in more than one Sequence/Task/EndPath, and not all of them have the cms.catch()
decoration, all such occurrences of the module are treated as if it were decorated with cms.catch()
everywhere.cms.catch()
-decorated module are ignoredRethrow
std::bad_alloc
) to be rethrown as wellExceptionFilter
) that would consume the ExceptionInfo
, and accept the Event only if any TryToContinue
exception was thrownBehavior of OutputModules follows from the description above; some examples below
A
, it will not run for the exception Event, except ifcms.catch()
, in which case it will run for all Events, except if
ExceptionFilter
, in which case it will run only for the exception EventA
, it will run for all Events (including exception Event), unlessExceptionFilter
, in which case it will run only for the exception EventEvery time an exception is ignored (with IgnoreCompletely
or with TryToContinue
), a log message is printed containing the exception message
We might want to log the ignored exceptions also in the framework job report
If any exception is thrown from Event processing whose category is not listed in TryToContinue
, the job is terminated
Exceptions from other parts than module Event functions (e.g. Runs, Lumis, EventSetup, Source, OutputModule) would continue to result job termination
All the new names are preliminary (suggestions for better names are welcome).
PR #42441 implements the TryToContinue
although changes were made to the behavior based on the initial attempt to write the code. The changes are
cms.catch
was not used, instead one marks a module by calling the python method shouldTryToContinue()
. This was done as we ultimately determined that the property we were trying to create is really a property of a module and NOT the property of how a module appears in a given Path. shouldTryToContinue()
can be placed on Paths without a problem.shouldTryToContinue()
throws an exception, the exception is only ignored if the exception is in the TryToContinue
list. This avoids accidentally hiding infrastructure failures such as a std::bad_alloc
exception.ExceptionInfo
was not created. Trying to figure out exactly how to schedule the creation of the ExceptionInfo
turned out to be extremely difficult (lots of potential circular dependencies could occur). It also turned out that the SelectEvents
of an OutputModule already supports filtering on Paths which have seen an exception, via exception@<pathname>
. When used in conjunction with TryToContinue
one can trigger an OutputModule to be run when an exception happens and still continue Event processing. Such abilities should provide a mechanism for the HLT to create an error stream and keep known problems from stopping a job.@Dr15Jones @makortel I found the initial description of the behaviour of the output modules a bit confusing.
In particular what does it mean that an output module depends on a module that throws an exception ?
Does it mean that it has a matching keep
statement ?
Does it mean that it has a SelectEvents
entry for a Path
that includes that module ?
Could you summarise how the output modules should behave, also in view of later changes?
In particular what does it mean that an output module depends on a module that throws an exception ? Does it mean that it has a matching keep statement ?
Yes. It also means any modules needed by those modules which created the data products to 'keep'.
Does it mean that it has a SelectEvents entry for a Path that includes that module ?
Not exactly. A Path which sees an exception (either because a module on the Path throws the exception OR an unscheduled module needed by the module on the Path throws an exception) will be marked as having an 'error' status. The OutputModule will only run if at least one of its Paths doesn't see the exception and the Path succeeds as normal.
To illustrate, I've put together a small program using some dummy test modules. (NOTE: I modified AddIntsProducer so that it would ignore any data products which are missing from the Event).
The following is a trivialized representation of the HLT.
shouldTryToContinue()
so that even if one of the filters dependent on a module that fails, the globalTrigger will still run.
import FWCore.ParameterSet.Config as cms
process = cms.Process("TEST")
process.source = cms.Source("EmptySource")
process.maxEvents.input = 3
process.options.TryToContinue = ["NotFound"] process.options.wantSummary = True
process.trackingHits = cms.EDProducer("FailingProducer") process.tracks = cms.EDProducer("AddIntsProducer", labels = cms.VInputTag("trackingHits")) process.trackFilter = cms.EDFilter("IntProductFilter", label = cms.InputTag("tracks"), threshold = cms.int32(0), shouldProduce = cms.bool(True) )
process.caloClusters = cms.EDProducer("IntProducer", ivalue = cms.int32(1)) process.caloFilter = cms.EDFilter("IntProductFilter", label = cms.InputTag("caloClusters"), threshold = cms.int32(0), shouldProduce = cms.bool(True) )
process.globalTrigger = cms.EDProducer("AddIntsProducer", labels = cms.VInputTag("trackFilter","caloFilter")) process.globalTrigger.shouldTryToContinue()
process.trackPath = cms.Path(process.trackingHits+process.tracks+process.trackFilter) process.caloPath = cms.Path(process.caloClusters+process.caloFilter) process.globalTriggerPath = cms.Path(process.globalTrigger)
outputTemplate = cms.OutputModule("AsciiOutputModule", outputCommands = cms.untracked.vstring("drop *", "keep edmTriggerResults__*"), SelectEvents = cms.untracked.PSet(SelectEvents = cms.vstring()))
process.trackOut = outputTemplate.clone(SelectEvents = dict(SelectEvents=["trackPath"])) process.caloOut = outputTemplate.clone(SelectEvents = dict(SelectEvents=["caloPath"])) process.trackAndCaloOut = outputTemplate.clone(SelectEvents=dict(SelectEvents=["trackPath","caloPath"])) process.globalTriggerOut = outputTemplate.clone(outputCommands = ["drop ", "keep edmTriggerResults_*","keep _globalTrigger_TEST"]) process.exceptionOut = outputTemplate.clone(SelectEvents=dict(SelectEvents=["exception@*"]))
process.out = cms.EndPath(process.trackOut+process.caloOut+process.trackAndCaloOut+process.exceptionOut+process.globalTriggerOut)
when run, all the OutputModules write the 3 events except for the 'trackOut' which writes no events as the only path it depends upon never succeeds (i.e. trackPath is set to the error state for each Event).
From the summary we see
TrigReport ---------- Path Summary ------------ TrigReport Trig Bit# Executed Passed Failed Error Name TrigReport 1 0 3 0 0 3 trackPath TrigReport 1 1 3 3 0 0 caloPath TrigReport 1 2 3 3 0 0 globalTriggerPath
TrigReport ---------- Module Summary ------------ TrigReport Visited Executed Passed Failed Error Name TrigReport 3 3 3 0 0 TriggerResults TrigReport 3 3 3 0 0 caloClusters TrigReport 3 3 3 0 0 caloFilter TrigReport 3 3 3 0 0 caloOut TrigReport 3 3 3 0 0 caloPath TrigReport 3 3 3 0 0 exceptionOut TrigReport 3 3 3 0 0 globalTrigger TrigReport 3 3 3 0 0 globalTriggerOut TrigReport 3 3 3 0 0 globalTriggerPath TrigReport 3 3 3 0 0 out TrigReport 3 3 3 0 0 trackAndCaloOut TrigReport 3 0 0 0 3 trackFilter TrigReport 3 3 3 0 0 trackOut TrigReport 3 3 3 0 0 trackPath TrigReport 3 3 0 0 3 trackingHits TrigReport 3 0 0 0 3 tracks
hi Chris, thanks, I think the example clarifies better the expected behaviour, at least for me.
@fwyzard @missirol Matti and I chatted today and we are not convinced that the new shouldTryToContinue()
python method makes for the best API. A couple concerns are making it work with dumpPython
as well as what would happen if the module is cloned?
Given the shouldTryToContinue is related to how a job responds to an exception (and not to how the algorithm for a module should be run) we thought that specifying the list of modules in the process.options
might be better. This matches how one specifies the delete early of data products. The addition to the options
block would be a untracked.vstring()
named something like modulesThatShouldTryToContinue
. Alternatively, we could change the options
block parameter TryToContinue
to be a untracked.PSet
with the parameters exceptions
and modulesToContinue
. The down side of the latter makes it very different from the other exception handling specifiers.
We were hoping to get your thoughts on either the shouldTryToContinue
method to the Modules or the extension of the options
block.
Let me add explicitly @Martin-Grunewald and @mmusich to the discussion.
ping @Martin-Grunewald @mmusich @missirol
Hi @Dr15Jones , thanks for these updates. I only have some naive feedback after reading the info above without testing #42441.
Fwiw, I agree with
specifying the list of modules in the
process.options
might be better.
One reason is that this makes it easy to control this in ConfDB
(where we can already define the options
PSet as needed, while I would not know how to handle .shouldTryToContinue()
).
The addition to the
options
block would be auntracked.vstring()
named something likemodulesThatShouldTryToContinue
. Alternatively, we could change theoptions
block parameterTryToContinue
to be auntracked.PSet
with the parametersexceptions
andmodulesToContinue
. The down side of the latter makes it very different from the other exception handling specifiers.
Regarding how to extend the options
PSet, I don't see particular constraints from the HLT side, so I would leave this to Core-Sw to decide.
Regarding how to extend the
options
PSet, I don't see particular constraints from the HLT side, so I would leave this to Core-Sw to decide.
I'd actually be interested to learn how the process.options
is configured for the HLT processes at online.
I'd actually be interested to learn how the
process.options
is configured for the HLT processes at online.
(Here's just what I think I know. Andrea or others can comment better.)
By default, the HLT menus in ConfDB
do not have an options
PSet. Online, a ConfDB
menu is converted into a python config (I assume with hltConfigFromDB
, or something very close to it), and DAQ appends to it a "patch" which originates from this template, where some of the parameters of process.options
are changed (other than this, the defaults of options
are used).
If necessary, it is possible to define a global PSet named options
in a ConfDB
configuration, where one would just specify the options
parameters with non-default values [*] (but again, this is currently not done for any menu used in production).
[*] One example is
hltConfigFromDB --configName --adg /cdaq/test/missirol/test/2023/week18/CMSLITOPS_411/Test02/HLT/V5 > foo.py
which was used to test the use of options.IgnoreCompletely
in the context of CMSHLT-2793.
Another example is
hltConfigFromDB --runNumber 355207 > bar.py # menu: /cdaq/physics/firstCollisions22/v4.0/HLT/V1
in which GPU offloading was disabled online by defining the following PSet in ConfDB
options = cms.untracked.PSet( accelerators = cms.untracked.vstring( 'cpu' ) )
Trying to give more concrete feedback on #42441.
I took that PR, specifically ab11897d..106c6518
(used on top of 13_3_0_pre1
), and tried to use it with a realistic HLT menu running on the problematic data that started this discussion (events where a cms::Exception
is thrown by the L1T unpacker). This test is in [1]: the input file contains 100 events, one of which throws an exception of type InvalidGlobalAlgBlkBxCollection
from the module hltGtStage2Digis
(plugin type: L1TRawToDigi
), and the script contains four cmsRun
jobs ("hlt1", "hlt2", "hlt3" and "hlt4"); all jobs use the same (100) input events. The cmsRun
logs of the four jobs are the following: hlt1.log, hlt2.log, hlt3.log and hlt4.log.
options.TryToContinue
in order to skip the event throwing the exception.
OutputModule
meant to collect the event that throw the exception. "hlt3" tries to do so respecting the rules used to build OutputModule
s in the current HLT menus, rules which are enforced by the ConfDB
GUI [2].
HLT_PathStatusError_v1
is meant to accept all events where at least one of the other triggers went into error state. HLT_PathStatusError_v1
is assigned to a Primary Dataset called HLTError
, which is in turn assigned to a "stream" called HLTError
. To implement HLT_PathStatusError_v1
, I had to introduce a new EDFilter
to check for the presence of Paths in error state (HLTPathStatusErrorFilter
in https://github.com/missirol/cmssw/commit/b67f107964f8bd7710914ea36c007bd03b65ccca, heavily based on TriggerResultsFilter
).hltGtStage2Digis
) inside the Paths HLT_PathStatusError_v1
and Dataset_HLTError
makes those Paths themselves go in error state for the problematic event. As a result, the problematic event is not written out by hltOutputHLTError
. This follows the description in [3], and it is a complication specific to the fact that the failing module is hltGtStage2Digis
: the latter module must be included in those Paths in order to know the index of the prescale column to be used by the instance of HLTPrescaler
in each of those Paths (in principle, hltGtStage2Digis
could be removed from HLT_PathStatusError_v1
, but the GUI would not allow to remove it from Dataset_HLTError
)."Modules in Path: HLT_PathStatusError_v1
(also, the lines around Modules in Path: Dataset_EphemeralZeroBias13
look incorrect). Not sure how to interpret this.modulesToCallForTryToContinue
is set to an empty list. In this case, the output of cmsRun
does include all the "TrigReport" messages (as far as I checked), e.g. lines like "Modules in Path: HLT_PathStatusError_v1"
.Tentative summary.
options.TryToContinue
works as expected in an example based on a realistic HLT menu (except maybe what happens with the "TrigReport" messages, see comments on "hlt3").hltGtStage2Digis
is particularly tricky to handle, given where this module needs to be used in the HLT menus. I did not find a way to send the problematic events to a dedicated dataset/stream using TryToContinue
while complying with the structure of the HLT menus. I should note that CMSHLT-2793 (comment) showed a way to handle this particular problem using IgnoreCompletely
(as opposed to TryToContinue
), so CMSSW already allows to solve this at configuration level, and maybe TryToContinue
(or the old SkipEvent
) is just not the best choice for this specific type of crashes.A few questions.
A
throwing exception B
, and (1) B
is in options.TryToContinue
and (2) all the modules in the Path that depend on A
are in options.modulesToCallForTryToContinue
, is it necessary that the state of the Path is marked as 'error' ?"TrigReport"
messages in "hlt3" and "hlt4" as expected ?TryToContinue
and/or modulesToCallForTryToContinue
accept wildcards ? (if not, would that be possible and useful ?)If you have suggestions/corrections, please let me know.
[1]
#!/bin/bash -ex
# cmsrel CMSSW_13_3_0_pre1
# cd CMSSW_13_3_0_pre1/src
# cmsenv
# git checkout -b test_cmssw42441
# git cms-merge-topic cms-sw:42441
# git cms-remote add missirol
# git cms-addpkg HLTrigger/HLTfilters
# git cp b67f107964f
# scram b -j 16
# run 366497, LS 196
INPUTFILE=root://eoscms.cern.ch//store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run366497/run366497_ls0196_index000095_fu-c2b01-26-01_pid1955211.root
RUNNUM=366497
run_test() {
rm -rf run"${RUNNUM}"*
convertToRaw -f 100 -l 100 -r "${RUNNUM}":196 -o . -- "${INPUTFILE}"
edmConfigDump "${tmpfile}" > "${1}".py
cmsRun "${1}".py &> "${1}".log || true
}
tmpfile=$(mktemp)
hltConfigFromDB --runNumber "${RUNNUM}" > "${tmpfile}"
cat <<@EOF >> "${tmpfile}"
process.load('run${RUNNUM}_cff')
# customisations to adapt 13_0_X HLT menus to CMSSW_13_3_0_pre1
from HLTrigger.Configuration.customizeHLTforCMSSW import customizeHLTforCMSSW
process = customizeHLTforCMSSW(process)
# remove prescales, and set GlobalTag
del process.PrescaleService
process.GlobalTag.globaltag = '130X_dataRun3_HLT_v2'
# show statistics on decisions of modules and Paths
process.options.wantSummary = True
# number of threads/streams used online by HLT
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF
# hlt1: reproduce online crash seen in run 366497 (LS 196)
jobLabel="hlt1"
run_test "${jobLabel}"
# hlt2: hlt1 + "try to continue" upon exceptions of type 'InvalidGlobalAlgBlkBxCollection'
jobLabel="hlt2"
cat <<@EOF >> "${tmpfile}"
process.options.TryToContinue = cms.untracked.vstring( 'InvalidGlobalAlgBlkBxCollection' )
@EOF
run_test "${jobLabel}"
# hlt3: hlt2 + send events to a separate PrimaryDataset and stream
jobLabel="hlt3"
cat <<@EOF >> "${tmpfile}"
# New Path to select events in which at least one other Path went in Error state
process.HLTBeginSequenceAny = cms.Sequence( process.hltGtStage2Digis )
process.hltPrePathStatusError = cms.EDFilter( "HLTPrescaler",
offset = cms.uint32( 0 ),
L1GtReadoutRecordTag = cms.InputTag( "hltGtStage2Digis" )
)
process.hltPathStatusErrorFilter = cms.EDFilter( "HLTPathStatusErrorFilter",
ignoreInvalidPathNames = cms.bool(False),
# consider all Paths in the configuration ..
pathNames = cms.vstring( '*' ),
# .. except for the Path holding this module, as well as DatasetPaths
pathNamesToSkip = cms.vstring( 'HLT_PathStatusError_v*', 'Dataset_*' ),
)
process.HLT_PathStatusError_v1 = cms.Path(
process.HLTBeginSequenceAny
+ process.hltPrePathStatusError
+ process.hltPathStatusErrorFilter
+ process.HLTEndSequence
)
# "DatasetPath": Path using TriggerResultsFilter to select on other Paths
process.hltPreDatasetHLTError = cms.EDFilter( "HLTPrescaler",
offset = cms.uint32( 0 ),
L1GtReadoutRecordTag = cms.InputTag( "hltGtStage2Digis" )
)
process.hltDatasetHLTError = cms.EDFilter( "TriggerResultsFilter",
usePathStatus = cms.bool( True ),
hltResults = cms.InputTag( "" ),
l1tResults = cms.InputTag( "" ),
l1tIgnoreMaskAndPrescale = cms.bool( False ),
throw = cms.bool( True ),
triggerConditions = cms.vstring( 'HLT_PathStatusError_v1' )
)
process.Dataset_HLTError = cms.Path(
process.HLTDatasetPathBeginSequence
+ process.hltDatasetHLTError
+ process.hltPreDatasetHLTError
)
# "StreamPath": FinalPath with OutputModule selecting on DatasetPath
process.hltOutputHLTError = cms.OutputModule("GlobalEvFOutputModule",
SelectEvents = cms.untracked.PSet(
SelectEvents = cms.vstring(
'Dataset_HLTError'
)
),
compression_algorithm = cms.untracked.string('ZSTD'),
compression_level = cms.untracked.int32(3),
lumiSection_interval = cms.untracked.int32(0),
outputCommands = cms.untracked.vstring(
'drop *',
'keep FEDRawDataCollection_rawDataCollector_*_*',
'keep FEDRawDataCollection_source_*_*',
'keep GlobalObjectMapRecord_hltGtStage2ObjectMap_*_*',
'keep edmTriggerResults_*_*_*',
'keep triggerTriggerEvent_*_*_*'
),
psetMap = cms.untracked.InputTag("hltPSetMap"),
use_compression = cms.untracked.bool(True)
)
process.HLTErrorOutput = cms.FinalPath( process.hltOutputHLTError )
# update cms.Schedule adding new Path, DatasetPath, and StreamPath
process.schedule.extend([
process.HLT_PathStatusError_v1,
process.Dataset_HLTError,
process.HLTErrorOutput
])
# update the global PSets "datasets" and "streams"
# (just to mimic the db->python converter of ConfDB)
process.datasets.HLTError = cms.vstring( 'HLT_PathStatusError_v1' )
process.streams.HLTError = cms.vstring( 'HLTError' )
# prevent the Paths HLT_PathStatusError_v1 and Dataset_HLTError
# from going into Error state themselves
process.options.modulesToCallForTryToContinue = cms.untracked.vstring(
'hltPrePathStatusError',
'hltPreDatasetHLTError'
)
@EOF
run_test "${jobLabel}"
# hlt4: hlt3 but with an empty modulesToCallForTryToContinue
jobLabel="hlt4"
cat <<@EOF >> "${tmpfile}"
process.options.modulesToCallForTryToContinue = cms.untracked.vstring()
@EOF
run_test "${jobLabel}"
[2] In the current HLT menus, every OutputModule
is associated to one (and only one) so-called "stream" (a "stream" is a cms.FinalPath
). Every "stream" (or rather, its OutputModule
) selects the events accepted by the logical OR of a list of so-called "DatasetPaths" (a "DatasetPath" is a cms.Path
; "DatasetPaths" represent the Primary Datasets used offline; every "DatasetPath" must be assigned to one, and only one, "stream"). Every "DatasetPath" uses the plugin TriggerResultsFilter
to select events passing the logical OR of a set of triggers (a trigger is a cms.Path
). This structure is apparent looking, for example, at this HLT config. This means that, in order to have an output file with a certain set of events, one must define (as a minimum) one "stream", a "DatasetPath" assigned to said "stream", and a trigger assigned to that "DatasetPath" and accepting the events of interest. Conversely, configuring an OutputModule
to have SelectEvents = ["exception@*"]
like in https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1664127254 is not currently possible in HLT menus built with the ConfDB
GUI.
[3] (from https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1664127254)
A Path which sees an exception (either because a module on the Path throws the exception OR an unscheduled module needed by the module on the Path throws an exception) will be marked as having an 'error' status.
@missirol thanks very much for your very thorough report and attempting to use the TryToContinue mechanism.
Evidently we haven't made well known that the SelectEvents specification of OutputModules has supported from the very beginning the ability to trigger on paths that have an error. This is done by adding exception@
followed by the path name or *. Then if the path is in the error state, the OutputModule will be run.
Could you try an additional case where you get rid of your error path and try using @exception
instead?
Could you try an additional case where you get rid of your error path and try using
@exception
instead?
Sure, I tried this in [1] ("hlt5"). The job completes, and it produces (among other things) a streamer file named
run366497/run366497_ls0196_streamHLTError_pid3075644.dat
which contains the one event where the exception is thrown (checked after repacking it). So, it looks to me like it's working. The corresponding cmsRun
log is attached, hlt5.log.
[1]
#!/bin/bash -ex
# cmsrel CMSSW_13_3_0_pre1
# cd CMSSW_13_3_0_pre1/src
# cmsenv
# git checkout -b test_cmssw42441
# git cms-merge-topic cms-sw:42441
# git cms-remote add missirol
# git cms-addpkg HLTrigger/HLTfilters
# git cp b67f107964f
# scram b -j 16
# run 366497, LS 196
INPUTFILE=root://eoscms.cern.ch//store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/error_stream/run366497/run366497_ls0196_index000095_fu-c2b01-26-01_pid1955211.root
RUNNUM=366497
run_test() {
rm -rf run"${RUNNUM}"*
convertToRaw -f 100 -l 100 -r "${RUNNUM}":196 -o . -- "${INPUTFILE}"
edmConfigDump "${tmpfile}" > "${1}".py
cmsRun "${1}".py &> "${1}".log || true
}
tmpfile=$(mktemp)
hltConfigFromDB --runNumber "${RUNNUM}" > "${tmpfile}"
cat <<@EOF >> "${tmpfile}"
process.load('run${RUNNUM}_cff')
# customisations to adapt 13_0_X HLT menus to CMSSW_13_3_0_pre1
from HLTrigger.Configuration.customizeHLTforCMSSW import customizeHLTforCMSSW
process = customizeHLTforCMSSW(process)
# remove prescales, and set GlobalTag
del process.PrescaleService
process.GlobalTag.globaltag = '130X_dataRun3_HLT_v2'
# show statistics on decisions of modules and Paths
process.options.wantSummary = True
# number of threads/streams used online by HLT
process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF
# hlt5: send events with at least one Path in error state to a separate output file, using the syntax "exception@*"
jobLabel="hlt5"
cat <<@EOF >> "${tmpfile}"
process.options.TryToContinue = cms.untracked.vstring( 'InvalidGlobalAlgBlkBxCollection' )
# "StreamPath": FinalPath with OutputModule selecting on DatasetPath
process.hltOutputHLTError = cms.OutputModule("GlobalEvFOutputModule",
SelectEvents = cms.untracked.PSet(
SelectEvents = cms.vstring(
'exception@*'
)
),
compression_algorithm = cms.untracked.string('ZSTD'),
compression_level = cms.untracked.int32(3),
lumiSection_interval = cms.untracked.int32(0),
outputCommands = cms.untracked.vstring(
'drop *',
'keep FEDRawDataCollection_rawDataCollector_*_*',
'keep FEDRawDataCollection_source_*_*',
'keep GlobalObjectMapRecord_hltGtStage2ObjectMap_*_*',
'keep edmTriggerResults_*_*_*',
'keep triggerTriggerEvent_*_*_*'
),
psetMap = cms.untracked.InputTag("hltPSetMap"),
use_compression = cms.untracked.bool(True)
)
process.HLTErrorOutput = cms.FinalPath( process.hltOutputHLTError )
# update cms.Schedule adding (Final)Path with new OutputModule
process.schedule.append( process.HLTErrorOutput )
@EOF
run_test "${jobLabel}"
Sure, I tried this in [1] ("hlt5"). The job completes, and it produces (among other things) a streamer file named
@missirol Thanks! [I am trying to build an area based on your recipe from earlier but compilation will take a long time :) ]
Is the TrigReport
for this case also not what you expect? If so, could you explain what you did expect?
@missirol to better understand the TrigReport
, does hltDatasetEphemeralZeroBias
depend on the data product from hltGtStage2Digis
?
to better understand the
TrigReport
, doeshltDatasetEphemeralZeroBias
depend on the data product fromhltGtStage2Digis
?
No, the module hltDatasetEphemeralZeroBias
does not depend directly on hltGtStage2Digis
. (here one can get an idea of the dependencies, those modules did not change significantly in the recent past).
Is the
TrigReport
for this case also not what you expect? ["hlt5", editor's note]
Yes, with the caveat that I'm not doing any detailed checks, but rather just quickly scanning thousands of lines by eye looking for anything odd.
If so, could you explain what you did expect?
I'm mostly looking at lines that contain TrigReport ---------- Modules in Path
. I would expect to have one such line for every Path in the configuration. It looks like this is the case for all tests except for "hlt3.log" in https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1676475766. The only difference between "hlt3" and "hlt4" is that in the former the list modulesToCallForTryToContinue
is not empty. "hlt3.log" contains lines like [1], which just seem messed up. I know I have seen this before, but unfortunately I do not remember well when. It might have been when I was playing around with SkipEvent
, but I'm not sure at all.
[1]
TrigReport 1 27 0 0 0 0 hltSelectorCentralJets20L1FastJeta
TrigReport 1 43 0 0 0 0 hltFastPVJetVertexChecker
TrigReport 1 58 0 0 0 0 hltSelector8CentralJetsL1FastJet
TrigReport 1 74 0 0 0 0 hltIter1PFlowPixelHitDoubletsForDisplaced
TrigReport 1 89 0 0 0 0 hltIter2PFlowPixelClusterCheckForDisplaced
TrigReport ---------- Modules in Path: HLT_L1Tau_DelayedJet40_SingleDelay3p5nsInclusive_v2 ------------
TrigReport 1 38 2 2 0 0 hltHbhereco
TrigReport 1 11 2 2 0 0 hltAK4CaloJets
TrigReport 1 25 2 0 2 0 hltCaloJetTimingFilterSingleTau3p5ns
TrigReport Trig Bit# Visited Passed Failed Error Name
TrigReport 1 132 0 0 0 0 hltHbhereco
TrigReport 1 12 0 0 0 0 hltAK4CaloJetsIDPassed
TrigReport 1 27 0 0 0 0 hltSelectorCentralJets20L1FastJeta
TrigReport 1 40 0 0 0 0 hltFastPVPixelTracksHitQuadruplets
TrigReport 1 56 0 0 0 0 hltSelectorJets30L1FastJet
TrigReport 1 70 0 0 0 0 hltIter1MaskedMeasurementTrackerEventForDisplaced
TrigReport 1 85 0 0 0 0 hltIter2ClustersRefRemovalForDisplaced
TrigReport 1 100 0 0 0 0 hltL4PromptDisplacedDijetFullTracksJetTagProducerFromIPLow30Pt1PtrkShortSig5
TrigReport 1 115 0 0 0 0 hltIter4MergedWithIter012DisplacedJets
If so, could you explain what you did expect?
Another tiny example below. I think all are correct except for "hlt3", which is clearly off.
> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt1.log | wc -l
89
> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt2.log | wc -l
89
> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt3.log | wc -l
2
> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt4.log | wc -l
90
> grep 'TrigReport ---------- Modules in Path: Dataset_' hlt5.log | wc -l
89
So for the weirdness in hlt3.log, the only way I can even beginning to think how that printout could happen is if multiple threads were simultaneously running endJob
. If that is true, it would be very BAD.
I'll definitely look into.
Be that as it may, does the configuration in hlt5 suit your needs?
On a quick look, another thing different between hlt3.log and hlt5.log is the number of message logger messages that were skipped. For hlt3.log it was around 57,000 while hlt5.log only 150. It is possible that endJob
is NOT running concurrently, but instead another task is running concurrently with endJob and that task is output a lot of stuff to the message logger. If that were the case, then the hlt3.log could have some of its messages dropped in the TrigReport and that is what causes the weird looking output.
Maybe adding the Tracer service to the job might uncover the behavior.
Be that as it may, does the configuration in hlt5 suit your needs?
Not fully, in the sense that "hlt5" uses exception@*
in SelectEvents
and that syntax cannot be used (as of now) in the HLT menus we use in production, as those follow a rather stiff structure when it comes to how triggers, datasets and output modules are defined (this is what I briefly described in note [2] of https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1676475766).
As I was trying to explain in https://github.com/cms-sw/cmssw/issues/41512#issuecomment-1676475766, "hlt3" follows the HLT "rules", it uses TryToContinue
, and it is likely a good solution for skipping "excepted" events and sending them to a dedicated stream for most types of exceptions , just not those coming from hltGtStage2Digis
. The latter case is more complicated, because hltGtStage2Digis
must be inside every so-called "DatasetPath", so an exception in hltGtStage2Digis
gets every "DatasetPath" in error state.
@missirol what are the odds that the ConfDB limitations could be lifted? Seems like those might be more of a hinderance than a help for dealing with special cases.
In general, bad odds (as usual with ConfDB
), but I'm not sure this particular case comes down to just a limitation of ConfDB
.
The GUI and the db can often be limiting factors when it comes to supporting new features of CMSSW inside HLT menus, but here ConfDB
is simply enforcing the long-standing paradigm of having triggers assigned to datasets, and datasets assigned to streams. Having an OutputModule with SelectEvents = ["exception@*"]
(a stream of events without specific datasets assigned to it) would be an.. exception to this paradigm. Maybe it's not impossible, but it's something that requires careful consideration (the actual HLT experts are free to comment, of course).
My current take is that what the framework will provide after #42441 is sufficient for HLT.
#42441 basically fixes the old SkipEvent
, and TryToContinue
is what HLT could use to skip the vast majority of cms::Exception
s (pretty much any exception that does not come from L1TRawToDigi
, TriggerResultsFilter
or HLTPrescaler
; I think exceptions from the last two are truly unlikely, if even possible).
For the case of crashes from L1TRawToDigi
, I think they could be circumvented in a different way (e.g. IgnoreCompletely
).
Let me add that, based on discussions had in TSG, the likelihood of HLT deciding to bypass exceptions in production is very low. Skipping events at HLT is seen as an extreme measure, because doing so removes pressure from finding the actual solution to the problem (where the problem is usually outside of HLT, like this year in the case of corrupted data from L1T causing crashes when unpacked at HLT).
Let me add that, based on discussions had in TSG, the likelihood of HLT deciding to bypass exceptions in production is very low. Skipping events at HLT is seen as an extreme measure, because doing so removes pressure from finding the actual solution to the problem (where the problem is usually outside of HLT, like this year in the case of corrupted data from L1T causing crashes when unpacked at HLT).
From internal discussions with the framework team, we also see any use of TryToContinue
or IgnoreCompletely
as a short-time use until the actually underly problem can be fixed item and not something that should be used for standard workflows long term.
I think I probably determined why the TrigReport had a problem. It looks like messages are being dropped while the summary is being printed. I found that if I turned on INFO reporting, then FastMonitoringService
is periodically printing INFO messages during endJob.
It looks like some times the FastMonitoringService
printing might be slowed down (e.g. operating system temporarily halting that thread) and then since that thread becomes temporarily in-charge of printing all MessageLogger messages, the huge flood of messages from the summary overwhelms the queue and it starts to drop messages.
The message appears to be coming from here
So after modifying FastMonitoringService
to have an option to not do LogInfo on its own thread and using that option, I did not see any loss of TrigReport messages for 10 runs of the job. Previously I'd see some losses after about 3 runs of the job.
+core
I think this issue can be closed by now
@cmsbuild, please close
cms-bot internal usage
This issue is fully signed and ready to be closed.
The reproducer in [1] (
CMSSW_13_0_5_patch1
, input file on lxplus) tries to useoptions.skipEvent
(documented, for example, in SWGuideEdmExceptionUse#Framework_Exception_Handling) in order skip an event which is known to throw an exception of type"InvalidGlobalAlgBlkBxCollection"
from the modulehltStage2GtDigis
.Naively, I was expecting the job to skip the event and succeed. Instead, I see that the job fails because a different module on one EndPath throws a different exception ("ProductNotFound") while attempting to access the products of
hltStage2GtDigis
(which are likely not produced becausehltStage2GtDigis
fails due to"InvalidGlobalAlgBlkBxCollection"
). The error message of the reproducer is in [2]. @fwyzard spotted that the message quotes"Begin IgnoreCompletely"
, and does not quote"Begin SkipEvent"
. A simple search leads me to this:One workaround is to include
ProductNotFound
inoptions.skipEvent
.Question: are there "better" ways ?
Context : this issue is related to https://github.com/cms-sw/cmssw/issues/41489#issuecomment-1532696126, as we look into the feasibility of using
options.skipEvent
to avoid the frequent HLT crashes seen online these days due to the L1T unpacker (CMSLITOPS-411).FYI: @silviodonato @cms-sw/hlt-l2
[1]
[2]