cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

`Fatal Exception` in Prompt Reco of Run 367232, dataset `JetMET0` #41645

Open mmusich opened 1 year ago

mmusich commented 1 year ago

Dear all, there is one job failing Prompt Reco for run 367232, dataset `JetMET0`, with a `Fatal Exception`, as described in https://cms-talk.web.cern.ch/t/fatal-exception-in-prompt-reco-of-run-367232-datatset-jetmet0/23996

The crash seems to originate from the module L1TObjectsTiming:

----- Begin Fatal Exception 12-May-2023 03:17:41 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 367232 lumi: 190 event: 378449946 stream: 6
   [1] Running path 'dqmoffline_1_step'
   [2] Calling method for module L1TObjectsTiming/'l1tObjectsTiming'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 9) >= this->size() (which is 5)
----- End Fatal Exception -------------------------------------------------

The exception is reproducible on an lxplus8 node under CMSSW_13_0_5_patch2 (el8_amd64_gcc11). Full logs and PSet.py can be found at https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run367232_JetMET0/Reco/vocms014.cern.ch-415905-3-log.tar.gz
With this modified PSet.py file the crash occurs immediately:

import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
    process.options.numberOfThreads = 1
    process.source.skipEvents=cms.untracked.uint32(2683)

It should be noted that the crash is preceded by these warnings (perhaps related):

%MSG-w L1TStage2uGTTiming:   L1TStage2uGTTiming:l1tStage2uGTTiming@streamBeginRun 12-May-2023 08:33:34 CEST  Run: 367232 Stream: 0
Algo "L1_SingleJet60er2p5" not found in the trigger menu L1Menu_Collisions2023_v1_0_0. Could not retrieve algo bit number.
%MSG
%MSG-w L1TStage2uGTTiming:   L1TStage2uGTTiming:l1tStage2uGTTiming@streamBeginRun 12-May-2023 08:33:34 CEST  Run: 367232 Stream: 0
Algo "L1_SingleJet60_FWD3p0" not found in the trigger menu L1Menu_Collisions2023_v1_0_0. Could not retrieve algo bit number.
%MSG
mmusich commented 1 year ago

assign dqm, l1

cmsbuild commented 1 year ago

New categories assigned: dqm,l1

@tjavaid,@epalencia,@micsucmed,@nothingface0,@rvenditti,@emanueleusai,@syuvivida,@aloeliger,@cecilecaillol,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 1 year ago

A new Issue was created by @mmusich Marco Musich.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 1 year ago

with a bit of offline diagnosis, the crash occurs here:

https://github.com/cms-sw/cmssw/blob/9556ab969f11914c1583f8f720bf7f36b4269479/DQM/L1TMonitor/src/L1TObjectsTiming.cc#L819

the index equals 9, but muons_eta_phi.size() is 5.

mmusich commented 1 year ago

This apparently happens because in that event (Run 367232, Event 378449946, LumiSection 190) MuonBxCollection->getFirstBX() is -10 and MuonBxCollection->getLastBX() is 11, while in other events it seems to be constrained between -2 and 2.
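For illustration, the indices in both exception messages are consistent with the histogram index being computed directly from the bunch crossing as bx + bxrange_/2. This formula and the BX value of -13 in the last example are assumptions reconstructed from the observed values, not taken from the source:

```python
# Hypothetical reconstruction of the histogram-index arithmetic in
# L1TObjectsTiming (assumed form: index = bx + bxrange_ // 2).
BXRANGE = 5  # hard-coded number of expected bunch crossings: -2 .. +2

def histogram_index(bx):
    """Map a bunch crossing to an index into the histogram vector."""
    return bx + BXRANGE // 2

# Nominal BXs map into [0, 4], matching muons_eta_phi.size() == 5:
print([histogram_index(bx) for bx in range(-2, 3)])  # [0, 1, 2, 3, 4]

# A muon in BX 7 (as in the first crash) gives index 9, as observed:
print(histogram_index(7))  # 9

# An out-of-range negative BX (here -13, an assumed value) gives a
# negative index; interpreted as an unsigned 64-bit size_t by
# std::vector's range check, -11 wraps to the huge value seen in the
# second exception message:
print(histogram_index(-13) % 2**64)  # 18446744073709551605
```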

mmusich commented 1 year ago

There is now a second Prompt Reco paused job with the same symptoms.

mmusich commented 1 year ago

urgent

mmusich commented 1 year ago

There is now a second Prompt Reco paused job with the same symptoms.

The second crash is also reproducible under CMSSW_13_0_5_patch1 on lxplus8 with the following PSet.py:

import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
    process.options.numberOfThreads = 1
    process.source.skipEvents=cms.untracked.uint32(680)

(taking the rest of the configuration here)

The exception message is:

----- Begin Fatal Exception 13-May-2023 18:36:50 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 367315 lumi: 40 event: 94344670 stream: 0
   [1] Running path 'dqmoffline_1_step'
   [2] Calling method for module L1TObjectsTiming/'l1tObjectsTiming'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 18446744073709551605) >= this->size() (which is 5)
----- End Fatal Exception -------------------------------------------------
mmusich commented 1 year ago

Technically this change circumvents the crashes, though it would be good if the L1T / DQM experts find the root cause.

diff --git a/DQM/L1TMonitor/interface/L1TObjectsTiming.h b/DQM/L1TMonitor/interface/L1TObjectsTiming.h
index d39f6a4fe96..42b3ff4616e 100644
--- a/DQM/L1TMonitor/interface/L1TObjectsTiming.h
+++ b/DQM/L1TMonitor/interface/L1TObjectsTiming.h
@@ -38,6 +38,8 @@ protected:
   void dqmBeginRun(const edm::Run&, const edm::EventSetup&) override;
   void bookHistograms(DQMStore::IBooker&, const edm::Run&, const edm::EventSetup&) override;
   void analyze(const edm::Event&, const edm::EventSetup&) override;
+  template <typename T>
+  bool checkBXCollection(const T*&);

 private:
   edm::EDGetTokenT<l1t::MuonBxCollection> ugmtMuonToken_;
diff --git a/DQM/L1TMonitor/src/L1TObjectsTiming.cc b/DQM/L1TMonitor/src/L1TObjectsTiming.cc
index c8edfe65685..bb6cfefa67f 100644
--- a/DQM/L1TMonitor/src/L1TObjectsTiming.cc
+++ b/DQM/L1TMonitor/src/L1TObjectsTiming.cc
@@ -1,4 +1,5 @@
 #include "DQM/L1TMonitor/interface/L1TObjectsTiming.h"
+#include "FWCore/Utilities/interface/TypeDemangler.h"

 L1TObjectsTiming::L1TObjectsTiming(const edm::ParameterSet& ps)
     : ugmtMuonToken_(consumes<l1t::MuonBxCollection>(ps.getParameter<edm::InputTag>("muonProducer"))),
@@ -783,25 +784,44 @@ void L1TObjectsTiming::bookHistograms(DQMStore::IBooker& ibooker, const edm::Run
   }
 }

+template <typename T>
+bool L1TObjectsTiming::checkBXCollection(const T*& BXCollection) {
+  // check that all the BX collections do not exceed the expected range
+  if (BXCollection->getLastBX() - BXCollection->getFirstBX() > int(bxrange_ - 1)) {
+    edm::LogError("L1TObjectsTiming") << " Unexpected bunch crossing range in "
+                                      << edm::typeDemangle(typeid(BXCollection).name())
+                                      << " lastBX() = " << BXCollection->getLastBX()
+                                      << " - firstBX() = " << BXCollection->getFirstBX();
+    return false;
+  } else {
+    return true;
+  }
+}
+
 void L1TObjectsTiming::analyze(const edm::Event& e, const edm::EventSetup& c) {
   if (verbose_)
     edm::LogInfo("L1TObjectsTiming") << "L1TObjectsTiming: analyze..." << std::endl;

   // Muon Collection
-  edm::Handle<l1t::MuonBxCollection> MuonBxCollection;
-  e.getByToken(ugmtMuonToken_, MuonBxCollection);
+  const l1t::MuonBxCollection* MuonBxCollection = &e.get(ugmtMuonToken_);
+  if (!checkBXCollection(MuonBxCollection))
+    return;
   // Jet Collection
-  edm::Handle<l1t::JetBxCollection> JetBxCollection;
-  e.getByToken(stage2CaloLayer2JetToken_, JetBxCollection);
+  const l1t::JetBxCollection* JetBxCollection = &e.get(stage2CaloLayer2JetToken_);
+  if (!checkBXCollection(JetBxCollection))
+    return;
   // EGamma Collection
-  edm::Handle<l1t::EGammaBxCollection> EGammaBxCollection;
-  e.getByToken(stage2CaloLayer2EGammaToken_, EGammaBxCollection);
+  const l1t::EGammaBxCollection* EGammaBxCollection = &e.get(stage2CaloLayer2EGammaToken_);
+  if (!checkBXCollection(EGammaBxCollection))
+    return;
   // Tau Collection
-  edm::Handle<l1t::TauBxCollection> TauBxCollection;
-  e.getByToken(stage2CaloLayer2TauToken_, TauBxCollection);
+  const l1t::TauBxCollection* TauBxCollection = &e.get(stage2CaloLayer2TauToken_);
+  if (!checkBXCollection(TauBxCollection))
+    return;
   // EtSum Collection
-  edm::Handle<l1t::EtSumBxCollection> EtSumBxCollection;
-  e.getByToken(stage2CaloLayer2EtSumToken_, EtSumBxCollection);
+  const l1t::EtSumBxCollection* EtSumBxCollection = &e.get(stage2CaloLayer2EtSumToken_);
+  if (!checkBXCollection(EtSumBxCollection))
+    return;

   // Open uGT readout record
   edm::Handle<GlobalAlgBlkBxCollection> uGtAlgs;
aloeliger commented 1 year ago

@mmusich I'll treat this as a priority tomorrow.

aloeliger commented 1 year ago

Okay, a quick 15-minute dissection shows that the crashing BXVector here is (at least initially) the muon BXVector. It has 22 BXs in it (-10 to 11, instead of the expected 5), and is quite (suspiciously?) full in general:

Size per BX: -10: 1 -9: 2 -8: 2 -7: 2 -6: 2 -5: 2 -4: 2 -3: 2 -2: 2 -1: 2 0: 4 1: 2 2: 2 3: 2 4: 2 5: 2 6: 1 7: 2 8: 2 9: 2 10: 1 11: 1

It is also worth noting that under this setup we do get a previous event, which shows the proper 5 BXs:

Size per BX: -2: 0 -1: 2 0: 0 1: 0 2: 0

My knee-jerk reaction seeing this is that it is data corruption, perhaps related to the ongoing uGT packer/unpacker problem, but I have no definitive proof of that.

The exact crash of course occurs here,

https://github.com/cms-sw/cmssw/blob/e2d38115a927b9dcaceeeacbcb209605eb0d60e7/DQM/L1TMonitor/src/L1TObjectsTiming.cc#L818-L819

It finds some theoretically valid muon in BX 7:

Muon in collection that passes qual cuts Muon->pt(): 61 Muon->hwQual(): 12

and then comes up with an "index" of 9 for it in the vector of histograms in that arcane equation there. It then tries to access muons_eta_phi at this index, which was previously created in this loop here:

https://github.com/cms-sw/cmssw/blob/e2d38115a927b9dcaceeeacbcb209605eb0d60e7/DQM/L1TMonitor/src/L1TObjectsTiming.cc#L122-L131

Coincidentally, bxrange_ is a const hard-coded to 5 in the member initializers here:

https://github.com/cms-sw/cmssw/blob/e2d38115a927b9dcaceeeacbcb209605eb0d60e7/DQM/L1TMonitor/src/L1TObjectsTiming.cc#L3-L34

instead of being some constexpr or something.

And when it tries to access, at index 9, a vector sized for a bxrange of 5, you get the usual C++ vector range error.

This will therefore crash (and crash uninformatively) on anything with BXVectors containing BXs outside the expected 5. As an intermediate step, I could test each of the input BXVectors to make sure they are in the expected range and throw a better cms::exception making these assumptions clear, but I can't yet explain where the messed-up BXVector came from. @mmusich, would it be a good idea to insert such a test/exception into this?

the collection these muons are loaded out of is defined by muonProducer here:

https://github.com/cms-sw/cmssw/blob/ecf1aaec06678b04066926b8521879a29da4bdc2/DQM/L1TMonitor/python/L1TObjectsTiming_cfi.py#L4-L7

which is pretty much directly the gtStage2 digis. I could try to trace it back further, but that is pretty much just the raw to digi step at that point:

https://github.com/cms-sw/cmssw/blob/ecf1aaec06678b04066926b8521879a29da4bdc2/EventFilter/L1TRawToDigi/python/gtStage2Digis_cfi.py#L3-L8

(which is some evidence for corrupted data)

mmusich commented 1 year ago

@aloeliger

would it be a good idea to insert such a test/exception into this?

I think so. Incidentally that's (almost) what I was proposing at https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1546902922. Though I (personally) wouldn't crash the entire Prompt Reco because of a mismatch with the expected data format in a DQM module. Emitting a LogError might be better suited for this case.

aloeliger commented 1 year ago

@mmusich forgive my inexperience on this but are cms::exceptions catchable/ignorable, or can you force the event to skip at the configuration level?

I'm not sure we want to continue computation on some event we know has corrupted uGT data. We're just asking to crash somewhere else at that point, or continuing processing on junk.

mmusich commented 1 year ago

or can you force the event to skip at the configuration level?

this is not what happens at Tier-0 and in general it's not a good policy because of the reasons explained in the thread at https://github.com/cms-sw/cmssw/issues/41512. Either the data is so corrupted that an exception is thrown (with consequent complete disruption of the job) or you keep processing the junk (in your words). The question is whether the data corruption discussed here is severe enough to declare the event completely unusable for physics. If not (from the tests I did circumventing the crash, the processing seems to continue without consequences), it would be better to handle the corrupt data at the unpacker level, create empty collections, and let the clients deal with it ~ naturally.

fwyzard commented 1 year ago

FWIW I agree that it would be better to consider the event as bad and skip it.

It is possible to throw a cms exception and configure the job to skip the event. However it's not obvious how to avoid writing a partially reconstructed event to the output files.

Maybe the Core Software group has some suggestions.

.A

aloeliger commented 1 year ago

Well, the L1 packer/unpacker pretty clearly needs some robustness upgrades in any case, but that is a longer term proposition.

I can handle the immediate fallout and solution (logging an error and returning out, or throwing a defined exception) in any way the software and computing experts deem most appropriate.

mmusich commented 1 year ago

@cms-sw/core-l2

I can handle the immediate fallout and solution (logging an error and returning out, or throwing a defined exception) in any way the software and computing experts deem most appropriate.

please advise, also in view of https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1547782124

fwyzard commented 1 year ago

Either the data is so corrupted that an exception is thrown (with consequent complete disruption of the job) or you keep processing the junk (in your words). The question is whether the data corruption discussed here is severe enough to declare the event completely unusable for physics. If not (from the tests I did circumventing the crash, the processing seems to continue without consequences), it would be better to handle the corrupt data at the unpacker level, create empty collections, and let the clients deal with it ~ naturally.

The problem with this approach is that the main consumer of the L1T information is the HLT, and the HLT will not be re-run during or after offline processing.

So the information that the HLT used to route the event through the various paths is likely wrong, and the event should be dropped, because it cannot be properly accounted for in the trigger purity/efficiency measurements.

mmusich commented 1 year ago

So the information that the HLT used to route the event through the various paths is likely wrong, and the event should be dropped, because it cannot be properly accounted for in the trigger purity/efficiency measurements.

We agree, but then planning an exception in a DQM module that runs in Prompt (2 days after data is taken) is less than useless.

makortel commented 1 year ago

@cms-sw/core-l2

I can handle the immediate fallout and solution (logging an error and returning out, or throwing a defined exception) in any way the software and computing experts deem most appropriate.

please advise, also in view of #41645 (comment)

Our approach so far has been to not throw exceptions if the processing can otherwise continue. To me the best option would be for the unpacker to produce a data product that conveys, in some way, the information that the raw data was corrupt. Then everything downstream can easily ignore it if necessary, and e.g. the HLT could even have a filter rejecting the event.

While we do have a facility to "skip events for which specific exceptions are thrown", as discussed in https://github.com/cms-sw/cmssw/issues/41512, it has not been battle-tested, and generally exceptions should not be used for control flow. In principle

process.options.SkipEvent = cms.untracked.vstring("<exception category>")

should do the job. Based on some experimentation, if the OutputModule consumes the data product from the module that threw the exception, the OutputModule seems to automatically skip the event. If there is no data dependence chain from the throwing module to the OutputModule (which I believe is the case here given the exception coming from a DQM module), the OutputModule needs to be somehow instructed to act based on the Path where the module is. This could be OutputModule.SelectEvents.SelectEvents or inserting TriggerResultsFilter in front of the OutputModule in the EndPath.
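A minimal sketch of the pieces described above, assuming a `process` object is already in scope (e.g. loaded from the pickled PSet as in the reproducer earlier in this thread); the exception category "CorruptL1TData" and the OutputModule label `output` are hypothetical placeholders, not names from the actual Tier-0 configuration:

```python
import FWCore.ParameterSet.Config as cms

# Skip events for which a module throws an exception of this
# (hypothetical) category, instead of letting it kill the whole job:
process.options.SkipEvent = cms.untracked.vstring("CorruptL1TData")

# When there is no data-dependence chain from the throwing DQM module to
# the OutputModule, tell the OutputModule to follow the path outcome, so
# events rejected on the path are not written out:
process.output.SelectEvents = cms.untracked.PSet(
    SelectEvents = cms.vstring("dqmoffline_1_step")
)
```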

aloeliger commented 1 year ago

Well, it isn't a fast solution, but I can take a look at the unpacker and see if there's a way to provide proper null sets out of the unpacker if the data doesn't pass a few basic quality tests.

gpetruc commented 1 year ago

Since this is corrupt data of just the muons in the GT record, I would agree that the best strategy would be to have the GT unpacker detect the error, report it (in a place where a human will eventually notice it - is anyone really checking the LogErrors in the certification workflow?), and produce an empty GT muon collection (no need to fail the other non-muon triggers because of this)

aloeliger commented 1 year ago

Since this is corrupt data of just the muons in the GT record, I would agree that the best strategy would be to have the GT unpacker detect the error, report it (in a place where a human will eventually notice it - is anyone really checking the LogErrors in the certification workflow?), and produce an empty GT muon collection (no need to fail the other non-muon triggers because of this)

It may be just the muons. The muons were simply the first thing that crashed. I can check other collections for corruption too.

aloeliger commented 1 year ago

@cms-sw/core-l2

I can handle the immediate fallout and solution (logging an error and returning out, or throwing a defined exception) in any way the software and computing experts deem most appropriate.

please advise, also in view of #41645 (comment)

Our approach so far has been to not throw exceptions if the processing can otherwise continue. To me the best option would be for the unpacker to produce a data product that conveys, in some way, the information that the raw data was corrupt. Then everything downstream can easily ignore it if necessary, and e.g. the HLT could even have a filter rejecting the event.

While we do have a facility to "skip events for which specific exceptions are thrown", as discussed in #41512, it has not been battle-tested, and generally exceptions should not be used for control flow. In principle

process.options.SkipEvent = cms.untracked.vstring("<exception category>")

should do the job. Based on some experimentation, if the OutputModule consumes the data product from the module that threw the exception, the OutputModule seems to automatically skip the event. If there is no data dependence chain from the throwing module to the OutputModule (which I believe is the case here given the exception coming from a DQM module), the OutputModule needs to be somehow instructed to act based on the Path where the module is. This could be OutputModule.SelectEvents.SelectEvents or inserting TriggerResultsFilter in front of the OutputModule in the EndPath.

@makortel I didn't have a lot of time to think about this yesterday, but could I borrow your judgement on how to go about unpacker modifications for this?

If I go through the various GT unpackers and add a step where each checks its own output (say, for having 5 BXs) as a test of whether the data is corrupted, I could conceivably log an error and have it output an empty BXVector instead, which is unlikely to disturb anything downstream of it. But I'm not sure, without looking at the log errors by eye, how one catches this and distinguishes such an event as a problem, rather than an event with some genuinely empty vector. I agree with @gpetruc that someone really does need to catch this without disturbing other elements of the reconstruction, but I am concerned that simply outputting empty collections may help bury the problem rather than catching it, alerting, and gracefully continuing.

I entertained the idea of some module-in-the-middle solution that would run after the unpackers in the RawToDigi sequence, inspect the output for potential data corruption, and insert a bit into the event that downstream clients could use to flag corruption, or that the HLT could filter on. But I imagine this is more trouble than it is worth: it involves sensitive modifications to everything reliant on the RawToDigi step, and it is still not truly a solution to the issue of something downstream attempting to process bad data. I would also be terrified of potentially introducing a mistaken 100%-removal filter into the process through a bad configuration of this.

The existing muon unpacker responsible for GT muons currently contains some corruption checks that are designed to throw exceptions:

https://github.com/cms-sw/cmssw/blob/96423893105e2482c0882f8f0e5fbc61d745605e/EventFilter/L1TRawToDigi/plugins/implementations_stage2/MuonUnpacker.cc#L13-L58

... but from the sounds of it, throwing exceptions is pretty much off the table at this point, given that the data has already made it to this stage?

makortel commented 1 year ago

How is data corruption handled in unpackers of other sub-detectors? (bringing in @cms-sw/reconstruction-l2)

makortel commented 1 year ago

After discussing with @Dr15Jones we came to the conclusion that from the framework point of view using "empty containers" is not a good way to communicate processing failures, because there is no way to distinguish that from the container being empty for "legitimate reasons". It would be better if the consuming modules would be able to ask questions like "is the input data product valid" and "if the input data is not valid, why is it not". In principle the edm::Handle already provides an interface for a consuming module to ask such questions (isValid(), whyFailed()), but we currently use those to communicate only that the data product is missing.

If there is interest in the following kind of behavior, we could look into evolving the SkipEvent exception handling behavior in the framework accordingly:

In the meantime, similar behavior can to a large degree be accomplished by:

aloeliger commented 1 year ago

Okay. Then, if I am understanding correctly, L1 needs to:

  1. Modify the GT unpacker to check for corruption, if it finds it, do not put or emplace anything into the event
  2. Provide a filter that can be run with the unpacker, that checks for the data product being there. If it is not found, the filter fails, filtering the rest of the path
  3. Provide a producer/filter designed to copy the products of the GT unpacker, if they exist, otherwise create default ones.
  4. Insert these into GT unpacker dependent sequences

In the long run the first set of things does seem ideal to me for this sort of situation, but I would like to help with the immediate solution first.
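The four steps above could be wired together roughly as sketched below. This is only an illustration of the plan, not working code: the plugin names (L1TGtDigisExistenceFilter, L1TGtDigisCopyOrDefault), the module labels, and the parameter names are hypothetical placeholders, and gtStage2Digis stands in for the existing GT unpacker module:

```python
import FWCore.ParameterSet.Config as cms

# (2) filter that fails the path when the unpacker, having detected
# corruption, put nothing into the event (hypothetical plugin name)
gtDigisSanityFilter = cms.EDFilter(
    "L1TGtDigisExistenceFilter",
    gtDigis = cms.InputTag("gtStage2Digis"),
)

# (3) producer that copies the GT digis if present, otherwise creates
# default (empty) ones (hypothetical plugin name)
gtDigisOrDefault = cms.EDProducer(
    "L1TGtDigisCopyOrDefault",
    gtDigis = cms.InputTag("gtStage2Digis"),
)

# (4) insert both into the GT-unpacker-dependent sequence; downstream
# consumers that must never see a missing product would then read from
# "gtDigisOrDefault" instead of "gtStage2Digis"
rawToDigiPath = cms.Path(gtStage2Digis + gtDigisSanityFilter + gtDigisOrDefault)
```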

aloeliger commented 1 year ago

@mmusich @makortel Following what I think was the desired short-term solution, I have created a filter for the GT digis that checks for corruption (right now defined only as output BX vectors whose size differs from a configured size), and that also attempts to produce either an empty BXVector if corruption is detected, or an identical BX vector if it is not. The EDFilter returns false when any corruption of this kind is detected, and true when there is none.

@makortel my understanding is that this filter will then need to be inserted after the gt unpacking step, and anything reliant on the GT digis will need to instead be told to get their information from this?

I will attempt to test this on the current problem recipe in any case.

fwyzard commented 1 year ago

Hi @aloeliger, my impression is that, if you wrote an EDFilter, making a copy of the collections is redundant:

So, wouldn't it be enough to have an EDFilter that returns true or false and does not produce anything?

makortel commented 1 year ago

So, wouldn't it be enough to have an EDFilter that returns true or false and does not produce anything?

The use case for the "copied-over-or-dummy" product is modules consuming the data product that are in Tasks (i.e. unscheduled), or that are in Paths but whose execution is (for whatever reason) not controlled by the said filter; these modules cannot handle a missing input data product without throwing an exception, and the desired behavior is for them not to throw the missing-data-product exception (i.e. emulating the behavior outlined in https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1554925070).

makortel commented 1 year ago

my understanding is that this filter will then need to be inserted after the gt unpacking step, and anything reliant on the GT digis will need to instead be told to get their information from this?

Correct. Although from the framework perspective it would be better if the actual producer of GT digis did not produce the data product in case of error, because then the error condition is clearer for all consumers of the data product.

Do I assume correctly that the GT digis are not stored as such on any data tier?

missirol commented 1 year ago

Do I assume correctly that the GT digis are not stored as such on any data tier?

Fwiw, HLT currently stores them in the output of certain data streams (looking for keep *_hltGtStage2Digis_*_* in OnLine_HLT_GRun.py).

makortel commented 1 year ago

Do I assume correctly that the GT digis are not stored as such on any data tier?

Fwiw, HLT currently stores them in the output of certain data streams (looking for keep *_hltGtStage2Digis_*_* in OnLine_HLT_GRun.py).

Thanks. I suppose in the HLT context the affected OutputModules are already configured such that when the aforementioned EDFilter is added to the Sequence containing hltGtStage2Digis, the OutputModules will not store the events that the EDFilter rejects. Is this assumption correct?

I also see them being stored as part of HLTDebugRAW (https://github.com/cms-sw/cmssw/blob/a3479e2edf6588f3227d707f8db676aa2ab59cbe/HLTrigger/Configuration/python/HLTrigger_EventContent_cff.py#L167) and HLTDebugFEVT (https://github.com/cms-sw/cmssw/blob/a3479e2edf6588f3227d707f8db676aa2ab59cbe/HLTrigger/Configuration/python/HLTrigger_EventContent_cff.py#L501). I was asking because for any non-temporary storage of these data products it would be cleanest to record the failure by storing the "not-produced" product, and I was wondering how much of this use case needs to be covered.

makortel commented 1 year ago

If there is interest in the following kind of behavior, we could look into evolving the SkipEvent exception handling behavior in the framework accordingly: (details in https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1554925070)

Does anyone have any feedback on this potential evolution of SkipEvent?

aloeliger commented 1 year ago

If there is interest in the following kind of behavior, we could look into evolving the SkipEvent exception handling behavior in the framework accordingly: (details in #41645 (comment))

Does anyone have any feedback on this potential evolution of SkipEvent?

For whatever my relatively inexpert opinion is worth, I think it's a good idea for exactly these cases. This event crashes, and crashes for a good reason. However, we should be able to fail individual events without crashing entire processes.

missirol commented 1 year ago

Thanks. I suppose in the HLT context the affected OutputModules are already configured such that when the aforementioned EDFilter is added to the Sequence containing hltGtStage2Digis, the OutputModules will not store the events that the EDFilter rejects. Is this assumption correct?

In my understanding, it is.

But I'll add that I still have to study this issue, and it's not clear to me what the exact changes to the HLT menu would be because of it (esp. when it comes to the "copied-over-or-dummy product"). The "aforementioned EDFilter" sounds similar in purpose to the workaround tested in [1].

On top of the general issue, it would be useful (to me) to understand better some aspects that are specific to the L1T unpacker.

[1] CMSHLT-2793 (comment)

I also see them being stored as part of HLTDebugRAW

https://github.com/cms-sw/cmssw/blob/a3479e2edf6588f3227d707f8db676aa2ab59cbe/HLTrigger/Configuration/python/HLTrigger_EventContent_cff.py#L167

and HLTDebugFEVT

https://github.com/cms-sw/cmssw/blob/a3479e2edf6588f3227d707f8db676aa2ab59cbe/HLTrigger/Configuration/python/HLTrigger_EventContent_cff.py#L501

Right, thanks for pointing this out.

aloeliger commented 1 year ago

@mmusich is the file with this error still available to test the new filter on?

makortel commented 1 year ago

So, wouldn't it be enough to have an EDFilter that returns true or false and does not produce anything?

The use case for the "copied-over-or-dummy" product is modules consuming the data product that are in Tasks (i.e. unscheduled), or that are in Paths but whose execution is (for whatever reason) not controlled by the said filter; these modules cannot handle a missing input data product without throwing an exception, and the desired behavior is for them not to throw the missing-data-product exception (i.e. emulating the behavior outlined in #41645 (comment)).

I should add that the set of consuming modules that one needs to worry about includes all modules in EndPaths (whose execution is not controlled by TriggerResultsFilter that would reject the event that would have the missing data product).

aloeliger commented 1 year ago

@mmusich I just want to ping you again. The original method for recreating the error seems to no longer work because it can no longer find the error file on EOS. Is this still available for testing the short-term solution?

mmusich commented 1 year ago

@aloeliger

The original method for recreating the error seems to no longer work because it can no longer find the error file on EOS. Is this still available for testing the short-term solution?

It seems you are not following the cmsTalk thread linked in the original issue description. Please see https://cms-talk.web.cern.ch/t/fatal-exception-in-prompt-reco-of-run-367232-datatset-jetmet0/23996/6

aloeliger commented 1 year ago

@aloeliger

The original method for recreating the error seems to no longer work because it can no longer find the error file on EOS. Is this still available for testing the short-term solution?

It seems you are not following the cmsTalk thread linked in the original issue description. Please see https://cms-talk.web.cern.ch/t/fatal-exception-in-prompt-reco-of-run-367232-datatset-jetmet0/23996/6

My apologies for my absence, I was at US CMS last week. Thanks for providing this. I will start trying to test with the stored tarballs ASAP.

aloeliger commented 1 year ago

@mmusich Apologies, but I am still running into issues trying to recreate the error so I can test my solution. The tarball does not include the file that the original parameter set/process runs on. Is the file root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2023C/JetMET0/RAW/v1/000/367/232/00000/9144de83-470d-499d-8973-700975c88eba.root still available somewhere I should be aware of?

mmusich commented 1 year ago

@aloeliger you'll have to try to reproduce with what is there, or ask @germanfgv for clarifications.

mmusich commented 1 year ago

apparently it's only on tape:

$ dasgoclient -query='site file=/store/data/Run2023C/JetMET0/RAW/v1/000/367/232/00000/9144de83-470d-499d-8973-700975c88eba.root'
T0_CH_CERN_Tape
T1_US_FNAL_Tape

It should be trivial to make a Rucio rule to get it back, though.

aloeliger commented 1 year ago

The Rucio rule should be trivial... but whether I have enough goodwill left at Wisconsin to get them to put this onto their disk after the stunts I've pulled is another story altogether. I'll see what I can do.

aloeliger commented 1 year ago

I take it back... the block that contains this is /JetMET0/Run2023C-v1/RAW#b852026e-a127-4075-9b0a-90c0bdb1d5e4, which is 2.5 TB, which is a little much any way you slice it to pull off of tape for a single debugging file. I don't suppose there is some secret debugging method of getting single files off of tape? Alternatively, I could just open a draft PR for the draft of the GT-unpacker-inspecting module if we want to work without an explicit test on it.

perrotta commented 1 year ago

Is anybody able to test if the fix in #42214 can solve this issue?

mmusich commented 1 year ago

Is anybody able to test if the fix in https://github.com/cms-sw/cmssw/pull/42214 can solve this issue?

Apparently it's not straightforward to get back the RAW data for the Tier-0 job that failed, see https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1576913902 . If we are OK with staging the 2.5 TB block at CERN, it should be straightforward to check.

aloeliger commented 1 year ago

It may solve the immediate problem, but the point remains that the vector shouldn't have existed in the first place. I think there is some evidence at this point that something is going wrong in one of the muon unpackers. I would be hesitant to merge this solution and call anything "fixed".