Open mmusich opened 1 year ago
assign dqm, l1
New categories assigned: dqm,l1
@tjavaid,@epalencia,@micsucmed,@nothingface0,@rvenditti,@emanueleusai,@syuvivida,@aloeliger,@cecilecaillol,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @mmusich Marco Musich.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
With a bit of offline diagnosis, the crash occurs here: index equals 9, but muons_eta_phi.size() is 5.
This apparently happens because in that event (Run 367232, Event 378449946, LumiSection 190) MuonBxCollection->getFirstBX() is -10 and MuonBxCollection->getLastBX() is 11, while in other events the range seems to be constrained between -2 and 2.
There is now a second Prompt Reco paused job with same symptoms
urgent
Also the second crash is reproducible under CMSSW_13_0_5_patch1 on lxplus8 with the following PSet.py:

```python
import FWCore.ParameterSet.Config as cms
import pickle

with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.options.numberOfThreads = 1
process.source.skipEvents = cms.untracked.uint32(680)
```
(taking the rest of the configuration here)
The exception message is:

```
----- Begin Fatal Exception 13-May-2023 18:36:50 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 367315 lumi: 40 event: 94344670 stream: 0
[1] Running path 'dqmoffline_1_step'
[2] Calling method for module L1TObjectsTiming/'l1tObjectsTiming'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 18446744073709551605) >= this->size() (which is 5)
----- End Fatal Exception -------------------------------------------------
```
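For what it's worth, 18446744073709551605 is simply a negative index (-11 here) after conversion to the unsigned 64-bit size_t that vector::at() uses for its range check:

```cpp
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
  int index = -11;         // an out-of-range, BX-derived index
  std::size_t n = index;   // wraps to 2^64 - 11 on a 64-bit platform
  std::cout << n << '\n';  // prints 18446744073709551605
  std::vector<int> v(5);
  try {
    v.at(n);  // throws std::out_of_range, producing the message in the log above
  } catch (const std::exception& ex) {
    std::cout << ex.what() << '\n';
  }
}
```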
Technically this change circumvents the crashes, though it would be good if the L1T / DQM experts find the root cause.
```diff
diff --git a/DQM/L1TMonitor/interface/L1TObjectsTiming.h b/DQM/L1TMonitor/interface/L1TObjectsTiming.h
index d39f6a4fe96..42b3ff4616e 100644
--- a/DQM/L1TMonitor/interface/L1TObjectsTiming.h
+++ b/DQM/L1TMonitor/interface/L1TObjectsTiming.h
@@ -38,6 +38,8 @@ protected:
   void dqmBeginRun(const edm::Run&, const edm::EventSetup&) override;
   void bookHistograms(DQMStore::IBooker&, const edm::Run&, const edm::EventSetup&) override;
   void analyze(const edm::Event&, const edm::EventSetup&) override;
+  template <typename T>
+  bool checkBXCollection(const T*&);
 
 private:
   edm::EDGetTokenT<l1t::MuonBxCollection> ugmtMuonToken_;
diff --git a/DQM/L1TMonitor/src/L1TObjectsTiming.cc b/DQM/L1TMonitor/src/L1TObjectsTiming.cc
index c8edfe65685..bb6cfefa67f 100644
--- a/DQM/L1TMonitor/src/L1TObjectsTiming.cc
+++ b/DQM/L1TMonitor/src/L1TObjectsTiming.cc
@@ -1,4 +1,5 @@
 #include "DQM/L1TMonitor/interface/L1TObjectsTiming.h"
+#include "FWCore/Utilities/interface/TypeDemangler.h"
 
 L1TObjectsTiming::L1TObjectsTiming(const edm::ParameterSet& ps)
     : ugmtMuonToken_(consumes<l1t::MuonBxCollection>(ps.getParameter<edm::InputTag>("muonProducer"))),
@@ -783,25 +784,44 @@ void L1TObjectsTiming::bookHistograms(DQMStore::IBooker& ibooker, const edm::Run
   }
 }
 
+template <typename T>
+bool L1TObjectsTiming::checkBXCollection(const T*& BXCollection) {
+  // check that all the BX collections do not exceed the expected range
+  if (BXCollection->getLastBX() - BXCollection->getFirstBX() > int(bxrange_ - 1)) {
+    edm::LogError("L1TObjectsTiming") << " Unexpected bunch crossing range in "
+                                      << edm::typeDemangle(typeid(BXCollection).name())
+                                      << " lastBX() = " << BXCollection->getLastBX()
+                                      << " - firstBX() = " << BXCollection->getFirstBX();
+    return false;
+  } else {
+    return true;
+  }
+}
+
 void L1TObjectsTiming::analyze(const edm::Event& e, const edm::EventSetup& c) {
   if (verbose_)
     edm::LogInfo("L1TObjectsTiming") << "L1TObjectsTiming: analyze..." << std::endl;
 
   // Muon Collection
-  edm::Handle<l1t::MuonBxCollection> MuonBxCollection;
-  e.getByToken(ugmtMuonToken_, MuonBxCollection);
+  const l1t::MuonBxCollection* MuonBxCollection = &e.get(ugmtMuonToken_);
+  if (!checkBXCollection(MuonBxCollection))
+    return;
 
   // Jet Collection
-  edm::Handle<l1t::JetBxCollection> JetBxCollection;
-  e.getByToken(stage2CaloLayer2JetToken_, JetBxCollection);
+  const l1t::JetBxCollection* JetBxCollection = &e.get(stage2CaloLayer2JetToken_);
+  if (!checkBXCollection(JetBxCollection))
+    return;
 
   // EGamma Collection
-  edm::Handle<l1t::EGammaBxCollection> EGammaBxCollection;
-  e.getByToken(stage2CaloLayer2EGammaToken_, EGammaBxCollection);
+  const l1t::EGammaBxCollection* EGammaBxCollection = &e.get(stage2CaloLayer2EGammaToken_);
+  if (!checkBXCollection(EGammaBxCollection))
+    return;
 
   // Tau Collection
-  edm::Handle<l1t::TauBxCollection> TauBxCollection;
-  e.getByToken(stage2CaloLayer2TauToken_, TauBxCollection);
+  const l1t::TauBxCollection* TauBxCollection = &e.get(stage2CaloLayer2TauToken_);
+  if (!checkBXCollection(TauBxCollection))
+    return;
 
   // EtSum Collection
-  edm::Handle<l1t::EtSumBxCollection> EtSumBxCollection;
-  e.getByToken(stage2CaloLayer2EtSumToken_, EtSumBxCollection);
+  const l1t::EtSumBxCollection* EtSumBxCollection = &e.get(stage2CaloLayer2EtSumToken_);
+  if (!checkBXCollection(EtSumBxCollection))
+    return;
 
   // Open uGT readout record
   edm::Handle<GlobalAlgBlkBxCollection> uGtAlgs;
```
@mmusich I'll treat this as a priority tomorrow.
Okay, a quick 15-minute dissection shows that the crashing BXVector here is (at least initially) the muon BXVector. It has 22 BXs in it (-10 to 11, instead of the expected 5), and is quite (suspiciously?) full in general:

```
Size per BX: -10: 1 -9: 2 -8: 2 -7: 2 -6: 2 -5: 2 -4: 2 -3: 2 -2: 2 -1: 2 0: 4 1: 2 2: 2 3: 2 4: 2 5: 2 6: 1 7: 2 8: 2 9: 2 10: 1 11: 1
```

It is also worth noting that under this set-up we do get a previous event, which shows the proper 5 BXs:

```
Size per BX: -2: 0 -1: 2 0: 0 1: 0 2: 0
```
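(A dump like the above can be produced with a short loop over the BXVector accessors; this is a sketch, not necessarily the exact code used here.)

```cpp
#include <iostream>

#include "DataFormats/L1Trigger/interface/Muon.h"

// Sketch: print the object count for every bunch crossing in a BXVector.
void dumpSizePerBX(const l1t::MuonBxCollection& muons) {
  std::cout << "Size per BX:";
  for (int bx = muons.getFirstBX(); bx <= muons.getLastBX(); ++bx) {
    std::cout << ' ' << bx << ": " << muons.size(bx);
  }
  std::cout << '\n';
}
```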
My knee-jerk reaction seeing this is that it is data corruption, perhaps related to the ongoing uGT packer/unpacker problem, but I have no definitive proof of that.
The exact crash of course occurs here. It finds some theoretically valid muon in BX 7:

```
Muon in collection that passes qual cuts Muon->pt(): 61 Muon->hwQual(): 12
```

and then comes up with an "index" of 9 for it in the vector of histograms, via that arcane equation there. It then tries to access muons_eta_phi at this index, which was previously created in this loop here. Coincidentally, bxrange_ is a const hard-coded to 5 in the member initializers here, instead of being some constexpr or something. And when it tries to access, at index 9, a location in a vector sized by a bxrange of 5, you get the usual C++ vector issue. This will therefore crash (and crash uninformatively) on anything with BXVectors containing BXs outside the expected 5.
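In spirit the mapping is just an offset from the start of the expected BX window, so any BX outside -2..2 overruns the 5-element histogram vector; a toy illustration (the simple offset form here is a simplification of the module's actual expression):

```cpp
#include <cstdio>

int main() {
  const int bxrange = 5;                       // hard-coded in L1TObjectsTiming
  const int firstExpectedBX = -(bxrange / 2);  // -2, so BXs -2..2 map to indices 0..4
  const int bxs[] = {-2, 0, 2, 7};
  for (int bx : bxs) {
    const int index = bx - firstExpectedBX;  // BX 7 -> index 9, past the end of a size-5 vector
    std::printf("BX %d -> index %d %s\n", bx, index,
                (index >= 0 && index < bxrange) ? "(ok)" : "(out of range!)");
  }
}
```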
As an intermediate step, I could test each of the input BXVectors to make sure they are in the expected range, and throw a better cms::Exception making the assumptions of this clear, but I can't explain where the messed-up BXVector came from yet. @mmusich would it be a good idea to insert such a test/exception into this?
The collection these muons are loaded out of is defined by muonProducer here, which is pretty much directly the gtStage2 digis. I could try to trace it back further, but that is pretty much just the raw-to-digi step at that point (which is some evidence for corrupted data).
@aloeliger

> would it be a good idea to insert such a test/exception into this?

I think so. Incidentally that's (almost) what I was proposing at https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1546902922. Though I (personally) wouldn't crash the entire Prompt Reco because of a mismatch with the expected data format in a DQM module. Emitting a LogError might be better suited for this case.
@mmusich forgive my inexperience on this, but are cms::Exceptions catchable/ignorable, or can you force the event to skip at the configuration level? I'm not sure we want to continue computation on some event we know has corrupted uGT data. We're just asking to crash somewhere else at that point, or to continue processing on junk.
> or can you force the event to skip at the configuration level?

This is not what happens at Tier-0, and in general it's not a good policy, for the reasons explained in the thread at https://github.com/cms-sw/cmssw/issues/41512. Either the data is so corrupted that an exception is thrown (with consequent complete disruption of the job), or you keep processing the junk (as you put it). The question is whether the data corruption discussed here is severe enough to declare the event completely unusable for physics. If not (from the tests I did circumventing the crash, the processing seems to just continue without consequences), it would be better to handle the corrupt data at the unpacker level, create empty collections, and let the clients deal with it more or less naturally.
FWIW I agree that it would be better to consider the event as bad and skip it.
It is possible to throw a cms exception and configure the job to skip the event. However it's not obvious how to avoid writing a partially reconstructed event to the output files.
Maybe the Core Software group has some suggestions.
Well, the L1 packer/unpacker pretty clearly needs some robustness upgrades in any case, but that is a longer term proposition.
I can handle the immediate fallout and solution (logging an error and returning out, or throwing a defined exception) in any way the software and computing experts deem most appropriate.
@cms-sw/core-l2
please advise, also in view of https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1547782124
> it would be better to perhaps handle the corrupt data at the unpacker level and create empty collections and let the clients deal with it more or less naturally
The problem with this approach is that the main consumer of the L1T information is the HLT, and the HLT will not be re-run during or after offline processing. So the information that the HLT used to route the event through the various paths is likely wrong, and the event should be dropped, because it cannot be properly accounted for in the trigger purity/efficiency measurements.
We agree, but then planning an exception in a DQM module that runs in Prompt Reco (2 days after the data is taken) is less than useless.
Our approach so far has been to not throw exceptions if the processing can otherwise continue. To me the best option would be for the unpacker to produce a data product that in some way conveys the information that the raw data was corrupt. Then everything downstream can easily ignore it if necessary, and e.g. HLT could even have a filter rejecting the event.

While we do have a facility to "skip events for which specific exceptions are thrown", as discussed in https://github.com/cms-sw/cmssw/issues/41512, it has not been battle-tested, and generally exceptions should not be used for control flow. In principle

```python
process.options.SkipEvent = cms.untracked.vstring("<exception category>")
```

should do the job. Based on some experimentation, if the OutputModule consumes the data product from the module that threw the exception, the OutputModule seems to automatically skip the event. If there is no data-dependence chain from the throwing module to the OutputModule (which I believe is the case here, given the exception comes from a DQM module), the OutputModule needs to be somehow instructed to act based on the Path where the module is. This could be OutputModule.SelectEvents.SelectEvents, or inserting a TriggerResultsFilter in front of the OutputModule in the EndPath.
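For concreteness, the module side of that mechanism is just a categorized exception; a sketch, where the category string "L1TCorruptedRawData" is an invented example, and SkipEvent would have to be configured with that same string:

```cpp
#include "FWCore/Utilities/interface/Exception.h"

// Sketch: throw an exception whose category a SkipEvent configuration can match.
void throwIfBadBxRange(int firstBX, int lastBX, int expectedRange) {
  if (lastBX - firstBX + 1 > expectedRange) {
    throw cms::Exception("L1TCorruptedRawData")
        << "Unexpected BX range [" << firstBX << ", " << lastBX << "], expected at most "
        << expectedRange << " bunch crossings";
  }
}
```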
Well, it isn't a fast solution, but I can take a look at the unpacker and see if there's a way to provide proper null sets out of the unpacker if the data doesn't pass a few basic quality tests.
Since this is corrupt data of just the muons in the GT record, I would agree that the best strategy would be to have the GT unpacker detect the error, report it in a place where a human will eventually notice it (is anyone really checking the LogErrors in the certification workflow?), and produce an empty GT muon collection (no need to fail the other, non-muon triggers because of this).
It may be just the muons. The muons were simply the first thing that crashed. I can check other collections for corruption too.
@makortel I didn't have a lot of time to think about this yesterday, but could I borrow your judgement on how to go about unpacker modifications for this?
If I go through the various GT unpackers and add a step where each checks its own output (say, for having exactly 5 BXs) as a test of whether the data is corrupted, I could conceivably log an error and output an empty BXVector instead, which is unlikely to disturb anything downstream. But I'm not sure, without looking at the log errors by eye, how one catches this and distinguishes this event as a problem rather than an event with a genuinely empty vector. I agree with @gpetruc that someone really does need to catch this without disturbing other elements of the reconstruction, but I am concerned that simply outputting empty collections may help to bury the problem rather than catching it, alerting, and gracefully continuing.
I entertained the idea of some sort of module-in-the-middle solution that would run after the unpackers in the RawToDigi sequence to inspect for and catch any potential data corruption, and to insert a bit or something into the output that could be used to flag corruption, so that other downstream clients could catch it, or HLT could filter on it. But I imagine this is more trouble than it is worth: it involves sensitive modifications to everything reliant on the RawToDigi step, and it is also not truly a solution to the issue of some downstream client attempting to process bad data. I would also be terrified of potentially introducing a mistaken 100%-removal filter into the process through bad configuration of this.
The existing muon unpacker responsible for GT muons currently has some corruption checks that are designed to throw exceptions... but from the sounds of it, that is pretty much off the table at this point, the data having made it to this stage?
How is data corruption handled in unpackers of other sub-detectors? (bringing in @cms-sw/reconstruction-l2)
After discussing with @Dr15Jones we came to the conclusion that, from the framework point of view, using "empty containers" is not a good way to communicate processing failures, because there is no way to distinguish that from the container being empty for legitimate reasons. It would be better if the consuming modules were able to ask questions like "is the input data product valid?" and "if the input data is not valid, why not?". In principle the edm::Handle already provides an interface for a consuming module to ask such questions (isValid(), whyFailed()), but we currently use those to communicate only that the data product is missing.
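As an aside, the consumer-side check using that existing interface looks like this today (a sketch; the token and the surrounding module are assumed):

```cpp
#include "DataFormats/L1Trigger/interface/Muon.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/MessageLogger/interface/MessageLogger.h"

// Sketch: ask the Handle whether the product is valid and, if not, why.
void fillMuonPlots(const edm::Event& e, const edm::EDGetTokenT<l1t::MuonBxCollection>& token) {
  edm::Handle<l1t::MuonBxCollection> muons;
  e.getByToken(token, muons);
  if (!muons.isValid()) {
    // whyFailed() returns a shared_ptr to the cms::Exception describing the failure
    edm::LogWarning("MuonPlots") << "muon input unavailable: " << muons.whyFailed()->what();
    return;  // skip this event's plots instead of crashing
  }
  // ... fill histograms from *muons ...
}
```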
If there is interest in the following kind of behavior, we could look into evolving the SkipEvent exception handling behavior in the framework accordingly. If a module A throws an exception of a specific, configurable category, then:
- modules consuming A's data products (directly or transitively) would not be run, except as noted below;
- the Path(s) containing A (or any module depending on A) are marked as failed (as if A were a filter that rejected the event) and no further modules in the Path will be scheduled to run; modules that have already been scheduled and are independent of A will still be run;
- the reason for the failure would be available to consuming modules via edm::Handle::whyFailed();
- the failure would be recorded in TriggerResults (need to think which of "fail" or "error" would be better);
- EndPath modules that do not depend on A (or on any module depending on A) would be run normally;
- OutputModules consuming data from A would not store any data for the event where the exception was thrown (unless configured to store such events; although while writing this now I'm actually not sure which behavior would be best as default);
- OutputModules in a Path or EndPath with A would not store any data for the event where the exception was thrown.

In the mean time, similar behavior can to a large degree be accomplished by:
- the module A that wants to report the processing error not actually producing its data products for the "failing" event (i.e. not calling Event::put() or Event::emplace());
- a filter that passes if the product of A exists, and otherwise rejects the event;
- a producer that copies the products of A if they exist, and otherwise inserts default-constructed data products, for modules that consume A but whose scheduling is not controlled by the filter (i.e. they are in an independent Path or unscheduled). Such consuming modules are assumed to be able to deal with a default-constructed data product without throwing exceptions, but might throw a ProductNotFound exception if the product would be missing. (I don't personally like much of this part, but it does give a way to avoid changing the consuming modules to check if an input product is missing in a situation where the modules are scheduled irrespective of the filter.)

Okay. Then, if I am understanding correctly, L1 needs to:
- have the unpacker detect the corruption and, in that case, not put or emplace anything into the event;
- add a filter that rejects the event when the product is missing;
- add a producer that substitutes default-constructed collections for consumers whose scheduling the filter does not control.

In the long run the first set of things does seem ideal to me for this sort of situation, but I would like to help with the immediate solution first.
@mmusich @makortel Following what I think was the desired short-term solution, I have created a filter for the GT digis that checks for corruption (right now defined only as output BX vectors with a size different from a configured size), and that produces either an empty BXVector when corruption is detected or an identical BX vector when it is not. The EDFilter returns false when any corruption of this kind is detected, and true whenever there is none.
@makortel my understanding is that this filter will then need to be inserted after the gt unpacking step, and anything reliant on the GT digis will need to instead be told to get their information from this?
I will attempt to test this on the current problem recipe in any case.
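For illustration, such a filter might look roughly like the following (a sketch under my reading of the description above, not the actual PR; the class name, labels, and parameter are assumptions):

```cpp
#include <memory>
#include <utility>

#include "DataFormats/L1Trigger/interface/Muon.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/Framework/interface/stream/EDFilter.h"
#include "FWCore/MessageLogger/interface/MessageLogger.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

// Sketch: check the unpacked uGT muon BX range; emit a copy of the collection
// when it is sane, or an empty collection (plus a false filter decision) when not.
class L1TBXSanityFilterSketch : public edm::stream::EDFilter<> {
public:
  explicit L1TBXSanityFilterSketch(const edm::ParameterSet& ps)
      : src_(consumes<l1t::MuonBxCollection>(ps.getParameter<edm::InputTag>("src"))),
        expectedBxRange_(ps.getParameter<int>("expectedBxRange")) {
    produces<l1t::MuonBxCollection>();
  }

  bool filter(edm::Event& e, const edm::EventSetup&) override {
    const auto& muons = e.get(src_);
    const bool ok = (muons.getLastBX() - muons.getFirstBX() + 1) == expectedBxRange_;
    auto out = std::make_unique<l1t::MuonBxCollection>();
    if (ok) {
      *out = muons;  // pass the collection through unchanged
    } else {
      edm::LogError("L1TBXSanityFilterSketch")
          << "Unexpected BX range [" << muons.getFirstBX() << ", " << muons.getLastBX() << "]";
      // leave 'out' empty so unscheduled consumers still find a (harmless) product
    }
    e.put(std::move(out));
    return ok;  // false rejects the event on the Path containing this filter
  }

private:
  edm::EDGetTokenT<l1t::MuonBxCollection> src_;
  const int expectedBxRange_;
};

DEFINE_FWK_MODULE(L1TBXSanityFilterSketch);
```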
Hi @aloeliger, my impression is that, if you wrote an EDFilter, making a copy of the collections is redundant. So, wouldn't it be enough to have an EDFilter that returns true or false and does not produce anything?
The use case for the "copied-over-or-dummy" product is modules consuming the data product that are in Tasks (i.e. unscheduled), or that are in Paths whose execution is (for whatever reason) not controlled by the said filter, that cannot handle a missing input data product without throwing an exception, and for which the desired behavior is to not throw the missing-data-product exception (i.e. emulating the behavior outlined in https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1554925070).
> my understanding is that this filter will then need to be inserted after the gt unpacking step, and anything reliant on the GT digis will need to instead be told to get their information from this?

Correct. Although from the framework perspective it would be better if the actual producer of the GT digis would not produce the data product in case of the error, because then the error condition is clearer for all consumers of the data product.
Do I assume correctly that the GT digis are not stored as such on any data tier?
Fwiw, HLT currently stores them in the output of certain data streams (look for keep *_hltGtStage2Digis_*_* in OnLine_HLT_GRun.py).
Thanks. I suppose in the HLT context the affected OutputModules are already configured such that, when the aforementioned EDFilter is added to the Sequence containing hltGtStage2Digis, the OutputModules will not store the events that the EDFilter rejects. Is this assumption correct?
I also see them being stored as part of HLTDebugRAW (https://github.com/cms-sw/cmssw/blob/a3479e2edf6588f3227d707f8db676aa2ab59cbe/HLTrigger/Configuration/python/HLTrigger_EventContent_cff.py#L167) and HLTDebugFEVT (https://github.com/cms-sw/cmssw/blob/a3479e2edf6588f3227d707f8db676aa2ab59cbe/HLTrigger/Configuration/python/HLTrigger_EventContent_cff.py#L501).
I was asking because for any non-temporary storage of these data products it would be cleanest to record the failure by storing the "not-produced" product, and was wondering how much of this use case needs to be covered.
If there is interest in the following kind of behavior, we could look into evolving the SkipEvent exception handling behavior in the framework accordingly (details in https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1554925070). Does anyone have any feedback on this potential evolution of SkipEvent?
For whatever my relatively inexpert opinion is worth, I think it's a good idea for exactly these cases. This event crashes, and crashes for a good reason. However, we should be able to fail individual events without crashing entire processes.
> I suppose in the HLT context the affected OutputModules are already configured such that, when the aforementioned EDFilter is added to the Sequence containing hltGtStage2Digis, the OutputModules will not store the events that the EDFilter rejects. Is this assumption correct?

In my understanding, it is.
But I'll add that I still have to study this issue, and it's not clear to me what the exact changes to the HLT menu would be because of it (esp. when it comes to the "copied-over-or-dummy product"). The "aforementioned EDFilter" sounds similar in purpose to the workaround tested in [1].
On top of the general issue, it would be useful (to me) to understand better some aspects that are specific to the L1T unpacker:
- one aspect relates to the HLTPrescaler workaround described towards the end of [1];
- the event presumably passed L1_SingleEG* (otherwise that crashing module would not be executed in the Path), but there were no L1T e/g candidates (thus the crash). Assuming this is the case, does it make sense for the digis to contain such inconsistencies, or is there a way to deal with them in the unpacker (e.g. if there are no e/g candidates, change L1_SingleEG* to false)?
> I also see them being stored as part of HLTDebugRAW and HLTDebugFEVT

Right, thanks for pointing this out.
@mmusich is the file with this error still available to test the new filter on?
I should add that the set of consuming modules that one needs to worry about includes all modules in EndPaths (whose execution is not controlled by a TriggerResultsFilter that would reject the events that would have the missing data product).
@mmusich I just want to ping you again. The original method for recreating the error seems to no longer work because it can no longer find the error file on eos. Is this still available for testing the short-term solution?
@aloeliger

> The original method for recreating the error seems to no longer work because it can no longer find the error file on eos. Is this still available for testing the short-term solution?

It seems you are not following the cmsTalk thread linked in the original issue description. Please see https://cms-talk.web.cern.ch/t/fatal-exception-in-prompt-reco-of-run-367232-datatset-jetmet0/23996/6
My apologies for my absence, I was at US CMS last week. Thanks for providing this. I will start trying to test with the stored tarballs ASAP.
@mmusich Apologies, but I am still running into issues trying to recreate the error so I can test my solution. The tarball does not include the file that the original parameter set/process runs on. Is the file root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2023C/JetMET0/RAW/v1/000/367/232/00000/9144de83-470d-499d-8973-700975c88eba.root still available somewhere I should be aware of?
@aloeliger you'll have to try to reproduce with what is there, or ask @germanfgv for clarifications.
Apparently it's only on tape:

```
$ dasgoclient -query='site file=/store/data/Run2023C/JetMET0/RAW/v1/000/367/232/00000/9144de83-470d-499d-8973-700975c88eba.root'
T0_CH_CERN_Tape
T1_US_FNAL_Tape
```

It should be trivial to make a rucio rule to get it back though.
The rucio rule should be trivial... but whether I have enough good will at wisconsin left to get them to put this onto their disk after the stunts I've pulled is another story altogether. I'll see what I can do.
I take it back... the block that contains this is /JetMET0/Run2023C-v1/RAW#b852026e-a127-4075-9b0a-90c0bdb1d5e4, which is 2.5 TB, and that is a little much any way you slice it to pull a single file off of tape for debugging. I don't suppose there is some secret debugging method of getting single files off of tape? Alternatively, I could just open a draft PR for the GT unpacker-inspecting module if we want to work without an explicit test on it.
Is anybody able to test if the fix in #42214 can solve this issue?
Apparently it's not straightforward to get back the RAW data for the Tier0 job that failed, see https://github.com/cms-sw/cmssw/issues/41645#issuecomment-1576913902. If we are OK with fetching a 2.5 TB block at CERN, it should be straightforward to check.
It may solve the immediate problem, but the point remains that the vector shouldn't have existed in the first place. I think there is some evidence at this point that something is going wrong in one of the muon unpackers. I would be hesitant to insert this solution and call anything "fixed".
Dear all, there is one job failing Prompt Reco for run 367232, dataset JetMET0 with a Fatal Exception, as described in https://cms-talk.web.cern.ch/t/fatal-exception-in-prompt-reco-of-run-367232-datatset-jetmet0/23996. The crash seems to originate from the module L1TObjectsTiming. The exception is reproducible on a lxplus8 node under CMSSW_13_0_5_patch2 (el8_amd64_gcc11). Full logs and PSet.py can be found at https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run367232_JetMET0/Reco/vocms014.cern.ch-415905-3-log.tar.gz. With the modified PSet.py file the crash occurs immediately. It should be noted that the crash is preceded by these warnings (perhaps related):