Open gpetruc opened 3 months ago
A new Issue was created by @gpetruc.
@Dr15Jones, @rappoccio, @makortel, @smuzaffar, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Assign RecoMuon/GlobalTrackingTools
New categories assigned: reconstruction
@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
type muon
Adding @namapane as I believe the exception comes from the 14_0_X backport of https://github.com/cms-sw/cmssw/pull/42646
TNX, looking into it.
The problem seems to be that this event has a PV which, despite being isValid(), has a ~zero covariance() matrix:
auto pv = pvHandle->at(0);
cout << pv.isValid() << endl << pv.covariance() << endl;
gives:
1
[  6.30659e-07  2.46501e-08 -1.17693e-06
   2.46501e-08  5.49215e-07 -3.72355e-07
  -1.17693e-06 -3.72355e-07  6.69129e-06 ]
This goes through SingleTrackVertexConstraint::constrain()
-> KalmanVertexTrackUpdator::update()
-> KVFHelper::vertexChi2
which trivially fails at this point.
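To put a number on how close to singular the printed matrix is, one could compute its determinant and leading principal minors. The following is a minimal standalone sketch, not CMSSW code: `Sym3`, `determinant`, and `isPositiveDefinite` are illustrative names introduced here, and the values to feed in would be copied by hand from the printout above.

```cpp
// Symmetric 3x3 matrix stored as its six independent elements.
// Illustrative type only (an assumption for this sketch), not a CMSSW class.
struct Sym3 {
  double xx, xy, xz, yy, yz, zz;
};

// Determinant by cofactor expansion along the first row.
double determinant(const Sym3& m) {
  return m.xx * (m.yy * m.zz - m.yz * m.yz)
       - m.xy * (m.xy * m.zz - m.yz * m.xz)
       + m.xz * (m.xy * m.yz - m.yy * m.xz);
}

// Sylvester's criterion: a symmetric matrix is positive definite
// if and only if all leading principal minors are strictly positive.
bool isPositiveDefinite(const Sym3& m) {
  const double minor1 = m.xx;
  const double minor2 = m.xx * m.yy - m.xy * m.xy;
  const double minor3 = determinant(m);
  return minor1 > 0.0 && minor2 > 0.0 && minor3 > 0.0;
}
```

For the matrix above this would be called as `determinant(Sym3{6.30659e-07, 2.46501e-08, -1.17693e-06, 5.49215e-07, -3.72355e-07, 6.69129e-06})`. Whether a small determinant actually makes the inversion fail depends on the inversion routine used inside the fit, which this sketch does not model.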
The event also has a 2.8e9 GeV muon, so something must be pathological.
I can't think of a way of adding a simple protection to check for the covariance matrix to be sensible, so I think the easiest solution is to catch the exception in MuonBeamspotConstraintValueMapProducer. Let me know if you have objections or better suggestions.
In the meantime I made a PR for this in master: #45243. I suppose it needs to be backported to 14_0_X; if so, let me know.
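The catch-and-continue idea can be sketched in standalone C++ as follows. This is a toy illustration under stated assumptions, not the producer's code: `FitError`, `constrainedPt`, and `processAll` are hypothetical stand-ins, and falling back to the unconstrained value when the fit fails is an assumption about what happens while the `tbd` flag stays true.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the exception thrown by the vertex fit.
struct FitError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for the constrained fit: throws when the weight
// cannot be inverted, otherwise returns a toy "constrained" value.
double constrainedPt(double pt, double weight) {
  if (weight == 0.0) {
    throw FitError("could not invert weight matrix");
  }
  return pt * weight / (weight + 1.0);
}

// Per-item processing: a failed fit is caught and the item falls back to
// its unconstrained value (assumed behaviour), so a single pathological
// input does not abort the whole loop.
std::vector<double> processAll(const std::vector<double>& pts,
                               const std::vector<double>& weights) {
  std::vector<double> out;
  out.reserve(pts.size());
  for (std::size_t i = 0; i < pts.size(); ++i) {
    bool tbd = true;  // "to be done": no value stored yet for this item
    try {
      out.push_back(constrainedPt(pts[i], weights[i]));
      tbd = false;
    } catch (const FitError&) {
      // Fit failed; leave tbd set so the fallback below is used.
    }
    if (tbd) {
      out.push_back(pts[i]);
    }
  }
  return out;
}
```

Catching only the specific exception type keeps unrelated failures visible, while the flag guarantees that exactly one value is stored per input.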
file is no more there
Failed to open the file 'root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024E/ParkingVBF0/RAW/v1/000/381/515/00000/05c3f64e-bfd1-4969-af4a-c91a9ccd723f.root?eos.app=cmst0'
@germanfgv @LinaresToine please comment.
I managed to test the fix https://github.com/cms-sw/cmssw/pull/45243 before the file disappeared. Not sure how to reproduce the problem again if you think that's not enough, unless the file can be retrieved somehow.
It seems another job failed at Tier0 with similar features: https://cmsweb.cern.ch/t0_reqmon/data/jobdetail/PromptReco_Run384981_JetMET1. I think the job initially failed on AMD and was then retried on Intel (where it got past the crash, but somehow the job didn't finish correctly). @LinaresToine might give more details.
@namapane FYI.
Hello all
I have saved the tarball of the latest occurrence in
/eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException
I also saved the input root file in there so the error can be reproduced.
Thanks @mmusich for the heads up. I am leaving for a 1 week holiday so I can check this one only when I'm back. In the meantime, did the job include the fix #45243?
I think so, the job was run in CMSSW_14_0_14
which should have included https://github.com/cms-sw/cmssw/pull/45396 (entered CMSSW_14_0_12).
Is this parallel to https://github.com/cms-sw/cmssw/issues/45189 ? The new occurrence is for JetMET1, which seems to belong in the mentioned issue.
What do you mean? This issue is 45189.
Thanks Marco, I meant https://github.com/cms-sw/cmssw/issues/45520. As you mentioned in cmstalk, they refer to different modules
> I have saved the tarball of the latest occurrence in /eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException I also saved the input root file in there so the error can be reproduced.
Thanks, I can reproduce the crash (on an AMD machine, lxplus800 in my case) with the following script:
#!/bin/bash
export SCRAM_ARCH=el8_amd64_gcc12
scram p CMSSW_14_0_14
cd CMSSW_14_0_14/src
eval `scram runtime -sh`
cp /eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException/vocms0314.cern.ch-2761618-12-log.tar.gz .
tar xf vocms0314.cern.ch-2761618-12-log.tar.gz
cp -pr ./job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
process.source.skipEvents=cms.untracked.uint32(766)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log
This results immediately (at the first event) in:
----- Begin Fatal Exception 03-Sep-2024 09:37:22 CEST-----------------------
An exception of category 'VertexException' occurred while
[0] Processing Event run: 384981 lumi: 572 event: 1260938254 stream: 0
[1] Running path 'write_NANOAOD_step'
[2] Prefetching for module PoolOutputModule/'write_NANOAOD'
[3] Prefetching for module SimplePATMuonFlatTableProducer/'muonTable'
[4] Calling method for module MuonBeamspotConstraintValueMapProducer/'muonBSConstrain'
Exception Message:
BasicSingleVertexState::could not invert weight matrix
----- End Fatal Exception -------------------------------------------------
With this simple patch:
diff --git a/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc b/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
index 74459f475cb..a83f3d98268 100644
--- a/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
+++ b/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
@@ -65,15 +65,21 @@ private:
   // Protect for mis-reconstructed beamspots (note that
   // SingleTrackVertexConstraint uses the width for the constraint,
   // not the error)
+
   if ((BeamWidthXError / BeamWidthX < 0.3) && (BeamWidthYError / BeamWidthY < 0.3)) {
-    SingleTrackVertexConstraint::BTFtuple btft =
-        stvc.constrain(ttkb->build(muon.muonBestTrack()), *beamSpotHandle);
-    if (std::get<0>(btft)) {
-      const reco::Track& trkBS = std::get<1>(btft).track();
-      pts.push_back(trkBS.pt());
-      ptErrs.push_back(trkBS.ptError());
-      chi2s.push_back(std::get<2>(btft));
-      tbd = false;
+    try {
+      SingleTrackVertexConstraint::BTFtuple btft =
+          stvc.constrain(ttkb->build(muon.muonBestTrack()), *beamSpotHandle);
+
+      if (std::get<0>(btft)) {
+        const reco::Track& trkBS = std::get<1>(btft).track();
+        pts.push_back(trkBS.pt());
+        ptErrs.push_back(trkBS.ptError());
+        chi2s.push_back(std::get<2>(btft));
+        tbd = false;
+      }
+    } catch (const VertexException& exc) {
+      // Update failed; give up.
     }
   }
 }
the crash that one can re-produce with the recipe at https://github.com/cms-sw/cmssw/issues/45189#issuecomment-2325811475 is circumvented.
I leave it to @cms-sw/reconstruction-l2 to provide a patch to cmssw, in case it is useful and correct to implement it.
Hi @mmusich, the patch looks good, let me test it too to double check and I will implement it in CMSSW. If I understand correctly, it needs to be backported to 14_0_X, right?
Hi @24LopezR
> the patch looks good, let me test it too to double check and I will implement it in CMSSW.
Thank you.
> If I understand correctly, it needs to be backported to 14_0_X, right?
correct. It needs to go in 14_2_X (master), 14_1_X (for HIon) and 14_0_X (for pp).
+1
This issue is fully signed and ready to be closed.
A PromptReco job failure in the NanoAOD step was observed at the Tier0; see the cms-talk thread: https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381515-parkingvbf0-vertexexception/42163
The exception appears to be reproducible when running on a single event, but only on AMD: the job fails at Tier0 (AMD EPYC 7763) and on my desktop (AMD Ryzen 9 5950X), but not on an Intel machine I tested (Intel Xeon Silver 4216).
Instructions to reproduce it on an EL8 AMD machine: