cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

NanoAOD VertexException in PromptReco_Run381515_ParkingVBF0 (CMSSW 14_0_7, on AMD arch) #45189

Open gpetruc opened 3 months ago

gpetruc commented 3 months ago

A PromptReco job failure in the NanoAOD step was observed at the Tier0 with the error message below (cms-talk thread: https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381515-parkingvbf0-vertexexception/42163):

----- Begin Fatal Exception 11-Jun-2024 10:46:46 CEST-----------------------
An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 381515 lumi: 384 event: 765632765 stream: 0
   [1] Running path 'write_NANOAOD_step'
   [2] Prefetching for module PoolOutputModule/'write_NANOAOD'
   [3] Prefetching for module SimplePATMuonFlatTableProducer/'muonTable'
   [4] Calling method for module MuonBeamspotConstraintValueMapProducer/'muonBSConstrain'
Exception Message:
BasicSingleVertexState::could not invert weight matrix
----- End Fatal Exception -------------------------------------------------

The exception appears to be reproducible when running on a single event, but only on AMD: the job fails at Tier0 (AMD EPYC 7763) and on my desktop (AMD Ryzen 9 5950X), but not on an Intel machine I tested (Intel Xeon Silver 4216).

Instructions to reproduce it on an EL8 AMD machine:

export SCRAM_ARCH=el8_amd64_gcc12
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/vertexException/job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.source.eventsToProcess = cms.untracked.VEventRange("381515:384:765632765",)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log
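
The event selection string used in the recipe above can be derived from the fatal-exception report itself. The following is a minimal sketch in plain Python, assuming the "Processing  Event run: ... lumi: ... event: ..." line format quoted in this issue; the helper name `event_range_from_log` is made up for illustration and is not part of CMSSW.

```python
import re

# Pattern for the "[0] Processing  Event run: R lumi: L event: E" line of a
# cmsRun fatal-exception report, as quoted in this issue.
EVENT_LINE = re.compile(r"run:\s*(\d+)\s+lumi:\s*(\d+)\s+event:\s*(\d+)")


def event_range_from_log(text):
    """Return 'run:lumi:event' for the first matching line, or None."""
    match = EVENT_LINE.search(text)
    if match is None:
        return None
    return ":".join(match.groups())


log_line = "   [0] Processing  Event run: 381515 lumi: 384 event: 765632765 stream: 0"
print(event_range_from_log(log_line))
```

The returned string has the same run:lumi:event form as the value passed to cms.untracked.VEventRange in PSet_one.py.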

cmsbuild commented 3 months ago

A new Issue was created by @gpetruc.

@Dr15Jones, @rappoccio, @makortel, @smuzaffar, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

Dr15Jones commented 3 months ago

Assign RecoMuon/GlobalTrackingTools

cmsbuild commented 3 months ago

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

jfernan2 commented 3 months ago

type muon

gpetruc commented 3 months ago

Adding @namapane as I believe the exception comes from the 14_0_X backport of https://github.com/cms-sw/cmssw/pull/42646

namapane commented 3 months ago

Adding @namapane as I believe the exception comes from the 14_0_X backport of #42646

TNX, looking into it.

namapane commented 3 months ago

The problem seems to be that this event has a PV which, despite being isValid(), has a ~zero covariance() matrix:

auto pv = pvHandle->at(0);
cout << pv.isValid() << endl << pv.covariance() << endl;

gives:

1
[  6.30659e-07  2.46501e-08 -1.17693e-06
   2.46501e-08  5.49215e-07 -3.72355e-07
  -1.17693e-06 -3.72355e-07  6.69129e-06 ]

This goes through SingleTrackVertexConstraint::constrain() -> KalmanVertexTrackUpdator::update() -> KVFHelper::vertexChi2, which trivially fails at this point. The event also has a 2.8e9 GeV muon, so there must be something pathological.

I can't think of a simple protection that would check that the covariance matrix is sensible, so I think the easiest solution is to catch the exception in MuonBeamspotConstraintValueMapProducer. Let me know if you have objections or better suggestions.
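
For anyone who wants to look at the printed matrix numerically, here is a small sketch in plain Python. It is only an illustration: it does not reproduce the inversion done inside BasicSingleVertexState or KVFHelper, the helper names are invented for this example, and the entries are taken as printed above without any assumption about their units.

```python
# Covariance matrix exactly as printed in the comment above (units as printed).
COV = [
    [6.30659e-07, 2.46501e-08, -1.17693e-06],
    [2.46501e-08, 5.49215e-07, -3.72355e-07],
    [-1.17693e-06, -3.72355e-07, 6.69129e-06],
]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def leading_minors(m):
    """Leading principal minors of a symmetric 3x3 matrix.

    By Sylvester's criterion the matrix is positive definite
    exactly when all three are positive.
    """
    return [m[0][0], m[0][0] * m[1][1] - m[0][1] * m[1][0], det3(m)]


def scaled_det(m):
    """Determinant of the corresponding correlation matrix (unit independent)."""
    return det3(m) / (m[0][0] * m[1][1] * m[2][2])


print("determinant:       ", det3(COV))
print("leading minors:    ", leading_minors(COV))
print("scaled determinant:", scaled_det(COV))
```

The absolute determinant depends on the units of the entries, so it is small whenever the entries are small. The scaled determinant does not depend on the units, which makes it the more useful of the two numbers when judging how close a matrix is to being singular.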

namapane commented 3 months ago

In the meantime I made a PR for this in master: #45243. I suppose it needs to be backported to 14_0_X; if so, let me know.

VinInn commented 3 months ago

The file is no longer there:

Failed to open the file 'root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024E/ParkingVBF0/RAW/v1/000/381/515/00000/05c3f64e-bfd1-4969-af4a-c91a9ccd723f.root?eos.app=cmst0'

mmusich commented 3 months ago

The file is no longer there

@germanfgv @LinaresToine please comment.

namapane commented 3 months ago

I managed to test the fix https://github.com/cms-sw/cmssw/pull/45243 before the file disappeared. If you think that's not enough, I'm not sure how to reproduce the problem again, unless the file can be retrieved somehow.

mmusich commented 3 weeks ago

It seems another job failed at Tier0 with a similar crash: https://cmsweb.cern.ch/t0_reqmon/data/jobdetail/PromptReco_Run384981_JetMET1. I think the job was initially failing on AMD and was then retried on Intel (where it got past the crash, but somehow the job didn't finish correctly). @LinaresToine might give more details.

@namapane FYI.

LinaresToine commented 3 weeks ago

Hello all, I have saved the tarball of the latest occurrence in /eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException. I also saved the input ROOT file in there so the error can be reproduced.

namapane commented 3 weeks ago

Thanks @mmusich for the heads up. I am leaving for a one-week holiday, so I can check this one only when I'm back. In the meantime, did the job include the fix #45243?

mmusich commented 3 weeks ago

I am leaving for a one-week holiday, so I can check this one only when I'm back. In the meantime, did the job include the fix https://github.com/cms-sw/cmssw/pull/45243?

I think so, the job was run in CMSSW_14_0_14 which should have included https://github.com/cms-sw/cmssw/pull/45396 (entered CMSSW_14_0_12).
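
The release comparison above can be written down as a short sketch. It assumes that a fix merged in a given release is present in every later release of the same X_Y cycle, and it ignores suffixes such as patch or pre-release tags; the function names are invented for this example.

```python
import re


def release_tuple(name):
    """Turn 'CMSSW_14_0_14' into (14, 0, 14); any trailing suffix is ignored."""
    match = re.match(r"CMSSW_(\d+)_(\d+)_(\d+)", name)
    if match is None:
        raise ValueError(f"not a CMSSW release name: {name!r}")
    return tuple(int(part) for part in match.groups())


def includes_fix(release, first_release_with_fix):
    """True if `release` is in the same X_Y cycle and not older than the fix."""
    rel = release_tuple(release)
    fix = release_tuple(first_release_with_fix)
    return rel[:2] == fix[:2] and rel >= fix


print(includes_fix("CMSSW_14_0_14", "CMSSW_14_0_12"))
```

Comparing integer tuples avoids the pitfall of comparing the names as strings, where "14_0_7" would sort after "14_0_12".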

LinaresToine commented 3 weeks ago

Is this parallel to https://github.com/cms-sw/cmssw/issues/45189 ? The new occurrence is for JetMET1, which seems to belong in the mentioned issue.

mmusich commented 3 weeks ago

Is this parallel to https://github.com/cms-sw/cmssw/issues/45189 ?

What do you mean? This issue is 45189.

LinaresToine commented 3 weeks ago

Thanks Marco, I meant https://github.com/cms-sw/cmssw/issues/45520. As you mentioned in cms-talk, they refer to different modules.

mmusich commented 3 weeks ago

I have saved the tarball of the latest occurrence in /eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException. I also saved the input ROOT file in there so the error can be reproduced.

Thanks, I can reproduce the crash (on an AMD machine, lxplus800 in my case) with the following script:

#!/bin/bash
export SCRAM_ARCH=el8_amd64_gcc12
scram p CMSSW_14_0_14
cd CMSSW_14_0_14/src
eval `scram runtime -sh`
cp /eos/home-c/cmst0/public/PausedJobs/Run2024G/VertexException/vocms0314.cern.ch-2761618-12-log.tar.gz .
tar xf vocms0314.cern.ch-2761618-12-log.tar.gz
cp -pr ./job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.source.skipEvents=cms.untracked.uint32(766)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log

This results immediately (at the first event) in:

----- Begin Fatal Exception 03-Sep-2024 09:37:22 CEST-----------------------
An exception of category 'VertexException' occurred while
   [0] Processing  Event run: 384981 lumi: 572 event: 1260938254 stream: 0
   [1] Running path 'write_NANOAOD_step'
   [2] Prefetching for module PoolOutputModule/'write_NANOAOD'
   [3] Prefetching for module SimplePATMuonFlatTableProducer/'muonTable'
   [4] Calling method for module MuonBeamspotConstraintValueMapProducer/'muonBSConstrain'
Exception Message:
BasicSingleVertexState::could not invert weight matrix 
----- End Fatal Exception -------------------------------------------------

mmusich commented 3 weeks ago

With this simple patch:

diff --git a/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc b/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
index 74459f475cb..a83f3d98268 100644
--- a/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
+++ b/RecoMuon/GlobalTrackingTools/plugins/MuonBeamspotConstraintValueMapProducer.cc
@@ -65,15 +65,21 @@ private:
         // Protect for mis-reconstructed beamspots (note that
         // SingleTrackVertexConstraint uses the width for the constraint,
         // not the error)
+
         if ((BeamWidthXError / BeamWidthX < 0.3) && (BeamWidthYError / BeamWidthY < 0.3)) {
-          SingleTrackVertexConstraint::BTFtuple btft =
-              stvc.constrain(ttkb->build(muon.muonBestTrack()), *beamSpotHandle);
-          if (std::get<0>(btft)) {
-            const reco::Track& trkBS = std::get<1>(btft).track();
-            pts.push_back(trkBS.pt());
-            ptErrs.push_back(trkBS.ptError());
-            chi2s.push_back(std::get<2>(btft));
-            tbd = false;
+          try {
+            SingleTrackVertexConstraint::BTFtuple btft =
+                stvc.constrain(ttkb->build(muon.muonBestTrack()), *beamSpotHandle);
+
+            if (std::get<0>(btft)) {
+              const reco::Track& trkBS = std::get<1>(btft).track();
+              pts.push_back(trkBS.pt());
+              ptErrs.push_back(trkBS.ptError());
+              chi2s.push_back(std::get<2>(btft));
+              tbd = false;
+            }
+          } catch (const VertexException& exc) {
+            // Update failed; give up.
           }
         }
       }

the crash that can be reproduced with the recipe at https://github.com/cms-sw/cmssw/issues/45189#issuecomment-2325811475 is circumvented. I leave it to @cms-sw/reconstruction-l2 to provide a patch to cmssw, if it is useful and correct to implement it.
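
The control flow of the patch (try the constraint for each muon, and on a VertexException skip the update for that muon instead of aborting the job) can be sketched in Python as below. Everything here is a stand-in: the `VertexException` class, the `constrain` function, its failure trigger and the `default` placeholder value are assumptions made for illustration and are not taken from the CMSSW code.

```python
class VertexException(Exception):
    """Stand-in for the C++ VertexException; not the real CMSSW class."""


def constrain(pt):
    """Hypothetical stand-in for stvc.constrain(): returns (ok, pt, pt_err, chi2)."""
    if pt > 1e6:  # arbitrary trigger for this sketch, not a CMSSW criterion
        raise VertexException("could not invert weight matrix")
    return True, pt, 0.01 * pt, 1.0


def constrained_values(muon_pts, default=-1.0):
    """Per-muon (pt, pt_err, chi2); `default` is an assumed placeholder value."""
    results = []
    for pt in muon_pts:
        values = (default, default, default)
        try:
            ok, new_pt, new_err, chi2 = constrain(pt)
            if ok:
                values = (new_pt, new_err, chi2)
        except VertexException:
            # Update failed; keep the placeholder, mirroring the catch block.
            pass
        results.append(values)
    return results


print(constrained_values([25.0, 2.8e9, 40.0]))
```

Only the muon whose update fails is affected; the other muons in the same event are still processed, which is the point of catching the exception inside the loop.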

24LopezR commented 3 weeks ago

Hi @mmusich, the patch looks good, let me test it too to double check and I will implement it in CMSSW. If I understand correctly, it needs to be backported to 14_0_X, right?

mmusich commented 3 weeks ago

Hi @24LopezR

the patch looks good, let me test it too to double check and I will implement it in CMSSW.

Thank you.

If I understand correctly, it needs to be backported to 14_0_X, right?

correct. It needs to go in 14_2_X (master), 14_1_X (for HIon) and 14_0_X (for pp).

jfernan2 commented 3 weeks ago

+1

cmsbuild commented 3 weeks ago

This issue is fully signed and ready to be closed.