cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Crash in `PATLeptonTimeLifeInfoProducer.cc` in `CMSSW_14_0_6` #44862

Closed: mandrenguyen closed this issue 6 months ago

mandrenguyen commented 6 months ago

In a replay of CMSSW_14_0_6 a crash was reported here: https://cms-talk.web.cern.ch/t/replay-request-for-cmssw-14-0-6/39939/4

This crash occurs in https://cmssdt.cern.ch/lxr/source/PhysicsTools/PatAlgos/plugins/PATLeptonTimeLifeInfoProducer.cc and it happens at the following line: https://github.com/cms-sw/cmssw/blob/90e59579185f807ab8cc1fb4c92ba19c98d49ed1/PhysicsTools/PatAlgos/plugins/PATLeptonTimeLifeInfoProducer.cc#L172

One can reproduce the problem directly by executing PSet.py from the CMS talk post above, and adding the following line: process.source.eventsToProcess = cms.untracked.VEventRange("369998:31680062")

As this issue is blocking the deployment of a release with important bug-fixes for the HLT, the issue is urgent, and any help would be highly appreciated.

cmsbuild commented 6 months ago

A new Issue was created by @mandrenguyen.

@antoniovilela, @sextonkennedy, @makortel, @rappoccio, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

mandrenguyen commented 6 months ago

assign reconstruction

cmsbuild commented 6 months ago

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

mandrenguyen commented 6 months ago

@cms-sw/egamma-pog-l2 @cms-sw/tracking-pog-l2 Maybe one of you would have some insight.

francescobrivio commented 6 months ago

simple recipe to reproduce:

cmsrel CMSSW_14_0_6
cd CMSSW_14_0_6/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Replay14_0_6/job_8586/job6/WMTaskSpace/cmsRun1/PSet.p* .

update the PSet with:

cat <<EOF >> PSet.py
process.source.eventsToProcess = cms.untracked.VEventRange("369998:31680062")
process.options.numberOfThreads=cms.untracked.uint32(1)
process.options.numberOfStreams=cms.untracked.uint32(1)
EOF

run:

cmsRun PSet.py

mmusich commented 6 months ago

@mbluj please take a look

mmusich commented 6 months ago

urgent

mmusich commented 6 months ago

I guess the easiest is to check if the closestState is valid before accessing any of its members.

mandrenguyen commented 6 months ago

> I guess the easiest is to check if the closestState is valid before accessing any of its members.

I confirm that adding the following before the offending line allows the event to process successfully: if(!closestState.isValid()) return;

I will let others comment on whether that's an acceptable solution.
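The guard discussed above can be sketched in isolation. This is only a minimal illustration: `MockState`, `Point3D` and `pointOfClosestApproach` are hypothetical stand-ins invented for the example, not the CMSSW classes, and the sketch returns an empty optional where the producer would simply return early.

```cpp
#include <iostream>
#include <optional>

// Hypothetical stand-in for a 3D point; not a CMSSW type.
struct Point3D {
  double x, y, z;
};

// Hypothetical stand-in for a trajectory state that may be invalid,
// e.g. when an extrapolation did not succeed.
class MockState {
public:
  MockState() = default;  // default-constructed states are invalid
  explicit MockState(Point3D p) : pos_(p) {}

  bool isValid() const { return pos_.has_value(); }

  // Must only be called on a valid state; throws std::bad_optional_access
  // otherwise, standing in for the crash seen on an invalid state.
  const Point3D& globalPosition() const { return pos_.value(); }

private:
  std::optional<Point3D> pos_;
};

// Returns the point of closest approach when the state is valid,
// and std::nullopt (after reporting the problem) when it is not.
std::optional<Point3D> pointOfClosestApproach(const MockState& closestState) {
  if (!closestState.isValid()) {
    std::cerr << "closestState not valid!\n";
    return std::nullopt;
  }
  return closestState.globalPosition();
}
```

Calling `pointOfClosestApproach(MockState{})` yields an empty optional instead of touching the members of an invalid state, which is the behaviour the one-line check is meant to give.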

francescobrivio commented 6 months ago

> > I guess the easiest is to check if the closestState is valid before accessing any of its members.
>
> I confirm that adding the following before the offending line allows the event to process successfully: if(!closestState.isValid()) return;
>
> I will let others comment on whether that's an acceptable solution.

In case experts agree this is the best solution, here is a commit (to master) that can be quickly cherry-picked: https://github.com/francescobrivio/cmssw/commit/c150b8b7a174a8d93cb07fc79e2ec22667a0d43d

It emits this edm error:

%MSG-e PATLeptonTimeLifeInfoProducer:   PATElectronTimeLifeInfoProducer:electronTimeLifeInfos  29-Apr-2024 11:56:53 CEST Run: 369998 Event: 31680062
closestState not valid!
%MSG

mmusich commented 6 months ago

Out of curiosity, what's `transTrack.impactPointState()` in the event that leads to the crash?

francescobrivio commented 6 months ago

I'm getting:

  impactPointState: 
global parameters
x =      0.116433    -0.181856     0.937492
p =       3.90438      0.18625      10.0547
global error
  0.000478242  8.74249e-07  8.21233e-05 -0.000239124 -1.87666e-05
  8.74249e-07  1.17157e-07  2.75897e-07 -1.21146e-06 -1.43013e-06
  8.21233e-05  2.75897e-07  1.75494e-05  -5.8975e-05 -5.83673e-06
 -0.000239124 -1.21146e-06  -5.8975e-05  0.000221227  2.56773e-05
 -1.87666e-05 -1.43013e-06 -5.83673e-06  2.56773e-05  2.10979e-05
local parameters (q/p,v',w',v,w)
   -0.0926974            0      2.57232            0            0
local error
  0.000478242  8.20722e-05  6.65901e-06 -0.000239124 -5.17932e-05
  8.20722e-05  1.75178e-05  2.07178e-06 -5.89051e-05 -1.59499e-05
  6.65901e-06  2.07178e-06  6.79698e-06 -9.22749e-06 -3.00633e-05
 -0.000239124 -5.89051e-05 -9.22749e-06  0.000221227  7.08658e-05
 -5.17932e-05 -1.59499e-05 -3.00633e-05  7.08658e-05  0.000160699
Defined at beforeSurface
Magnetic field in inverse GeV:  (-6.68797e-10,1.04459e-09,0.0114257) 

and from RecoVertex::convertPos(pv.position()) I get:

 convertPos:  (0.117078,-0.182911,0.95493)

vlimant commented 6 months ago

assign xpog

cmsbuild commented 6 months ago

New categories assigned: xpog

@vlimant,@hqucms you have been requested to review this Pull request/Issue and eventually sign? Thanks

slava77 commented 6 months ago

> global parameters
> x = 0.116433 -0.181856 0.937492
> p = 3.90438 0.18625 10.0547
> convertPos: (0.117078,-0.182911,0.95493)

it's not clear why this state and target would fail propagation

mmusich commented 6 months ago

mmmh, I am getting different numbers using the recipe at https://github.com/cms-sw/cmssw/issues/44862#issuecomment-2082166554

diff --git a/PhysicsTools/PatAlgos/plugins/PATLeptonTimeLifeInfoProducer.cc b/PhysicsTools/PatAlgos/plugins/PATLeptonTimeLifeInfoProducer.cc
index 2e41063e3f2..f68730271ea 100644
--- a/PhysicsTools/PatAlgos/plugins/PATLeptonTimeLifeInfoProducer.cc
+++ b/PhysicsTools/PatAlgos/plugins/PATLeptonTimeLifeInfoProducer.cc
@@ -167,6 +167,8 @@ void PATLeptonTimeLifeInfoProducer<T>::produceAndFillIPInfo(const T& lepton,
     // Extrapolate track to the point closest to PV
     reco::TransientTrack transTrack = transTrackBuilder.build(track);
     AnalyticalImpactPointExtrapolator extrapolator(transTrack.field());
+
+    std::cout << __PRETTY_FUNCTION__ << " " << transTrack.impactPointState() << std::endl;
     TrajectoryStateOnSurface closestState =
         extrapolator.extrapolate(transTrack.impactPointState(), RecoVertex::convertPos(pv.position()));
     GlobalPoint pca = closestState.globalPosition();

void PATLeptonTimeLifeInfoProducer<T>::produceAndFillIPInfo(const T&, const TransientTrackBuilder&, const reco::Vertex&, TrackTimeLifeInfo&) [with T = pat::Electron] global parameters
x =       1.91865      34.2722      166.593
p =     -0.016365  0.000857716    0.0626035
global error
     0.812786   0.00612206   -0.0855089   -0.0504526    0.0847113
   0.00612206   0.00027459 -0.000491722  0.000820261   0.00174473
   -0.0855089 -0.000491722    0.0124496   0.00858109   -0.0126903
   -0.0504526  0.000820261   0.00858109    0.0129749  -0.00277318
    0.0847113   0.00174473   -0.0126903  -0.00277318    0.0202798
local parameters (q/p,v',w',v,w)
     -15.4529            0     -3.82022            0           -0
local error
     0.812786    0.0291878   -0.0981561    0.0504526    -0.334519
    0.0291878   0.00453957    0.0104463   0.00673732   0.00313124
   -0.0981561    0.0104463    0.0685205    0.0127032     0.109981
    0.0504526   0.00673732    0.0127032    0.0129749   -0.0109511
    -0.334519   0.00313124     0.109981   -0.0109511     0.316245
Defined at beforeSurface
Magnetic field in inverse GeV:  (1.92692e-06,3.442e-05,0.0112619)

slava77 commented 6 months ago

> x =       1.91865      34.2722      166.593
> p =     -0.016365  0.000857716    0.0626035

this one makes more sense as a state that could fail propagation to the PCA.
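A rough back-of-the-envelope check of why this state looks fragile, as a sketch only. It assumes that the "inverse GeV" field units in the printout are such that the helix radius in cm is pT divided by |Bz|, and the helper names are made up for the illustration. It is not a diagnosis of what the extrapolator actually does.

```cpp
#include <cmath>

// Transverse momentum in GeV from the x and y momentum components.
double transverseMomentum(double px, double py) { return std::hypot(px, py); }

// Helix radius in cm. Assumption: bzInverseGeV is the z component of the
// field in the "inverse GeV" units of the printout, so that R = pT / |Bz|.
double helixRadiusCm(double pt, double bzInverseGeV) {
  return pt / std::fabs(bzInverseGeV);
}

// Transverse distance of a point from the z axis, in cm.
double transverseRadiusCm(double x, double y) { return std::hypot(x, y); }
```

With the numbers printed above, `transverseMomentum(-0.016365, 0.000857716)` is roughly 0.016 GeV, `helixRadiusCm` of that value with Bz = 0.0112619 is roughly 1.5 cm, and `transverseRadiusCm(1.91865, 34.2722)` is roughly 34 cm. Under the stated assumption, a track curling with a radius of about 1.5 cm at a transverse distance of about 34 cm from the beam line stays far from the PV region, which fits the remark that this state is a plausible candidate for a failed extrapolation.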

francescobrivio commented 6 months ago

Yea sorry, I do get the same numbers as Marco indeed! Not sure what I was printing exactly... I'll update my PR with the printouts as @slava77 suggested.

mbluj commented 6 months ago

Hello, I was completely off last week and I am reading it only now. Thank you for fixing the issue.

mandrenguyen commented 6 months ago

+1 We can consider this solved by #44864

francescobrivio commented 6 months ago

> +1 We can consider this solved by #44864

Thanks Matt! For completeness this was solved by #44864 + #44875! (and the combined backport is #44869)