cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

[AARCH64] Relvals failing due to nan/inf numbers #39267

Open smuzaffar opened 2 years ago

smuzaffar commented 2 years ago

Most of the errors reported in https://github.com/cms-sw/cmssw/issues/36788 were fixed by https://github.com/cms-sw/cmssw/pull/39183. We still have 13 workflows failing with the same error [a]. It looks like nan/inf values are still being generated at https://github.com/cms-sw/cmssw/blob/master/RecoVertex/KinematicFit/interface/KinematicConstrainedVertexUpdatorT.h#L156. Value of val and lambda here are

@cms-sw/reconstruction-l2, can you please look into this and provide a fix?

[a]

GammaContinuedFraction::a too large, ITMAX too small
GammaContinuedFraction::a too large, ITMAX too small
----- Begin Fatal Exception 31-Aug-2022 03:57:14 CEST-----------------------
An exception of category 'Vertex' occurred while
   [0] Processing  Event run: 1 lumi: 9 event: 431 stream: 3
   [1] Running path 'dqmofflineOnPAT_1_step'
   [2] Prefetching for module SingleTopTChannelLeptonDQM_miniAOD/'singleTopElectronMediumDQM_miniAOD'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module MuonProducer/'muons'
   [7] Prefetching for module PFProducer/'particleFlowTmp'
   [8] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [9] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [10] Prefetching for module PFConversionProducer/'pfConversions'
   [11] Calling method for module ConversionProducer/'allConversions'
Exception Message:
Refitted track not found in list.
 pt used for comparison: nan
----- End Fatal Exception -------------------------------------------------
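
The exception message reports the pt used for comparison as nan, which would explain why the lookup fails: under IEEE 754, any comparison involving NaN is false, so a NaN pt can never match an entry in the list. The thread does not show the lookup code, so the following is only a minimal sketch under the assumption that tracks are matched by pt within a tolerance; `findByPt` and its parameters are hypothetical names, not CMSSW code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a "find the refitted track by pt" lookup.
// Returns the index of the first entry within `tol` of `pt`, or -1 if none.
inline int findByPt(const std::vector<double>& pts, double pt, double tol = 1e-6) {
  for (std::size_t i = 0; i < pts.size(); ++i) {
    // If pt (or pts[i]) is NaN, the difference is NaN and "< tol" is false,
    // so the entry is skipped.
    if (std::fabs(pts[i] - pt) < tol) {
      return static_cast<int>(i);
    }
  }
  return -1;  // a caller would treat this as "track not found in list"
}
```

With a NaN input the function returns -1 for every list, including a list that itself contains NaN, so rejecting non-finite fit results before they reach such a lookup avoids the failure.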

cmsbuild commented 2 years ago

A new Issue was created by @smuzaffar Malik Shahzad Muzaffar.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

smuzaffar commented 2 years ago

assign reconstruction

cmsbuild commented 2 years ago

New categories assigned: reconstruction

@jpata,@clacaputo,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

smuzaffar commented 2 years ago

By the way, adding the following check at https://github.com/cms-sw/cmssw/blob/master/RecoVertex/KinematicFit/interface/KinematicConstrainedVertexUpdatorT.h#L141 allowed workflow 10803.0 to run for el9_aarch64_gcc11:

   val += g * delta_alpha;
   lambda = v_g_sym * val;
+  if (std::isnan(lambda[0]) || std::isinf(lambda[0])) {
+    return RefCountedKinematicVertex();
+  }

However, I had to drop the ofast-flag dependency from https://github.com/cms-sw/cmssw/blob/master/RecoEgamma/EgammaPhotonAlgos/BuildFile.xml#L1 for std::isnan and std::isinf to work properly.

slava77 commented 2 years ago

> but I also need to drop ofast-flag from

what about edm::isFinite?

smuzaffar commented 2 years ago

ah ok, let me try that. thanks @slava77

smuzaffar commented 2 years ago

Thanks @slava77, the following patch after https://github.com/cms-sw/cmssw/blob/master/RecoVertex/KinematicFit/interface/KinematicConstrainedVertexUpdatorT.h#L141 (without dropping the ofast-flag dependency) allowed the failing relval to run (on both el8_amd64 and el9_aarch64):

+  if (!edm::isFinite(lambda[0])) {
+    //edm::LogWarning("KinematicConstrainedVertexUpdatorFailed") << "some error/warning message\n";
+    //LogDebug("KinematicConstrainedVertexUpdatorFailed") << "some error/warning message\n";
+    return RefCountedKinematicVertex();
+  }

@cms-sw/reconstruction-l2, if this is the correct fix, can you please open a PR with a proper error/warning message?

mandrenguyen commented 2 years ago

Well @smuzaffar, I would defer to @slava77 here as tracking POG convener. If it helps, I can make a pull request adding the message "Kinematic constrained vertex updator failed". Are you recommending just LogDebug or also LogWarning?

smuzaffar commented 2 years ago

@mandrenguyen, the previous check at https://github.com/cms-sw/cmssw/blob/master/RecoVertex/KinematicFit/interface/KinematicConstrainedVertexUpdatorT.h#L131-L135 has both LogDebug and LogWarning, which is why I suggested adding both for this check too.

perrotta commented 2 years ago

urgent

makortel commented 1 year ago

The failure came back in workflow 14.0 step 3 in CMSSW_13_3_X_2023-08-21-2300 on el8_aarch64_gcc11

Begin processing the 95th record. Run 1, Event 2493, LumiSection 13 on stream 1 at 22-Aug-2023 05:43:16.615 CEST
GammaContinuedFraction::a too large, ITMAX too small
GammaContinuedFraction::a too large, ITMAX too small
----- Begin Fatal Exception 22-Aug-2023 05:43:17 CEST-----------------------
An exception of category 'Vertex' occurred while
   [0] Processing  Event run: 1 lumi: 13 event: 2494 stream: 3
   [1] Running path 'dqmofflineOnPAT_1_step'
   [2] Prefetching for module SingleTopTChannelLeptonDQM_miniAOD/'singleTopElectronMediumDQM_miniAOD'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Prefetching for module PATMuonProducer/'patMuons'
   [6] Prefetching for module MuonProducer/'muons'
   [7] Prefetching for module PFProducer/'particleFlowTmp'
   [8] Prefetching for module PFBlockProducer/'particleFlowBlock'
   [9] Prefetching for module PFElecTkProducer/'pfTrackElec'
   [10] Prefetching for module PFConversionProducer/'pfConversions'
   [11] Calling method for module ConversionProducer/'allConversions'
Exception Message:
Refitted track not found in list.
 pt used for comparison: -nan
----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_aarch64_gcc11/CMSSW_13_3_X_2023-08-21-2300/pyRelValMatrixLogs/run/14.0_WpM/step3_WpM.log#/

Should we reopen this issue or open a new one?

smuzaffar commented 1 year ago

We can reopen this, but I am not sure whether GitHub will close it again, since #39298 (which claims to fix this issue) is merged. Let's reopen it, and if GitHub closes it then we can open a new one.

smuzaffar commented 9 months ago

Looks like this has been fixed. We have not seen this exception for a few months. I would suggest closing this issue.