cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.3k forks source link

GsfTracking crash at HLT in run 359686 #39570

Closed swagata87 closed 1 year ago

swagata87 commented 2 years ago

This issue is related to https://github.com/cms-sw/cmssw/issues/39026 in the sense that similar crashes happened before and a fix for that was put in https://github.com/cms-sw/cmssw/pull/39074. It is unclear to me why the crash came back despite the fix.

Some details of the crash is here

With the error stream file, I could reproduce the crash in GPU machines.

The error stream file(.raw), which is available in Hilton nodes, is also copied to this location: /eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw

In case it is useful, the file is also available in .root format: /eos/cms/store/group/phys_egamma/swmukher/outputFileGSF.root

Recipe to reproduce the crash in GPU machine is given below:

cmsrel CMSSW_12_4_9 
cd CMSSW_12_4_9/src 
cmsenv
hltConfigFromDB --runNumber 359686 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/store/error_stream/run359686/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
@EOF
cmsRun hlt.py 

It crashes with the following messages:

%MSG-i CUDAService:  (NoModuleName) 01-Oct-2022 22:49:54 CEST pre-events
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
CUDA device 1: Tesla T4 (sm_75)
%MSG
.....
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:hltEgammaGsfTracks  01-Oct-2022 22:50:29 CEST Run: 359686 Event: 165691107
KF updated state 3 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:hltEgammaGsfTracks  01-Oct-2022 22:50:29 CEST Run: 359686 Event: 165691107
KF updated state 4 is invalid. skipping.
.....
----- Begin Fatal Exception 01-Oct-2022 22:50:29 CEST-----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 359686 lumi: 263 event: 165691107 stream: 0
   [1] Running path 'DST_Run3_PFScoutingPixelTracking_v18'
   [2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------
cmsbuild commented 2 years ago

A new Issue was created by @swagata87 Swagata Mukherjee.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

swagata87 commented 2 years ago

assign reconstruction, tracking-pog, egamma-pog

cmsbuild commented 2 years ago

New categories assigned: tracking-pog,reconstruction,egamma-pog

@slava77,@mmusich,@mandrenguyen,@lfinco,@clacaputo,@swagata87 you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented 2 years ago

the recipe at https://github.com/cms-sw/cmssw/issues/39570#issue-1393682784 when run on lxplus-gpu results in:

Can not stat file (2):-/store/error_stream/run359686/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw
%MSG
----- Begin Fatal Exception 03-Oct-2022 14:00:58 CEST-----------------------
An exception of category 'FedRawDataInputSource:threadError' occurred while
   [0] Calling InputSource::getNextItemType
Exception Message:
 file reader thread error 
----- End Fatal Exception -------------------------------------------------

please create a recipe that can run without having access to hilton nodes.

bmjoshi commented 2 years ago

@mmusich Please use process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']

bmjoshi commented 2 years ago

This issue is not reproducible on lxplus-node. However, it is reproducible on GPU cluster at Milan. The difference that I found was the CUDA version. Please find the driver info and version below: LXPLUS-GPU: CUDA runtime version 11.5, driver version 11.7, NVIDIA driver version 515.65.01 MILAN-GPU@P5: CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03

mmusich commented 2 years ago

with this recipe:

(after ssh -X -Y <user>@lxplus-gpu.cern.ch)

cmsrel CMSSW_12_4_9 
cd CMSSW_12_4_9/src 
cmsenv
hltConfigFromDB --runNumber 359686 > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359686_ls0263_index000081_fu-c2b03-27-01_pid2808063.raw']
@EOF
cmsRun hlt.py > & out.out &

I am unable to reproduce the error. You can find the resulting log file here: http://musich.web.cern.ch/musich/Issue-39570/out.out

mmusich commented 2 years ago

-1

swagata87 commented 2 years ago

Just to keep track, there was another similar crash in HLT in run 359806. Elog: http://cmsonline.cern.ch/cms-elog/1158449

It was possible to reproduce the crash in GPU Hilton machine using the error stream file. The recipe is same as given in the issue description; but only change is the filename: process.source.fileNames = ['file:/store/error_stream/run359806/run359806_ls0197_index000214_fu-c2b03-36-01_pid4120185.raw'] The crash reproduced in hilton can be viewed here: https://swmukher.web.cern.ch/swmukher/hlt_run359806_gsf_GPU_hilton_c2b01_44_01.log

As before, this crash was also NOT reproduced in lxplus-gpu. To run on lxplus, one can use this file which I copied to /eos: process.source.fileNames = ['file:/eos/cms/store/group/phys_egamma/swmukher/run359806_ls0197_index000214_fu-c2b03-36-01_pid4120185.raw'] The log file is here, which shows that there is no crash when ran in lxplus-gpu; https://swmukher.web.cern.ch/swmukher/hlt_run359806_gsf_lxplus_gpu_764.log

swagata87 commented 2 years ago

The crash goes away if we add the full Sylvester's criterion here: https://github.com/cms-sw/cmssw/blob/5fb55fe3444c124a47661df4cc0168279ec55769/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc#L34-L36 ie; with this patch the crash is gone: (ignore the std::cout, which was just for check)

diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index 5e9e6b31389..e3035361a5c 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -32,8 +32,17 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
     TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);

     double det = 0;
-    if (updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
-        (updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0) {
+    double det22 = 0;
+    double det33 = 0;
+    double det44 = 0;
+
+    if ( updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
+        (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det22) && det22 > 0) &&
+        (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det33) && det33 > 0) &&
+        (updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det44) && det44 > 0) &&
+        (updatedTSOS.curvilinearError().matrix().Det2(det)) && det > 0)  {
+      std::cout << "all TSOS condition passed so will do result.addState.. " << std::endl;
+      
       result.addState(TrajectoryStateOnSurface(weights[i],
missirol commented 2 years ago

FYI: @cms-sw/hlt-l2 @cms-sw/heterogeneous-l2

(HLT crashes, seemingly specific to reconstruction on GPUs)

swagata87 commented 2 years ago

There was another HLT crash due to GSF with similar error message, in run 359871. Relevant Elog: http://cmsonline.cern.ch/cms-elog/1158552 The crash was reproducible in hilton machine; this was the error message in hilton; using error stream file /store/error_stream/run359871/run359871_ls0093_index000227_fu-c2b01-39-01_pid3885773.raw

%MSG
----- Begin Fatal Exception 06-Oct-2022 14:46:35 -----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 359871 lumi: 93 event: 193950756 stream: 0
   [1] Running path 'HLT_DoubleEle9_eta1p22_mMax6_v2'
   [2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracksForBParking'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------

With the fix mentioned above (https://github.com/cms-sw/cmssw/issues/39570#issuecomment-1268610509), the crash goes away.

I will prepare a PR.

swagata87 commented 2 years ago

Given the fix PR is merged in master, I'll prepare backport to 12_5_X and 12_4_X.

Similar GSF crashes is expected to happen rarely in long collision runs until the fix is online. Another example crash is here, from Friday's overnight run: http://cmsonline.cern.ch/cms-elog/1159020. This one was also reproducible using error_stream file, and the crash goes away with the fix.

I guess we have enough evidence now that proves that the fix works. So I probably won't report the same crash anymore, as they are expected to happen until the fix is in.

fwyzard commented 2 years ago

I guess we have enough evidence now that proves that the fix works. So I probably won't report the same crash anymore, as they are expected to happen until the fix is in.

I expect the HLT DOCs will anyway document them ?

swagata87 commented 2 years ago

+1

alintulu commented 1 year ago

This crash happened again in two runs over the weekend (ELOG). Both were taken with 12_4_10_patch2 .

mmusich commented 1 year ago

This crash happened again in two runs over the weekend (ELOG). Both were taken with 12_4_10_patch2 .

@alintulu you might want to report a recipe to reproduce (also offline).

swagata87 commented 1 year ago

Hi Marco, I was checking this a little bit. I could not reproduce the crash in run 360919, but could reproduce the crash in 360895. And what I found is that, your suggestion here , which was related to a different crash, can avoid this crash in 360895. But it prints some messages like this:

[swmukher@gputest-milan-01 src]$ cmsRun hlt.py 
.......
CUDA runtime version 11.5, driver version 11.6, NVIDIA driver version 510.47.03
CUDA device 0: Tesla T4 (sm_75)
......
%MSG-w BasicTrajectoryState:  GsfTrackProducer:hltEgammaGsfTracksForBParking  23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
BasicTrajectoryState: attempt to access errors when none available  accessing local error..
freestate pointer: parameters
x =       37.6946     -22.2219      4.96195
p =       1.71548      -1.1439     0.328893
no error defined.

local error valid/values :0
[         -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan ]
%MSG
%MSG-w BasicTrajectoryState:  GsfTrackProducer:hltEgammaGsfTracksForBParking  23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
BasicTrajectoryState: attempt to access errors when none available  accessing local error..
freestate pointer: parameters
x =       37.6946     -22.2219      4.96195
p =       1.71548      -1.1439     0.328893
no error defined.

local error valid/values :0
[         -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan
          -nan        -nan        -nan        -nan        -nan ]
%MSG
%MSG-w BasicTrajectoryState:  GsfTrackProducer:hltEgammaGsfTracksForBParking  23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
BasicTrajectoryState: attempt to access errors when none available  accessing local error..
freestate pointer: parameters
x =          -nan         -nan         -nan
p =          -nan         -nan         -nan
no error defined.

local error valid/values :0
[          nan        -nan        -nan         nan         nan
          -nan        -nan        -nan         nan        -nan
          -nan        -nan        -nan         nan        -nan
           nan         nan         nan        -nan        -nan
           nan        -nan        -nan        -nan         nan ]
%MSG
%MSG-e PosteriorWeightsCalculator:   GsfTrackProducer:hltEgammaGsfTracksForBParking 23-Oct-2022 22:45:51 CEST  Run: 360895 Event: 783357885
PosteriorWeightsCalculator: sumWeight < DBL_MIN
%MSG
%MSG-w TrackNaN:  GsfTrackProducer:hltEgammaGsfTracksForBParking  23-Oct-2022 22:45:51 CEST Run: 360895 Event: 783357885
Track has NaN or the cov is not pos-definite
%MSG
fwyzard commented 1 year ago

+heterogeneous

mmusich commented 1 year ago

@swagata87 it looks like the same kind of problem also hit Prompt Reco.

There is a paused job for run 360888 in the ParkingDoubleMuonLowMass2 PD:

2022-10-24 10:15:40,863:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2022-10-24 10:15:40,874:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'el8_amd64_gcc10', 'scramv1', 'CMSSW', 'CMSSW_12_4_10', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']}
CMSSW Return code: 8001

2022-10-24 10:15:40,874:CRITICAL:CMSSW:Error message: An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 360888 lumi: 199 event: 426362613 stream: 1
   [1] Running path 'dqmoffline_6_step'
   [2] Prefetching for module LogMessageMonitor/'TrackFinderLogMessageMonCommon'
   [3] Prefetching for module LogErrorHarvester/'logErrorHarvester'
   [4] Prefetching for module ConversionProducer/'gsfTracksOpenConversions'
   [5] Prefetching for module ConversionTrackProducer/'gsfTracksOpenConversionTrackProducer'
   [6] Calling method for module GsfTrackProducer/'lowPtGsfEleGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors

Tier-0 will post the coordinates to reproduce shorty.

EDIT: details at https://cms-talk.web.cern.ch/t/logic-error-in-reco-job-for-run-360888-dataset-parkingdoublemuonlowmass2/16641

swagata87 commented 1 year ago

thanks, I've now managed to check the prompt-reco crash also, and I confirm that your patch solves the crash. So I think it makes sense to put in the fix to avoid further crashes.

Your fix was originally meant for a crash with message MultiTrajectoryState mixes states with different signs of local p_z, which happened both offline and HLT, and as far as I know was not possible to reproduce. Thus the fix was not put in back then. The same crash did not happen at HLT recently. But now the other crash MultiTrajectoryState mixes states with and without errors is also fixed with the same patch. So, it looks like these 2 crashes are after all related to each other?

We surely want to fix the crashes urgently, so let's go ahead with the PR, but I've copied the raw file so that it remain accessible, in case we want to do any other check
/eos/cms/store/group/phys_egamma/ec/swmukher/gsfcrashfile/00266a70-1a8f-4101-8391-ca99df9b51ca.root

The HLT error_stream files are also there, but none of them are reproducible in lxplus-gpu or lxplus(cpu). So it becomes generally difficult for people to debug with HLT error files, as one needs access to one of the P5 GPU machines. The prompt reco crash raw file should be easier to work with, as anyone can use it for debugging purpose in any usual lxplus machine.

At some point it would be great for egamma to understand these strange warnings triggered from BasicTrajectoryState here. It's unclear to me right now from where we are having missing error or nan matrix. But given the complexity of the Gsf algo it might take a while to fully understand it.

fwyzard commented 1 year ago

@mmusich can you explain to me what the suggested changes do ?

diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index c3958e7f0c4..39a8da622be 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -28,8 +28,15 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
   MultiTrajectoryStateAssembler result;

   int i = 0;
+  float pzSign = 1.;
   for (auto const& tsosI : predictedComponents) {
     TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
+    if (i > 0 && pzSign * updatedTSOS.localParameters().pzSign() < 0) {
+      continue;
+    } else {
+      pzSign *= updatedTSOS.localParameters().pzSign();
+    }
+
     if (updatedTSOS.isValid() && updatedTSOS.localError().valid()) {
       result.addState(TrajectoryStateOnSurface(weights[i],
                                                updatedTSOS.localParameters(),

If updatedTSOS.localParameters().pzSign() is negative, won't this flip the sign of pzSign for each component ?

mmusich commented 1 year ago

is negative, won't this flip the sign of pzSign for each component ?

pzSign is a "check" variable. It starts a positive and then it gets update for each iteration (after the first) by multiplying it with the sign of the next state. Since, unless the charges of the state do not share the same sign, the sign of the product will be +, it essentially catches the states for which the sign gets changed (and removes them, since they lead to the mismatched pZ sign that than leads to a crash).

fwyzard commented 1 year ago

Mhm, I'm not sure I follow.

During the first iteration (when i == 0) the if clause is skipped and the else clause is always processed, so effectively pzSign is set to the sign of of KFUpdator().update(predictedComponents.front(), aRecHit).

Then for the following interations

So, first question: if there are multiple components, why use the sign of the first one? Wouldn't it be better to use the sign that appear more times ?

Then, once a sign is chosen, should it be fixed and not changed any more ?

For example to use the sign of the first component, this loop seems safer:

    if (i == 0) {
      // select the pz sign of the first component
      pzSign = updatedTSOS.localParameters().pzSign();
    } else {
      // skip components with the "wrong" pz sign
      if (pzSign * updatedTSOS.localParameters().pzSign() < 0) {
        continue;
      }
    }
fwyzard commented 1 year ago

(of course, maybe I misunderstood the whole logic...)

fwyzard commented 1 year ago

Now, all of these checks may prevent the crashes, but I'm not sure if actually there is an underlying issue that one should address, and these are just symptoms ?

mmusich commented 1 year ago

Then for the following interations

  • if pzSign has a different sign than updatedTSOS.localParameters().pzSign(), the component is skipped
  • otherwise, pzSign is multiplied by updatedTSOS.localParameters().pzSign(), which could flip it if the latter is negative.

right, I see it can become a problem when there are many states to be added with multiple sign flips. I was wrongly assuming that it would not try to add more than two states...

So, first question: if there are multiple components, why use the sign of the first one? Wouldn't it be better to use the sign that appear more times ?

this is a valid point, though one would need one more loop to collect all the signs.

Now, all of these checks may prevent the crashes, but I'm not sure if actually there is an underlying issue that one should address, and these are just symptoms ?

that's precisely my comment at: https://github.com/cms-sw/cmssw/issues/39234#issuecomment-1232537510, see also https://github.com/cms-sw/cmssw/issues/39234#issuecomment-1279266060.

mmusich commented 1 year ago

For example to use the sign of the first component, this loop seems safer:

   if (i == 0) {
     // select the pz sign of the first component
     pzSign = updatedTSOS.localParameters().pzSign();
   } else {
     // skip components with the "wrong" pz sign
     if (pzSign * updatedTSOS.localParameters().pzSign() < 0) {
       continue;
     }
   }

while logically correct, this doesn't solve the issue at https://cms-talk.web.cern.ch/t/logic-error-in-reco-job-for-run-360888-dataset-parkingdoublemuonlowmass2/16641 :(

mmusich commented 1 year ago

So I have made a bit of archeology dig. The crash reported in prompt reco (cmsTalk) already appeared once this year (see https://github.com/cms-sw/cmssw/issues/39026) and I solved it by proposing https://github.com/cms-sw/cmssw/pull/39056. With the state of the code as of 2b81038ca2d3d3726589e58d1da24c1b7c8f0adf the newly reported crash doesn't happen either. Then, the same file (TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc) was touched twice by @swagata87:

with the state of the code as of https://github.com/cms-sw/cmssw/commit/f3afc63127880c397839549e52c7877ffd518b52 the crash in prompt reco doesn't happen, so I conclude that the issue has been re-introduce somehow by mistake in PR https://github.com/cms-sw/cmssw/pull/39656.

swagata87 commented 1 year ago

https://github.com/cms-sw/cmssw/pull/39656 is a rather simple PR. As I understand, all it does is to make sure that the states we are allowing to go further in the reco chain has a positive definite curvilinear error matrix, so that the states are more well behaved. This should not lead to the crash. We are probably missing something elsewhere.

Also, I noticed that the frequency of this crash at HLT has reduced after https://github.com/cms-sw/cmssw/pull/39656, but given that the crash has not completely disappeared, it's clear that there is something more to it which we are missing currently. I am open to suggestions and happy to try out any suggestions people might have on this.

The reco chain of electron and lowpt-electron is a bit different. So I was wondering if that has something to do with it. The problem could also be in lowpt-electron specific code/config. I have to admit I have not figured it out yet.

mmusich commented 1 year ago

This should not lead to the crash.

unfortunately it does (I invite you to cross-check)
After re-implementing the Sylvester criterion in a tidier way:

diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index f3d6d173c10..dddb731e0e4 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -28,15 +28,46 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
   MultiTrajectoryStateAssembler result;

   int i = 0;
   for (auto const& tsosI : predictedComponents) {
     TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);

+    // in order to check for pos-def of error matrix
+    std::function<bool(AlgebraicSymMatrix55, int)> sylvesterCriterion = [this](AlgebraicSymMatrix55 curvError,
+                                                                               int index) {
+      //Sylvester's criterion, start from the smaller submatrix size
+      double det = 0;
+      if ((!curvError.Sub<AlgebraicSymMatrix22>(0, 0).Det(det)) || det < 0) {
+        edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check sub2.det for state " << index;
+        return false;
+      } else if ((!curvError.Sub<AlgebraicSymMatrix33>(0, 0).Det(det)) || det < 0) {
+        edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check sub3.det for state " << index;
+        return false;
+      } else if ((!curvError.Sub<AlgebraicSymMatrix44>(0, 0).Det(det)) || det < 0) {
+        edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check sub4.det for state " << index;
+        return false;
+      } else if ((!curvError.Det2(det)) || det < 0) {
+        edm::LogWarning("GsfMultiStateUpdator") << "Fail pos-def check det for state " << index;
+        return false;
+      }
+      // in case all the above was not realized
+      return true;
+    };
+
+    if (updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
+        sylvesterCriterion(updatedTSOS.curvilinearError().matrix(), i)) {
+      std::cout << " ###### keeping: " << i << " state" << std::endl;

-    if (double det;
-        updatedTSOS.isValid() && updatedTSOS.localError().valid() && updatedTSOS.localError().posDef() &&
-        (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix22>(0, 0).Det(det) && det > 0) &&
-        (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix33>(0, 0).Det(det) && det > 0) &&
-        (det = 0., updatedTSOS.curvilinearError().matrix().Sub<AlgebraicSymMatrix44>(0, 0).Det(det) && det > 0) &&
-        (det = 0., updatedTSOS.curvilinearError().matrix().Det2(det) && det > 0)) {
       result.addState(TrajectoryStateOnSurface(weights[i],
                                                updatedTSOS.localParameters(),
                                                updatedTSOS.localError(),

and running the reconstruction of the incriminated electron (see recipe at https://github.com/cms-sw/cmssw/issues/35929#issuecomment-1288800329) with the full Sylvester criterion and only the check on the full error matrix determinant (commenting out the first three clauses) I see that in one case there is one state less used for latter (the one with index 2):

2c2
< %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
11,12c11,12
< %MSG-w GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
< Fail pos-def check det for state 1
---
> %MSG-w GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> Fail pos-def check sub2.det for state 1
14c14
< %MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
17c17
< %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
26c26,31
<  ###### keeping: 2 state
---
> %MSG-w GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> Fail pos-def check sub2.det for state 2
> %MSG
> %MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> KF updated state 2 is invalid. skipping.
> %MSG
32c37
< %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
---
> %MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
41,42c46,47
< %MSG-w GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:30 CEST Run: 360888 Event: 426362613
< Fail pos-def check det for state 8
---
> %MSG-w GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
> Fail pos-def check sub2.det for state 8
44c49
< %MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:36:31 CEST Run: 360888 Event: 426362613
---
> %MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  25-Oct-2022 14:25:05 CEST Run: 360888 Event: 426362613
49c54
<  ###### keeping: 11 state
---
>  ###### keeping: 11 state

and this leads to the crash (in Prompt).

francescobrivio commented 1 year ago

@swagata87 do you have any update on this? At the moment (luckily) only one job in Prompt is affected, but the sooner we fix this the better.

Cheers, Francesco as ORM

swagata87 commented 1 year ago

Hello Francesco,

Apart from what Marco has already pointed out, that removing the sub-matrices' positive definite check done by Sylvester criterion cures this crash, I don't have other fix right now.

That particular prompt reco event is puzzling, as, despite all protections in GsfMultiStateUpdator to make sure that matrices are positive definite, the event still prints warning messages of this kind[1], and it is unclear to me from where these states are coming from as we are doing a strict check to remove such states. And it's also unclear to me why keeping a non pos-def state is actually helping to cure a crash.

About https://github.com/cms-sw/cmssw/issues/39234#issuecomment-1279266060 , it would be helpful to know when this alignment update is going to be used for datataking? I'm just curious to see if that would solve the crashes also.

If I manage to gain more insight on this crash, I will write here.

[1]

%MSG-w BasicTrajectoryState:  GsfTrackProducer:lowPtGsfEleGsfTracks  26-Oct-2022 15:52:07 CEST Run: 360888 Event: 426362613
curv error not pos-def
[      13.4648-7.12423e-06 -0.00225463 0.000274802    0.102126
  -7.12423e-06-2.03937e-08 1.01283e-06-1.23354e-07-4.58423e-05
   -0.00225463 1.01283e-06 0.000320388-3.90183e-05  -0.0145005
   0.000274802-1.23354e-07-3.90183e-05 4.75568e-06  0.00176737
      0.102126-4.58423e-05  -0.0145005  0.00176737    0.656814 ]
pos/mom/mf  (-84.5635,-15.529,71.1933)   (0.00323092,-0.122849,-0.0301315)   (-0.0121003,-0.00222207,3.80835) 
mmusich commented 1 year ago

That particular prompt reco event is puzzling, as, despite all protections in GsfMultiStateUpdator to make sure that matrices are positive definite, the event still prints warning messages of this kind[1], and it is unclear to me from where these states are coming from as we are doing a strict check to remove such states. And

it might be worth noticing that messages as in [1] are triggered by this check:

https://github.com/cms-sw/cmssw/blob/2a2ca2ee216a44c2743301dc2dc32b4c7c93f4ad/TrackingTools/TrajectoryState/src/BasicTrajectoryState.cc#L94-L98

which doesn't implement the full Sylvester criterion , but a poor-man version implemented here:

https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/TrackingTools/TrajectoryParametrization/interface/CurvilinearTrajectoryError.h#L51-L55

also I think that those messages are coming from an earlier stage of execution of the code than the filtering done here:

https://github.com/cms-sw/cmssw/blob/83e46c02e21a9d83b997f1c19a93b35b8695e689/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc#L34-L39

swagata87 commented 1 year ago

I still need to understand it better, by doing more checks; but looks like this patch[1] solve the prompt reco crash in that event in run 360888. So it is checking if LocalTrajectoryError of the single states are positive definite or not. If not, they are discarded. I checked it from 12_4_10_patch2.

Also, the warning messages that I was talking about are now gone, for this one event.

[1]

diff --git a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
index d4c6e1b11d2..701180aaeba 100644
--- a/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
+++ b/TrackingTools/GsfTracking/src/TsosGaussianStateConversions.cc
@@ -40,12 +40,14 @@ namespace GaussianStateConversions {
     std::vector<TrajectoryStateOnSurface> components;
     components.reserve(singleStates.size());
     for (auto const& ic : singleStates) {
-      components.emplace_back((*ic).weight(),
-                              LocalTrajectoryParameters((*ic).mean(), pzSign, charged),
-                              LocalTrajectoryError((*ic).covariance()),
-                              surface,
-                              field,
-                              side);
+      if (double det=0; (*ic).covariance().Det2(det) && det>0 ) {
+       components.emplace_back((*ic).weight(),
+                               LocalTrajectoryParameters((*ic).mean(), pzSign, charged),
+                               LocalTrajectoryError((*ic).covariance()),
+                               surface,
+                               field,
+                               side);
+      }
     }
     return TrajectoryStateOnSurface((BasicTrajectoryState*)new BasicMultiTrajectoryState(components));
   }
swagata87 commented 1 year ago

I did some sanity checks by running on 200 events from EGamma RAW file /eos/cms/tier0/store/data/Run2022F/EGamma/RAW/v1/000/361/197/00000/76bd97fa-4ad6-4d85-b941-014e3ed27f9c.root , with and without the above patch. And then I compared pt of electron, lowpt-electron and photon(as converted photons also have gsf track). There were same number of these objects reco'ed with and w/o the patch, and the pt distribution was identical.

elfontan commented 1 year ago

Dear all, let me add for book-keeping the link to the e-log for a few additional crashes of this type happened during runs 361107, 361239, and 361297. Error streams have been saved:

/store/lustre/error/run361107:

/store/lustre/error/run361239:

/store/lustre/error/run361297

I tried to reproduce them using the following recipe in GPU Hilton machines with the first two cases without success:

cmsrel CMSSW_12_4_10_patch3
cd CMSSW_12_4_10_patch3/src
cmsenv
https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 361107 > hlt_run361107.py
cat >>  hlt_run361107.py <<@EOF
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
EvFDaqDirector = _EvFDaqDirector.clone(
    buBaseDir = '/store/error_stream',
    runNumber = 361107
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
source = _source.clone(
    fileListMode = True,
    fileNames = (
        ' /store/error_stream/run361107/run361107_ls0085_index000135_fu-c2b03-36-01_pid1607962.raw',
    )
)
@EOF
cmsRun hlt_run361107.py 

Follow up needed checking the last occurence in run 361297.

Cheers, --Elisa as secondary HLT DOC

swagata87 commented 1 year ago

Coming back to HLT crashes. Some of the recent Gsf crashes were not reproducible. However, now we have a HLT crash which I could reproduce in GPU machine gputest-milan-01 (as before, it is not reproducible in lxplus-gpu). The crash happened in run 361297 at 2022-10-28 23:49:21. Recipe to reproduce the crash:

cmsrel CMSSW_12_4_10_patch3
cd CMSSW_12_4_10_patch3/src/
cmsenv
hltConfigFromDB --runNumber 361297  > hlt.py
cat >> hlt.py <<@EOF
process.source.fileListMode = True
process.source.fileNames = [
'file:/store/error_stream/run361297/run361297_ls0295_index000109_fu-c2b01-23-01_pid786825.raw',
'file:/store/error_stream/run361297/run361297_ls0295_index000135_fu-c2b01-23-01_pid786825.raw'
]
@EOF
cmsRun hlt.py 

Crashes with this message:

----- Begin Fatal Exception 30-Oct-2022 11:53:36 CET-----------------------
An exception of category 'LogicError' occurred while
   [0] Processing  Event run: 361297 lumi: 295 event: 541668964 stream: 0
   [1] Running path 'HLT_DoubleEle8p5_eta1p22_mMax6_v2'
   [2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracksForBParking'
Exception Message:
MultiTrajectoryState mixes states with and without errors
----- End Fatal Exception -------------------------------------------------

Then, I reran with my fix (https://github.com/cms-sw/cmssw/pull/39873) which was for a prompt-reco crash. It cures this HLT crash also. So, I believe that this fix will be useful for HLT as well.

What I learn from this whole exercise is that, apparently similar looking crashes can be triggered by different modules. So a fix for one crash does not necessarily fix other, although the error message will be same. With this fix integrated, the frequency of Gsf crash should reduce. Whether the crashes will completely go away or not, that is to be seen, and actions to be taken accordingly.

francescobrivio commented 1 year ago

There was another instance of the same issue (MultiTrajectoryState mixes states with and without errors) in Run 361971. I tried running the job in:

But I'd like the experts to test it as well. A minimal recipe to reproduce the crash is:

cmsrel CMSSW_12_4_11
cd CMSSW_12_4_11/src
cmsenv
git cms-addpkg TrackingTools/GsfTracking
git remote add swagataRepo git@github.com:swagata87/cmssw.git
git fetch swagataRepo
git cherry-pick aec3325132ddb2d2423dc872561e6ebf35934786
scram b -j 8
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022F/LogicError/job_3520411/vocms014.cern.ch-3520411-3-log.tar.gz .
tar -zxvf vocms014.cern.ch-3520411-3-log.tar.gz
cd job/WMTaskSpace/cmsRun1/

edit the PSet.py to be:

import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
    process.options.numberOfThreads=cms.untracked.uint32(1)
    process.options.numberOfStreams=cms.untracked.uint32(1)
    process.source.eventsToProcess = cms.untracked.VEventRange('361971:435223126-361971:435223128')

run:

cmsRun PSet.py
mmusich commented 1 year ago

@francescobrivio your post seems misplaced here.
https://github.com/cms-sw/cmssw/issues/39987 is perhaps a better place to continue (even though the two issues are linked)

francescobrivio commented 1 year ago

@francescobrivio your post seems misplaced here. #39987 is perhaps a better place to continue (even though the two issues are linked)

Hi Marco, the original crash reported by Tier0 is the same that is being tracked in this issue (the mixes states, not the segFault), but I agree this also post should be propagated in the other issue, I will copy paste my message there as well!

mmusich commented 1 year ago

Hi Marco, the original crash reported by Tier0 is the same that is being tracked in this issue

No, this issue is about HLT, not sure how it got mixed with Tier0 issues

mmusich commented 1 year ago

actually @swagata87 do you know if there were further HLT crashes beyond https://github.com/cms-sw/cmssw/issues/39570#issuecomment-1296218732? In case not, I think this issue could be closed to avoid scattering the discussion in too many places.

swagata87 commented 1 year ago

Hello Marco, I have not seen any other GSF crash in HLT after that (the last one was in 361297 on 28th Oct). But given the crash is very rare and tend to come back after weeks, so I kept this issue open. But we can close it and if HLT Gsf problem happen again, we can open a new issue.

swagata87 commented 1 year ago

Closing this issue, after discussing with FOG convenor. If the crash comes back at HLT, a new issue will be opened.