cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.33k forks source link

Crash in GsfTrackProducer/'lowPtGsfEleGsfTracks' on ParkingDoubleMuonLowMass5 Prompt Reco dataset #39026

Closed emanueleusai closed 2 years ago

emanueleusai commented 2 years ago

Hello

We have two instances of PromptReco jobs for dataset ParkingDoubleMuonLowMass5 crashing when running dqm offline:

cmsRun1 Fatal Exception (Exit Code: 8001) An exception of category 'LogicError' occurred while [0] Processing Event run: 356948 lumi: 46 event: 59404076 stream: 5 [1] Running path 'dqmoffline_6_step' [2] Prefetching for module LogMessageMonitor/'TrackFinderLogMessageMonCommon' [3] Prefetching for module LogErrorHarvester/'logErrorHarvester' [4] Prefetching for module ConversionProducer/'gsfTracksOpenConversions' [5] Prefetching for module ConversionTrackProducer/'gsfTracksOpenConversionTrackProducer' [6] Calling method for module GsfTrackProducer/'lowPtGsfEleGsfTracks' Exception Message: MultiTrajectoryState mixes states with and without errors

Logs: /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022C/DQMLogicError/vocms013.cern.ch-17054

This seems to be related to GsfTrackProducer/'lowPtGsfEleGsfTracks'. Can @cms-sw/egamma-pog-l2 @cms-sw/tracking-pog-l2 @nancymarinelli @sikler please look into it?

cmsbuild commented 2 years ago

A new Issue was created by @emanueleusai Emanuele Usai.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

emanueleusai commented 2 years ago

assign dqm,egamma,tracking

cmsbuild commented 2 years ago

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

emanueleusai commented 2 years ago

See cmsTalk post here: https://cms-talk.web.cern.ch/t/error-in-promptreco-for-run-356948-dataset-parkingdoublemuonlowmass5/13791

mmusich commented 2 years ago

assign reconstruction, tracking-pog, egamma-pog

cmsbuild commented 2 years ago

New categories assigned: tracking-pog,reconstruction,egamma-pog

@jpata,@slava77,@mmusich,@lfinco,@clacaputo,@swagata87,@vmariani you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented 2 years ago

unassign dqm

mmusich commented 2 years ago

for the record the exception is triggered here: https://github.com/cms-sw/cmssw/blob/2a2ca2ee216a44c2743301dc2dc32b4c7c93f4ad/TrackingTools/GsfTools/src/BasicMultiTrajectoryState.cc#L19-L21

mmusich commented 2 years ago

@emanueleusai @tyjyang (ORM) have you tried to reproduce? Can you provide the first event that fails for ease of debugging?

mmusich commented 2 years ago

have you tried to reproduce? Can you provide the first event that fails for ease of debugging?

answering to self:

import FWCore.ParameterSet.Config as cms
from PSet import process
# skip to failing event
process.source.eventsToProcess = cms.untracked.VEventRange('356948:59404076-356948:59404076')
process.options.numberOfThreads = 1

will fail at the first event.

jordan-martins commented 2 years ago

Hi, this has(is) also being discussed here

mmusich commented 2 years ago

so, for starters I reconstructed the rather involved chain of calls. The exception originates here in the constructor of BasicMultiTrajectoryState

https://github.com/cms-sw/cmssw/blob/2a2ca2ee216a44c2743301dc2dc32b4c7c93f4ad/TrackingTools/GsfTools/src/BasicMultiTrajectoryState.cc#L19-L21

The constructor of BasicMultiTrajectoryState is called here by MultiTrajectoryStateAssembler::combinedState()

https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/TrackingTools/GsfTools/src/MultiTrajectoryStateAssembler.cc#L86

which in turn is called by GsfMultiStateUpdator::update(...)

https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc#L46

which is called by GsfTrajectoryFitter::fitOne(...):

https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/TrackingTools/GsfTracking/src/GsfTrajectoryFitter.cc#L126

which in turn is called by TrackProducerAlgorithm<reco::GsfTrack>::buildTrack(...)

https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/RecoTracker/TrackProducer/src/TrackProducerAlgorithm.cc#L268

which is called by TrackProducerAlgorithm<T>::runWithCandidate(...)

https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/RecoTracker/TrackProducer/interface/TrackProducerAlgorithm.icc#L72-L73

which finally is called by GsfTrackProducer::produce(..)

https://github.com/cms-sw/cmssw/blob/6f84b7b545d851cee15a6bdefff659548ac11002/RecoTracker/TrackProducer/plugins/GsfTrackProducer.cc#L76-L98

which is where the exception gets caught.

mmusich commented 2 years ago

with this patch:

diff --git a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
index 4b5010d3414..c3958e7f0c4 100644
--- a/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
+++ b/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc
@@ -30,7 +30,7 @@ TrajectoryStateOnSurface GsfMultiStateUpdator::update(const TrajectoryStateOnSur
   int i = 0;
   for (auto const& tsosI : predictedComponents) {
     TrajectoryStateOnSurface updatedTSOS = KFUpdator().update(tsosI, aRecHit);
-    if (updatedTSOS.isValid()) {
+    if (updatedTSOS.isValid() && updatedTSOS.localError().valid()) {
       result.addState(TrajectoryStateOnSurface(weights[i],
                                                updatedTSOS.localParameters(),
                                                updatedTSOS.localError(),

the event Event run: 356948 lumi: 46 event: 59404076, does get reconstructed, albeit with quite a few warnings and an ominous error message:

%MSG
%MSG-e PosteriorWeightsCalculator:  GsfTrackProducer:lowPtGsfEleGsfTracks  12-Aug-2022 13:45:19 CEST Run: 356948 Event: 59404076
PosteriorWeightsCalculator: sumWeight < DBL_MIN
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  12-Aug-2022 13:45:19 CEST Run: 356948 Event: 59404076
 no weights could be retreived. invalid updated state !.
%MSG
%MSG-e InvalidState:  GsfTrackProducer:lowPtGsfEleGsfTracks  12-Aug-2022 13:45:19 CEST Run: 356948 Event: 59404076
inside hit
%MSG

I'd appreciate feedback from any electron reconstruction expert (if any still exists).

bainbrid commented 2 years ago

Thank @mmusich. Unfortunately, I'm not an expert of the GSF tracking code ... @lsoffi @swagata87 @VinInn ?

swagata87 commented 2 years ago

I have seen those 3 error message appearing together several times while running electron reco at HLT level recently. I suspect (purely observational though) that those 3 messages are related. If for some reason sumWeight < DBL_MIN , ie, if sumWeight is basically 0, then an empty weight vector is returned: https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/TrackingTools/GsfTracking/src/PosteriorWeightsCalculator.cc#L112-L115 which leads to the 2nd error message, no weights could be retreived. invalid updated state !, because weights are empty, https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/TrackingTools/GsfTracking/src/GsfMultiStateUpdator.cc#L22-L25. The 3rd one is not so obviously related to the 1st/2nd, but I've seen it appearing together with them, so I assume they are somehow related: https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/TrackingTools/GsfTracking/src/GsfTrajectoryFitter.cc#L75-L77

Before saying that the errors are harmless, I would like to do some sanity checks. In events where these errors come up, I will check the compatibility of electron supercluster Et/Eta/Phi and electron GSF Pt/Eta/Phi. I will get back to you here. If things look reasonable in the test, we can go ahead with Marco's solution.

swagata87 commented 2 years ago

I checked only one problematic event, the one that Marco pointed out above (356948:59404076). Here are my findings: In this event, there is one reconstructed supercluster in endcap, with energy= 33.4 GeV, Eta=-2.6, Phi=-1.8, and Et=4.9 GeV. And there are 2 reconstructed GSF tracks: GSF1 Pt=0.978 GeV, Eta=-2.52, Phi=-1.86 GSF2 Pt=0.878 GeV, Eta=-2.57, Phi=-2.03

Pt of the supercluster does not match with any of the GSF, but Eta/Phi sort of matches.

There is no reconstructed electron in the event. So, it's hard to judge just from this one event. The supercluster could be due to a jet, and not a real electron.

Does anyone know if there is any other event with same issue? Maybe that can be checked.

mmusich commented 2 years ago

Does anyone know if there is any other event with same issue? Maybe that can be checked.

@swagata87 if you follow the link to the cmsTalk post provided by the OP, you will see that there is a second instance of the same problem in run 356947, tarball at: /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2022C/DQMLogicError/vocms013.cern.ch-1700050-3-log.tar.gz

mmusich commented 2 years ago

while the e/gamma experts take a second look, I've opened https://github.com/cms-sw/cmssw/pull/39056 in order to start giving it some wider exposure.

bainbrid commented 2 years ago

I checked only one problematic event, the one that Marco pointed out above (356948:59404076). Here are my findings: In this event, there is one reconstructed supercluster in endcap, with energy= 33.4 GeV, Eta=-2.6, Phi=-1.8, and Et=4.9 GeV. And there are 2 reconstructed GSF tracks: GSF1 Pt=0.978 GeV, Eta=-2.52, Phi=-1.86 GSF2 Pt=0.878 GeV, Eta=-2.57, Phi=-2.03

Pt of the supercluster does not match with any of the GSF, but Eta/Phi sort of matches.

There is no reconstructed electron in the event. So, it's hard to judge just from this one event. The supercluster could be due to a jet, and not a real electron.

Does anyone know if there is any other event with same issue? Maybe that can be checked.

Hi @swagata87, this is entirely possible with the low-pT algorithm. The chain is tracker-driven and GSF tracks are used to seed a search for a SC in ECAL. The SC algo is "greedy" in the sense that the GSF track is always dR-matched to the first (and nearest) ECAL cluster found, i.e. there is no upper bound in dR, while all subsequent clusters must be compatible with the GSF track trajectory according to requirements in dEta and dPhi. (Fake electrons with "mismatched" tracks/SCs are later dealt with via the ID.)

I'm not sure how this interplays with the exception, as it is thrown at the "GSF tracking" stage, before any match to a SC is attempted ... (Ok, technically, the preceding "seeding" step does a "lightweight" GSF tracking and attempts to match to an ECAL / HCAL cluster...)

swagata87 commented 2 years ago

the other event (356947:413408861 happening in JetMET dataset) can't be checked because the RAW file is not there in the location anymore. Input file root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2022C/JetMET/RAW/v1/000/356/947/00000/e7e7fd8c-d2ed-47c0-a46f-978790000957.root?eos.app=cmst0 could not be opened.

in that case, I'd say let's proceed with the fix. From e/gamma side, we will closely check electron reco efficiency in release validation, after this fix is merged, just to make sure that things are fine.

thank you @mmusich and @bainbrid

mmusich commented 2 years ago

can't be checked because the RAW file is not there in the location anymore.

indeed, but it's surprising. What's the current policy to keep RAW data on disk at Tier-0 @germanfgv ?

germanfgv commented 2 years ago

@mmusich all data we produce stays at T0 for a minimum of 7 days. Then, if it has been transferred to the desired locations and is not being used by us anymore, it gets deleted.

As we failed the paused jobs last Friday, these files were deleted from T0 disk.

mmusich commented 2 years ago

+1

swagata87 commented 2 years ago

[If it is not okay to write in an closed issue, then I apologise, and can create a new issue. I am writing here as this is very much related]

It looks like the same issue came back (only once, and at HLT level) in a collision run, despite the fix. It is clear that the crash is extremely rare, but it is surprising that it came back despite the fix was merged.

Run number: 359686 StartTime: Oct 1 2022, 01:11 EndTime: Oct 1 2022, 07:19 Error message:

[2] Calling method for module GsfTrackProducer/'hltEgammaGsfTracks'
Exception Message:
MultiTrajectoryState mixes states with and without errors

HLT Menu: /cdaq/physics/Run2022/2e34/v1.4.0/HLT/V5 Relevant elog: http://cmsonline.cern.ch/cms-elog/1157779

We can follow up once the error stream file is available.

slava77 commented 2 years ago

If it is not okay to write in an closed issue, then I apologise, and can create a new issue. I am writing here as this is very much related

I'd suggest to open a new issue. A note made here is useful still.