cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Crash for run 355207-355208 in express StreamHLTMonitor #38626

Open ttedeschi opened 2 years ago

ttedeschi commented 2 years ago

As pointed out here: https://cms-talk.web.cern.ch/t/paused-jobs-for-run-355207-collisions/12561, the following error is encountered when running the express_StreamHLTMonitor workflow for both runs 355207 and 355208:

2022-07-07 09:05:58,824:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc10', 'scramv1', 'CMSSW', 'CMSSW_12_3_6', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']}
CMSSW Return code: 8006

2022-07-07 09:05:58,824:CRITICAL:CMSSW:Error message: An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 355207 lumi: 582 event: 669724552 stream: 6
   [1] Running path 'dqmoffline_step'
   [2] Calling method for module BTVHLTOfflineSource/'BTVHLTOfflineSource'
Exception Message:
RefCore: A request to resolve a reference to a product of type 'reco::Candidate' with ProductID '1:2358'
can not be satisfied because the product cannot be found.
Probably the branch containing the product is not stored in the input file.
   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

Full info can be found here: /afs/cern.ch/user/c/cmst0/public/PausedJobs/ExpressHLTMonitor/job/WMTaskSpace/cmsRun1

A possible fix, even a temporary one, should be found urgently.

cmsbuild commented 2 years ago

A new Issue was created by @ttedeschi Tommaso Tedeschi.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

tvami commented 2 years ago

assign dqm

francescobrivio commented 2 years ago

assign dqm

cmsbuild commented 2 years ago

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

jfernan2 commented 2 years ago

I would comment out this line to remove the module until the BTV experts solve it:

https://github.com/cms-sw/cmssw/blob/1157e3664510ba0f9300604d3e85dab99771ffbe/DQMOffline/Trigger/python/DQMOffline_Trigger_cff.py#L194

The same goes for bTagHLTTrackMonitoringSequence, since it may depend on the former.

fabiocos commented 2 years ago

In the provided tarball, executed in single-threaded mode, the crash happens at event 200 and appears to occur at

https://github.com/cms-sw/cmssw/blob/master/DQMOffline/Trigger/plugins/BTVHLTOfflineSource.cc#L438

Likely something is wrong with jetSVTagsCollPF.

jfernan2 commented 2 years ago

FYI @marco-link @johnalison @JyothsnaKomaragiri @natalia-korneeva @SWuchterl as BTV HLT DQM And Validation Code Developers https://twiki.cern.ch/twiki/bin/viewauth/CMS/DQMContacts#Btag_and_vertexing

NiclasEich commented 2 years ago

BTV HLT DQM developers (@marco-link @terrill37 me) are working on it and can replicate the error.

mmusich commented 2 years ago

seems the place in which it fails is at

https://github.com/cms-sw/cmssw/blob/1157e3664510ba0f9300604d3e85dab99771ffbe/DQMOffline/Trigger/plugins/BTVHLTOfflineSource.cc#L798

NiclasEich commented 2 years ago

> seems the place in which it fails is at
>
> https://github.com/cms-sw/cmssw/blob/1157e3664510ba0f9300604d3e85dab99771ffbe/DQMOffline/Trigger/plugins/BTVHLTOfflineSource.cc#L798

Indeed, we have implemented a null-pointer sanity check before the following push_back calls, which resolves the error.

We are currently preparing the PRs and checking the tests.

mmusich commented 2 years ago

> We are currently preparing the PRs and checking the tests.

OK. This

diff --git a/DQMOffline/Trigger/plugins/BTVHLTOfflineSource.cc b/DQMOffline/Trigger/plugins/BTVHLTOfflineSource.cc
index 464127097b7..19a013938cc 100644
--- a/DQMOffline/Trigger/plugins/BTVHLTOfflineSource.cc
+++ b/DQMOffline/Trigger/plugins/BTVHLTOfflineSource.cc
@@ -795,6 +795,8 @@ std::vector<const reco::Track*> BTVHLTOfflineSource::getOnlineBTagTracks(float h
     unsigned int trackSize = ipInfo.selectedTracks().size();
     for (unsigned int itt = 0; itt < trackSize; ++itt) {
       const auto ptrackRef = (ipInfo.selectedTracks()[itt]);  //TrackRef or
+      if (!ptrackRef.isAvailable())
+        continue;
       const reco::Track* ptrackPtr = reco::btag::toTrack(ptrackRef);
       onlineTracks.push_back(ptrackPtr);
       onlineIP3D.push_back(ip[itt].ip3d.value());

trivially works for me (at least gets past the error), but I am not sure if that's what people want.
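The guard in the patch above can be illustrated outside CMSSW with a minimal, self-contained sketch. `MockTrack`, `MockTrackRef`, `OnlineTrackInfo` and `collectAvailableTracks` are hypothetical stand-ins invented for this illustration and are not CMSSW types; only the `isAvailable()` check before dereferencing mirrors the diff. The sketch also shows why the `continue` has to come before both `push_back` calls: the track and impact-parameter vectors stay aligned index by index.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a reconstructed track (not a CMSSW type).
struct MockTrack {
  float pt;
};

// Hypothetical stand-in for a framework reference. A null target plays the
// role of a reference whose product is not stored in the input file.
class MockTrackRef {
public:
  explicit MockTrackRef(const MockTrack* target = nullptr) : target_(target) {}

  // Same role as the isAvailable() call in the patch: true only if the
  // referenced object can actually be resolved.
  bool isAvailable() const { return target_ != nullptr; }

  const MockTrack* get() const { return target_; }

private:
  const MockTrack* target_;
};

// Parallel containers, filled together so that index i refers to the same track.
struct OnlineTrackInfo {
  std::vector<const MockTrack*> tracks;
  std::vector<float> ip3d;
};

// Collect only the tracks whose reference can be resolved, skipping the
// matching impact-parameter value as well so both vectors stay aligned.
OnlineTrackInfo collectAvailableTracks(const std::vector<MockTrackRef>& refs,
                                       const std::vector<float>& ip3dValues) {
  OnlineTrackInfo out;
  for (std::size_t i = 0; i < refs.size() && i < ip3dValues.size(); ++i) {
    if (!refs[i].isAvailable())
      continue;  // unresolved reference: skip instead of dereferencing
    out.tracks.push_back(refs[i].get());
    out.ip3d.push_back(ip3dValues[i]);
  }
  return out;
}
```

Note that a guard of this kind only avoids the exception; it silently drops the affected tracks and does not explain why the referenced product is missing from the input.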

missirol commented 2 years ago

Thanks to the people who are debugging.

I'm in favour of a(ny) fix being integrated asap, but it would also be useful to know the reason behind this failure. Is some track collection missing in the input file wrt what the DQM client needs/expects?

Knowing better the reason behind the problem might suggest a workaround that we could apply to the HLT menu to be used online this weekend. This could reduce pressure to deploy a new release online.

marco-link commented 2 years ago

Opened the PR. Thanks @mmusich for working on this in parallel (you were ~20 minutes ahead of us :grinning:)

I'll prepare backports for 12_3_X and 12_4_X.

NiclasEich commented 2 years ago

> Thanks to the people who are debugging.
>
> I'm in favour of a(ny) fix being integrated asap, but it would also be useful to know the reason behind this failure. Is some track collection missing in the input file wrt what the DQM client needs/expects?
>
> Knowing better the reason behind the problem might suggest a workaround that we could apply to the HLT menu to be used online this weekend. This could reduce pressure to deploy a new release online.

We are still investigating, and it is not yet clear to us which collection is missing. It might be connected to code that is used only by BTV, but we will follow up on that.