cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Problem in DQM Harvesting step with EgHLTOfflineClient #38970

Open rvenditti opened 2 years ago

rvenditti commented 2 years ago

As a follow-up to "Express job killed at T0 for memory issues at harvesting step in runs 356381 and 356615" (link), we found that the log file shows a problem in the HLT-Egamma client. The messages are:

%MSG-e HLTConfigProvider: EgHLTOfflineClient:egHLTOffDQMClient@beginRun 29-Jul-2022 10:57:14 CEST Run: 356381
Falling back to ProcessName-only init using ProcessName 'HLT' !
%MSG
%MSG-e HLTConfigProvider: EgHLTOfflineClient:egHLTOffDQMClient@beginRun 29-Jul-2022 10:57:14 CEST Run: 356381
Process name 'HLT' not found in registry!
%MSG

We believe that this could lead to the memory issue seen in the Express reconstruction. Can HLT DQM experts have a look?

cmsbuild commented 2 years ago

A new Issue was created by @rvenditti .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 2 years ago

assign dqm

makortel commented 2 years ago

FYI @cms-sw/hlt-l2 @cms-sw/egamma-pog-l2

cmsbuild commented 2 years ago

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

swagata87 commented 2 years ago

I've seen the "Process name 'HLT' not found in registry!" issue before [1]. As far as I am aware, it has been there for quite some time and probably wasn't fixed because it looks like a rather harmless error message (although it needs to be debugged and fixed at some point). I'd be surprised if it caused the memory issue.

Btw, looking at the log files of the 2 runs, I see several other error messages. For example:

%MSG-e MergeFailure:  source 29-Jul-2022 10:16:10 CEST PostBeginProcessBlock
Found histograms with different axis limits or different labels 'ROCs hits multiplicity per event vs LS' not merged.
%MSG
%MSG
%MSG-e DQMGenericClient:  DQMGenericClient:HiJetClient@endRun  29-Jul-2022 10:58:33 CEST End Run: 356381
 DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/HI !!!
%MSG
%MSG-e DQMCorrelationClient:   DQMCorrelationClient:pixelClusterVsLumiPXBarrel@endProcessBlock  29-Jul-2022 10:59:03 CEST post-events
MEs not found! HLT/Pixel/num_clusters_per_Lumisection_PXBarrel not found
%MSG
%MSG-e DQMGenericClient:   HLTMuonRefMethod:hltMuonRefEfficienciesMR@endProcessBlock  03-Aug-2022 08:44:14 CEST post-events
 DQMGenericClient::findAllSubdirectories ==> Missing folder HLT/Muon/MR !!!
%MSG

Could any of these trigger the memory issue? @rvenditti

[1] https://cms-talk.web.cern.ch/t/replay-for-testing-the-run-3-collisions-setup/10676
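To compare the two runs it may help to tally the error blocks by category rather than reading the logs by eye. The following is a minimal sketch, assuming only the block layout visible in the excerpts above (a `%MSG-e <Category>:` header, body lines, then a closing `%MSG`); the function names are illustrative and not part of any CMSSW tool.

```python
import re
from collections import Counter

# Header of an error block as it appears in the excerpts above:
# "%MSG-e <Category>: <rest of header>"
HEADER = re.compile(r"^%MSG-e\s+(?P<category>[^:\s]+):\s*(?P<rest>.*)$")


def parse_error_blocks(lines):
    """Return (category, header_rest, body) tuples for each %MSG-e block."""
    blocks = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER.match(line)
        if match:
            # Start a new block; an unterminated previous block is dropped.
            current = (match.group("category"), match.group("rest"), [])
            continue
        if line.strip() == "%MSG":
            if current is not None:
                body = "\n".join(current[2]).strip()
                blocks.append((current[0], current[1], body))
                current = None
            # A stray closing %MSG with no open block is ignored.
            continue
        if current is not None:
            current[2].append(line)
    return blocks


def count_by_category(lines):
    """Count error blocks per MessageLogger category."""
    return Counter(category for category, _, _ in parse_error_blocks(lines))
```

Passing the lines of a cmsRun stdout log to `count_by_category` gives a per-category count that can be compared between a failing run and a healthy one.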

rvenditti commented 2 years ago

Hi @swagata87, thanks for the comment. Indeed, we have also asked the CTPPS experts to have a look: https://github.com/cms-sw/cmssw/issues/38969. BTW, as pointed out in https://cms-talk.web.cern.ch/t/replay-for-testing-the-run-3-collisions-setup/10676, the cause of the memory issue could be something completely different from the warnings in the cmsRun-stdout.log file, since the warnings seem to have been there for a long time.

@germanfgv are there any other files to be checked in the job report folder from which we can access the stack trace for this job?

germanfgv commented 2 years ago

@rvenditti we don't have access to the stack trace of the job at the moment of termination. You can find 3 sets of log files in the tarball:

Condor logs: _condor_std*
Agent logs: wmagentJob.log (here you can see the performance monitor scanning the use of memory periodically)
cmsRun logs: job/WMTaskSpace/cmsRun1/cmsRun1-stdo*

Other than that, no more information is available.
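For anyone who wants to pull the three sets of logs out of the tarball quickly, here is a small sketch using Python's standard `tarfile` and `fnmatch` modules. The glob patterns are the ones listed above; the group labels, the function names, and the handling of a possible leading directory are assumptions, since the exact tarball layout is not described here.

```python
import fnmatch
import tarfile

# Patterns taken from the list above; the labels are only for display.
LOG_PATTERNS = {
    "condor": "_condor_std*",
    "agent": "wmagentJob.log",
    "cmsRun": "job/WMTaskSpace/cmsRun1/cmsRun1-stdo*",
}


def group_log_members(names):
    """Map each log group to the archive member names matching its pattern."""
    groups = {label: [] for label in LOG_PATTERNS}
    for name in names:
        # Tarballs often store names with a leading "./"; strip it first.
        normalized = name[2:] if name.startswith("./") else name
        for label, pattern in LOG_PATTERNS.items():
            # Accept the pattern at the top level or below any parent
            # directory (assumption: the top-level layout may vary).
            if fnmatch.fnmatchcase(normalized, pattern) or fnmatch.fnmatchcase(
                normalized, "*/" + pattern
            ):
                groups[label].append(name)
    return groups


def list_logs(tarball_path):
    """Open a job tarball and group its members by log type."""
    with tarfile.open(tarball_path) as archive:
        return group_log_members(archive.getnames())
```

`list_logs` only lists member names; extracting a given file can then be done with `tarfile`'s `extractfile` on the chosen name.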

missirol commented 2 years ago

Maybe (re-)stating the obvious: the HLT-related warnings are unrelated to the main issue, i.e. https://github.com/cms-sw/cmssw/issues/38976.

I had a look at the warnings, and I think their origin is clear: the Harvesting modules in question, i.e. instances of EgHLTOfflineClient and EgHLTOfflineSummaryClient, use HLTConfigProvider to find the names of relevant e/gamma HLT paths and filters, and those names are then used to look for input histograms (or, 'monitor elements'), and create outputs (e.g. efficiency graphs, etc). The problem is that HLTConfigProvider will fail and issue a warning when running on DQMIO files, as it won't find there the relevant inputs with process label "HLT". I believe (but haven't checked) these Harvesting modules will instead work as is when DQM+Harvesting steps run on EDM inputs (e.g. AOD files).

In this particular example (and this is maybe not true in other cases), egHLTOffDQMClient uses (runClientEndJob = False, runClientEndLumiBlock = False, runClientEndRun = True), but runClientEndRun is never used inside the plugin, and ultimately the function runClient_ (which creates the harvesting outputs) would not run in any case, regardless of the issue with HLTConfigProvider:
https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_X/DQMOffline/Trigger/plugins/EgHLTOfflineClient.cc#L43
https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_X/DQMOffline/Trigger/plugins/EgHLTOfflineClient.cc#L91
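To make the consequence of that configuration concrete, here is a toy model in Python. It is not the code of EgHLTOfflineClient: the class, its hooks, and the `calls` list are hypothetical, and the only thing taken from the plugin is the observation that the end-of-run flag is stored but never checked.

```python
class ToyHarvestingClient:
    """Toy stand-in for a harvesting client; NOT the real plugin code."""

    def __init__(self, run_client_end_job, run_client_end_lumi_block,
                 run_client_end_run):
        self.run_client_end_job = run_client_end_job
        self.run_client_end_lumi_block = run_client_end_lumi_block
        # Stored but never consulted by any hook below, mirroring the
        # observation about runClientEndRun.
        self.run_client_end_run = run_client_end_run
        self.calls = []

    def _run_client(self, where):
        # Stand-in for runClient_, which would create the harvesting outputs.
        self.calls.append(where)

    def end_lumi_block(self):
        if self.run_client_end_lumi_block:
            self._run_client("endLumiBlock")

    def end_job(self):
        if self.run_client_end_job:
            self._run_client("endJob")

    def end_run(self):
        # Hypothetical hook: run_client_end_run is not checked here,
        # so setting it to True has no effect.
        pass


# Same flag values as quoted for egHLTOffDQMClient.
client = ToyHarvestingClient(False, False, True)
client.end_lumi_block()
client.end_run()
client.end_job()
print(client.calls)
```

With the quoted flag values no hook ever reaches `_run_client`, which is the behaviour described above: the outputs are not produced whether or not HLTConfigProvider initialises correctly.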

I think one could try to improve these Harvesting modules by extracting the relevant filter/path names based on the available input histograms; this way, the module could work both (1) on DQMIO inputs and (2) when DQM+Harvesting run on EDM inputs. Before updating the plugins though, it should probably be clarified whether these plugins are actually important and worth updating. This can only be answered by EGM (@swagata87) and DQM experts.
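A rough sketch of that idea, in Python for brevity: recover the filter names from the names of the histograms that are actually present, instead of asking HLTConfigProvider. The naming scheme `<filterName>_<suffix>`, the list of suffixes, and the function name are assumptions made for illustration; the real monitor-element naming of these plugins would have to be checked first.

```python
def filter_names_from_histograms(histogram_names, known_suffixes):
    """Recover filter names from monitor-element names.

    Assumes (hypothetically) that each relevant histogram is named
    '<filterName>_<suffix>' with the suffix drawn from known_suffixes.
    Returns the filter names in first-seen order, without duplicates.
    """
    # Try longer suffixes first so that e.g. 'trk_pt' wins over 'pt'.
    suffixes = sorted(known_suffixes, key=len, reverse=True)
    found = []
    seen = set()
    for name in histogram_names:
        for suffix in suffixes:
            tail = "_" + suffix
            if name.endswith(tail) and len(name) > len(tail):
                candidate = name[: -len(tail)]
                if candidate not in seen:
                    seen.add(candidate)
                    found.append(candidate)
                break  # one suffix per histogram name is enough
    return found
```

Because the names come from the input file itself, the same lookup would apply to DQMIO inputs and to DQM+Harvesting on EDM inputs, at the cost of depending on a stable histogram naming convention.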

(Reminder: the workflows of the HLT offline-DQM are maintained by DQM, and mostly developed by POGs; they are not under the direct watch of HLT L2s.)