cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

DQM memory usage in 2022 re-reco #42504

Open makortel opened 1 year ago

makortel commented 1 year ago

This issue is to follow up on the investigation started in https://github.com/cms-sw/cmssw/issues/40437 for the parts concerning DQM memory usage.

I profiled the job https://github.com/cms-sw/cmssw/issues/40437#issuecomment-1630699980 with IgProf MEM_LIVE with

In both cases the profile shows the state of the heap memory after the 10th event.

makortel commented 1 year ago

assign dqm

cmsbuild commented 1 year ago

New categories assigned: dqm

@tjavaid,@micsucmed,@nothingface0,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 1 year ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

One view into the memory is given by DQMEDAnalyzer::beginRun()

Notably, HLTMuonOfflineAnalyzer::dqmBeginRun() should already have been fixed in https://github.com/cms-sw/cmssw/pull/42437 and its backports.

makortel commented 1 year ago

SiPixelPhase1TrackClusters::dqmBeginRun() takes 210 MB per stream in SiPixelTemplate::pushfile() in https://github.com/cms-sw/cmssw/blob/bce2eaf8a4286ef0badaaa83197dd734ec282073/DQM/SiPixelPhase1Track/plugins/SiPixelPhase1TrackClusters.cc#L124-L130 https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue40437/reco_07.10_live/119

SiPixelTemplate::pushfile() is also used in the PixelCPEClusterRepair constructor. Are pixel template details at this level really needed in DQM running as part of re-reco? If yes, is there any chance the same pushfile() contents could be shared by all streams and PixelCPEClusterRepair? @cms-sw/trk-dpg-l2

VinInn commented 1 year ago

210 MB per stream saturates the DQM memory budget by itself. It MUST be solved.

mmusich commented 1 year ago

> Are pixel template details at this level really needed in DQM running as part of re-reco?

This part of the code was introduced in https://github.com/cms-sw/cmssw/pull/33635 with the stated goal of being able to monitor:

  • the cluster charge corrected with the template
  • the template correction itself (these distributions are added for each layer/ring/disk)

This is needed because certain groups (e.g. in the EXO PAG) have requested that the corrections be available for private analysis use in pixel dE/dx related quantities, hence the need to monitor the corrected charge in DQM (also, and especially, in re-reco). Having said this, what is actually needed is this:

templ.interpolate(templateDBobject_->getTemplateID(id), cotAlpha, cotBeta, locBz, locBx);
auto charge_cor = (charge * templ.qscale()) / templ.r_qMeas_qTrue();
templ.qscale() / templ.r_qMeas_qTrue();

I am not sure if there is a way to obtain just the qscale() and r_qMeas_qTrue() without pushing the full template and interpolating it.
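For reference, the arithmetic in the snippet above can be isolated as a minimal sketch. `TemplateFactors`, `templateCorrection` and `correctedCharge` are hypothetical names used only for illustration, not CMSSW types or functions; the struct simply holds the two values that would be read back from the template after `interpolate()`.

```cpp
#include <cassert>

// Hypothetical stand-in (not a CMSSW type) for the two numbers read back
// from the template object after interpolate().
struct TemplateFactors {
  float qscale;         // value returned by templ.qscale()
  float r_qMeas_qTrue;  // value returned by templ.r_qMeas_qTrue()
};

// The template correction itself: qscale / r_qMeas_qTrue.
inline float templateCorrection(const TemplateFactors& f) {
  assert(f.r_qMeas_qTrue != 0.0f);  // guard against division by zero
  return f.qscale / f.r_qMeas_qTrue;
}

// Corrected cluster charge: (charge * qscale) / r_qMeas_qTrue.
inline float correctedCharge(float charge, const TemplateFactors& f) {
  assert(f.r_qMeas_qTrue != 0.0f);
  return (charge * f.qscale) / f.r_qMeas_qTrue;
}
```

The sketch only makes explicit that the two monitored quantities depend on the template through two scalars; it does not remove the need to load and interpolate the template in order to obtain them.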

> If yes, is there any chance the same pushfile() contents could be shared by all streams and PixelCPEClusterRepair?

Would creating the conditions in an ESProducer and consuming its product, instead of the "raw" database object, solve this issue as well (see https://github.com/cms-sw/cmssw/issues/40544)?

Let me tag here @tvami @pmaksim1 @ferencek @mroguljic, not sure if there is any other pixel expert to be tagged.

makortel commented 1 year ago

> If yes, is there any chance the same pushfile() contents could be shared by all streams and PixelCPEClusterRepair?

> would creating the conditions in ESProducer and consuming its product instead of the "raw" database object, solve this issue as well (see #40544) ?

I'm pretty sure an ESProduct would solve the issue.
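To illustrate why a single shared product helps, here is a minimal standalone sketch contrasting per-stream copies with one read-only object shared by all streams. It does not use the CMSSW framework: `TemplatePayload`, `LoadCounter`, `loadPayload`, `perStreamCopies` and `sharedAcrossStreams` are all hypothetical names, and `loadPayload` merely stands in for the expensive `pushfile()` call.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the large payload filled by pushfile(); not a CMSSW type.
struct TemplatePayload {
  std::vector<float> entries;
};

// Counts how many times the expensive load was executed.
struct LoadCounter {
  int loads = 0;
};

// Hypothetical loader standing in for the pushfile() call.
inline TemplatePayload loadPayload(LoadCounter& counter, std::size_t nEntries) {
  ++counter.loads;
  return TemplatePayload{std::vector<float>(nEntries, 1.0f)};
}

// Per-stream pattern: every stream loads and owns its own copy.
inline std::vector<TemplatePayload> perStreamCopies(LoadCounter& counter,
                                                    std::size_t nStreams,
                                                    std::size_t nEntries) {
  assert(nStreams > 0);
  std::vector<TemplatePayload> copies;
  copies.reserve(nStreams);
  for (std::size_t i = 0; i < nStreams; ++i) {
    copies.push_back(loadPayload(counter, nEntries));
  }
  return copies;
}

// Shared pattern: load once, then give every stream a pointer to the same
// immutable object.
inline std::vector<std::shared_ptr<const TemplatePayload>> sharedAcrossStreams(
    LoadCounter& counter, std::size_t nStreams, std::size_t nEntries) {
  assert(nStreams > 0);
  auto payload = std::make_shared<const TemplatePayload>(loadPayload(counter, nEntries));
  return std::vector<std::shared_ptr<const TemplatePayload>>(nStreams, payload);
}
```

With the per-stream pattern the memory cost scales with the number of streams; assuming, purely for illustration, 8 streams at 210 MB each, that is 1680 MB, versus 210 MB when one object is shared.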

Dr15Jones commented 1 year ago

Unless there is an objection, I will make the necessary ESProducer today.

mmusich commented 1 year ago

Cross-posting from https://github.com/cms-sw/cmssw/issues/40544#issuecomment-1669985444: @cms-sw/pdmv-l2, what's the driver command used for the 2022 re-reco?

Dr15Jones commented 1 year ago

The conversion to using an ESProducer can be found in #42514.

sunilUIET commented 1 year ago

@mmusich

You can find the 2022 re-reco commands here:

https://cms-pdmv-prod.web.cern.ch/rereco/api/requests/get_cmsdriver/ReReco-Run2022C-EGamma-27Jun2023-00001