cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Profiling a T0 prompt reco workflow #36282

Closed jpata closed 2 years ago

jpata commented 2 years ago

In the context of the recent OOM issues in reco workflows at T0, we are looking into setting up regular profiling for an 8-threaded, prompt-reco-like workflow.

So far we have come up with the following:

runTheMatrix.py --ibeos -l 136.889 --command="-n 5000 --nThreads 8 --customise Validation/Performance/TimeMemoryInfo.py"
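
For regular, scripted profiling, the same invocation could be assembled programmatically. This is a minimal sketch that uses only the flags quoted in the command above; `build_profiling_command` is a hypothetical helper name, and nothing is executed here.

```python
import shlex


def build_profiling_command(workflow="136.889", n_events=5000, n_threads=8):
    """Assemble the runTheMatrix.py argument list quoted in this issue.

    Hypothetical helper: only the flags that appear in the issue are used.
    """
    inner = (
        f"-n {n_events} --nThreads {n_threads} "
        "--customise Validation/Performance/TimeMemoryInfo.py"
    )
    # The whole inner string is a single argument to --command.
    return ["runTheMatrix.py", "--ibeos", "-l", workflow, f"--command={inner}"]


args = build_profiling_command()
# shlex.join quotes the --command value so the line can be pasted into a shell.
print(shlex.join(args))
```

Keeping the arguments as a list avoids quoting mistakes around the nested `--command` string if the job is later launched from a script.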

which gives something like the following for the RSS of the full job of 136.889_RunMET2018D+RunMET2018D+HLTDR2_2018+RECODR2_2018reHLT_skimMET_Prompt+HARVEST2018_Prompt/step3_RAW2DIGI_L1Reco_RECO_SKIM_EI_PAT_ALCA_DQM.py

From what I see, the RSS is far from the 16 GB redline, but we want to keep an eye on it going forward.

@cms-sw/dqm-l2 @cms-sw/alca-l2 is this representative of what's running at T0 as far as ALCA and DQM are concerned? If not, can you suggest a more representative workflow?

cc @cms-sw/reconstruction-l2 @clacaputo

jpata commented 2 years ago

assign reconstruction

cmsbuild commented 2 years ago

New categories assigned: reconstruction

@slava77,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 years ago

A new Issue was created by @jpata Joosep Pata.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

slava77 commented 2 years ago

> @cms-sw/dqm-l2 @cms-sw/alca-l2 is this representative of what's running at T0 as far as ALCA and DQM are concerned? If not, can you suggest a more representative workflow?

May I suggest that ALCA and DQM check the MET PD configuration in the promptReco and clarify whether the 136.889_RunMET2018D workflow actually runs at least as many ALCA/Skim/DQM elements as the prompt reco? I think that having more in the matrix workflow is OK, but having fewer is not good, especially if the missing ALCA/Skim/DQM elements are not exercised anywhere else in the matrix workflows.

jfernan2 commented 2 years ago

wf136.889 in 12_1_0 runs at step3 DQM:@standardDQMFakeHLT+@miniAODDQM,
where @standardDQMFakeHLT is @ecal+@hcal+@hcal2+@strip+@pixel+@castor+@ctpps+@muon+@tracking+@jetmet+@egamma+@L1TMon+@btag+@beam+@physics

while the MET Prompt PD in the replay triggered at https://github.com/dmwm/T0/pull/4619 runs @common+@jetmet+@L1TMon+@hcal,
where @common is @stripCommon+@pixel+@tracking+@hlt+@beam+@castor+@physics

So wf136.889 has a larger DQM sequence load than the Prompt MET that is giving issues in the replay at https://github.com/dmwm/T0/pull/4619#issuecomment-981495384

Full definition of the sequences: https://github.com/jfernan2/cmssw/blob/da89fa94b73e9e4560ceaaebaffbaa2bf478a096/DQMOffline/Configuration/python/autoDQM.py

Thanks
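
The two DQM specifications quoted above can be compared label by label. The sketch below copies the two alias definitions from this comment and expands them one level; `expand` and `ALIASES` are illustrative names, and the comparison is by label only, so it says nothing about which modules each label actually schedules or how much memory they use.

```python
# Alias definitions copied from the comment above (leading "@" dropped).
ALIASES = {
    "standardDQMFakeHLT": [
        "ecal", "hcal", "hcal2", "strip", "pixel", "castor", "ctpps", "muon",
        "tracking", "jetmet", "egamma", "L1TMon", "btag", "beam", "physics",
    ],
    "common": [
        "stripCommon", "pixel", "tracking", "hlt", "beam", "castor", "physics",
    ],
}


def expand(spec):
    """Expand a '+'-separated DQM spec one level into a set of labels."""
    labels = set()
    for name in spec.replace("@", "").split("+"):
        # Labels without a known alias are kept as they are.
        labels.update(ALIASES.get(name, [name]))
    return labels


workflow = expand("@standardDQMFakeHLT+@miniAODDQM")
prompt_met = expand("@common+@jetmet+@L1TMon+@hcal")

print("only in T0 prompt MET:", sorted(prompt_met - workflow))
print("only in wf 136.889:   ", sorted(workflow - prompt_met))
```

By label name, `hlt` and `stripCommon` appear only on the T0 side. Whether their modules are already covered by `strip` or by the FakeHLT variant would need to be checked against autoDQM.py.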

tvami commented 2 years ago

136.889 is running ALCA:SiStripCalZeroBias+SiStripCalMinBias+SiStripCalSmallBiasScan+TkAlMinBias+EcalESAlign and there is no AlCa harvesting step in it. Should we have a wf that does more of the PCL work?

Checking a T0 config I see HcalCalNoise https://github.com/dmwm/T0/blob/b609841f28e8ff0a63190b374866fc337526bb44/etc/ReplayOfflineConfiguration.py#L639

So 136.889 is running more things than T0 but not the one that T0 is really running :)

mmusich commented 2 years ago

> Should we have a wf that does more of the PCL work?

The PCL runs on the express and not on the MET prompt reco.

slava77 commented 2 years ago

> 136.889 is running ALCA:SiStripCalZeroBias+SiStripCalMinBias+SiStripCalSmallBiasScan+TkAlMinBias+EcalESAlign and there is no AlCa harvesting step in it. Should we have a wf that does more of the PCL work?
>
> Checking a T0 config I see HcalCalNoise https://github.com/dmwm/T0/blob/b609841f28e8ff0a63190b374866fc337526bb44/etc/ReplayOfflineConfiguration.py#L639
>
> So 136.889 is running more things than T0 but not the one that T0 is really running :)

it sounds like a good case to update the workflow 136.889 already.

I also see physics_skims=["EXOMONOPOLE", "HighMET", "LogError", "LogErrorMonitor"] in T0 config, compared to SKIM:HighMET+EXOMONOPOLE. So, LogError and LogErrorMonitor could be added as well.
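
A small sketch of how the gap between the workflow and the T0 configuration could be listed mechanically. The ALCA and SKIM lists are the ones quoted in this thread; the comma-joined `STEP:a+b` input format and the helper `parse_steps` are assumptions made for illustration, and the T0 ALCA set holds only the one producer named above, not the full T0 list.

```python
def parse_steps(spec):
    """Parse a 'STEP:a+b,STEP2:c' string into {step: set_of_items}."""
    steps = {}
    for part in spec.split(","):
        name, _, items = part.partition(":")
        steps[name] = set(items.split("+")) if items else set()
    return steps


# Workflow 136.889 content as quoted in this thread.
workflow = parse_steps(
    "ALCA:SiStripCalZeroBias+SiStripCalMinBias+SiStripCalSmallBiasScan"
    "+TkAlMinBias+EcalESAlign,"
    "SKIM:HighMET+EXOMONOPOLE"
)

# T0 side: physics_skims as quoted above; for ALCA only the producer
# mentioned in the thread (the complete list lives in the T0 config).
t0_skims = {"EXOMONOPOLE", "HighMET", "LogError", "LogErrorMonitor"}
t0_alca = {"HcalCalNoise"}

missing_skims = sorted(t0_skims - workflow["SKIM"])
missing_alca = sorted(t0_alca - workflow["ALCA"])
print("skims to add:", missing_skims)
print("ALCA to add: ", missing_alca)
```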

dpiparo commented 2 years ago

It will be very useful to have this workflow monitored, thanks for preparing it. Would a plot with time on the x axis and memory on the y axis, with one entry per event, help us keep an eye on throughput and memory at the same time?

jpata commented 2 years ago

What we have as a starting point now is the multithreaded report from TimeMemoryInfo, which contains the memory after each event (with a timestamp), and the time spent processing each event.

...
TimeModule> 54835382 320822 SiStripMonitorClusterBPTX SiStripMonitorCluster 0.035146
TimeModule> 54440826 320822 RECOoutput_step EndPathStatusInserter 5.96046e-06
TimeModule> 54679285 320822 SiStripMonitorTrackCommon SiStripMonitorTrack 0.233463
TimeEvent> 54440826 320822 23.6699
%MSG-w MemoryCheck:  PostProcessPath 26-Nov-2021 15:35:13 CET  Run: 320822 Event: 54440826
MemoryCheck: event : VSIZE 12746.7 0 RSS 7567.22 0
%MSG
...
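
A minimal parsing sketch for the excerpt above, to show how per-event time and RSS could be extracted. The column meanings (event number, run number, seconds for `TimeEvent>`; RSS in MB after the `RSS` keyword) are inferred from this excerpt only, and `parse_log` is an illustrative name. In a multithreaded job, lines from different events may interleave, so pairing a `MemoryCheck` value line with the header just before it is an assumption.

```python
import re
from datetime import datetime

SAMPLE = """\
TimeModule> 54679285 320822 SiStripMonitorTrackCommon SiStripMonitorTrack 0.233463
TimeEvent> 54440826 320822 23.6699
%MSG-w MemoryCheck:  PostProcessPath 26-Nov-2021 15:35:13 CET  Run: 320822 Event: 54440826
MemoryCheck: event : VSIZE 12746.7 0 RSS 7567.22 0
%MSG
"""

TIME_EVENT = re.compile(r"^TimeEvent> (\d+) (\d+) ([\d.eE+-]+)")
MEM_HEADER = re.compile(
    r"^%MSG-w MemoryCheck:\s+\S+ (\S+ \S+) \S+\s+Run: (\d+) Event: (\d+)"
)
MEM_VALUES = re.compile(r"^MemoryCheck: event : VSIZE ([\d.]+) \S+ RSS ([\d.]+)")


def parse_log(lines):
    """Return ({event: seconds}, [(timestamp, event, rss_mb), ...])."""
    event_time = {}
    memory = []
    pending = None  # (timestamp, event) from the last MemoryCheck header
    for line in lines:
        if m := TIME_EVENT.match(line):
            event_time[int(m.group(1))] = float(m.group(3))
        elif m := MEM_HEADER.match(line):
            # %b parses English month names under the default C locale.
            stamp = datetime.strptime(m.group(1), "%d-%b-%Y %H:%M:%S")
            pending = (stamp, int(m.group(3)))
        elif (m := MEM_VALUES.match(line)) and pending:
            memory.append((*pending, float(m.group(2))))
            pending = None
    return event_time, memory


event_time, memory = parse_log(SAMPLE.splitlines())
print(event_time)
print(memory)
```

The `memory` list has one entry per event with a timestamp, which is the input needed for a memory-versus-time plot.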

In addition to the memory as a function of time shown above, are we mainly interested in throughput as a function of time, or as an average value for the whole workflow?
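
To make the two options concrete, here is a sketch of both quantities computed from per-event timestamps such as those in the MemoryCheck lines. The function names, the 60 s window, and the timestamps are all made up for illustration; no measured data is used.

```python
from datetime import datetime, timedelta


def average_throughput(stamps):
    """Events per second over the whole job, from per-event timestamps."""
    span = (max(stamps) - min(stamps)).total_seconds()
    return (len(stamps) - 1) / span if span > 0 else float("nan")


def windowed_throughput(stamps, window_s=60):
    """Events per second in consecutive fixed-width time windows."""
    start = min(stamps)
    counts = {}
    for stamp in stamps:
        idx = int((stamp - start).total_seconds() // window_s)
        counts[idx] = counts.get(idx, 0) + 1
    n_windows = max(counts) + 1
    # Windows without any event are reported as zero throughput.
    return [counts.get(i, 0) / window_s for i in range(n_windows)]


# Synthetic timestamps for illustration only (not measured data).
t0 = datetime(2021, 11, 26, 15, 35, 0)
stamps = [t0 + timedelta(seconds=s) for s in (0, 10, 20, 30, 70, 110)]

print(average_throughput(stamps))
print(windowed_throughput(stamps, window_s=60))
```

The windowed version would show a slowdown partway through the job that a single average hides. The last window is usually only partly filled, so its value tends to be biased low.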

Our first proposal, once the workflow content is settled, would be to make the same plots as in http://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_12_2_step3_11834.21.html, on the same website.

Would this be along the lines of what you suggested?

jpata commented 2 years ago

I wanted to circle back on this to understand whether workflow 136.889 is realistic to monitor from the ALCA and DQM point of view.

jfernan2 commented 2 years ago

From the DQM point of view, yes. Thanks

slava77 commented 2 years ago

Now that 136.889 is established as a good reference for run-3 prompt reco, it would be nice to update the profiling workflows in the IBs to include it. Currently we are running 136.731 as can be seen e.g. in CMSSW_12_3_X_2022-02-16-1100 pp report

I created cms-sw/cms-bot#1712

slava77 commented 2 years ago

it may be useful to also make the resource piecharts, but it looks like the job/configuration injection there is different. @gartung perhaps this can be discussed in the context of what should/could be monitored regularly.

gartung commented 2 years ago

The 136.731 workflow is only run for the IB Igprof job. It can be added to the profiling job without too much trouble.

jpata commented 2 years ago

+reconstruction

cmsbuild commented 2 years ago

This issue is fully signed and ready to be closed.

jpata commented 2 years ago

@cms-sw/alca-l2 @cms-sw/dqm-l2

In this thread https://cms-talk.web.cern.ch/t/high-memory-usage-in-promptreco-jobs-for-run-352516/11040/13 it was pointed out that a module with a memory leak (fixed in https://github.com/cms-sw/cmssw/pull/38177) was not present in this workflow. The leak was therefore not detected at integration time, only when the jobs were killed by an RSS watchdog at Tier0.
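
For reference, a leak of this kind would typically show up as a steady growth of RSS with the event count. Below is a minimal sketch of a least-squares trend check on per-event RSS values; `rss_slope` is an illustrative name and the data are synthetic. Such a check can only flag a leak if the leaking module is actually part of the profiled workflow.

```python
def rss_slope(rss_mb):
    """Least-squares slope of RSS (MB) versus event index, in MB per event."""
    n = len(rss_mb)
    if n < 2:
        raise ValueError("need at least two RSS samples")
    mean_x = (n - 1) / 2
    mean_y = sum(rss_mb) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(rss_mb))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Synthetic series for illustration only (not measured data).
flat = [7500.0] * 50
leaking = [7500.0 + 2.0 * i for i in range(50)]

print(rss_slope(flat))
print(rss_slope(leaking))
```

A real job also grows during warm-up, so in practice the first events would probably need to be excluded before fitting.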

Should we revise what is running in this workflow from the ALCA+DQM point of view?

EDIT: If yes, ALCA or DQM please open a new issue (to prevent spamming unrelated accounts)

Jetmet commented 2 years ago

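This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply to you as soon as possible after my vacation ends.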

jpata commented 2 years ago

@jfernan2 looks like your comment above https://github.com/cms-sw/cmssw/issues/36282#issuecomment-981823572 tagged a bunch of unrelated accounts.

@smuzaffar is there a way to revert this?

smuzaffar commented 2 years ago

@jpata, I do not know of any way to undo it. Each user can turn off their notifications by clicking the Unsubscribe button.

smuzaffar commented 2 years ago

only thing I can suggest is to open a new issue (duplicate of this) and lock the conversation here :-)

jfernan2 commented 2 years ago

> @jfernan2 looks like your comment above #36282 (comment) tagged a bunch of unrelated accounts.

I am sorry, perhaps I should have used quotation marks for the modules...

tvami commented 2 years ago

> only thing I can suggest is to open a new issue (duplicate of this) and lock the conversation here :-)

I did that and announced it on the cmsTalk https://cms-talk.web.cern.ch/t/high-memory-usage-in-promptreco-jobs-for-run-352516/11040/13