cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Profiling at T0 AlCa and DQM workflows #38198

Open tvami opened 2 years ago

tvami commented 2 years ago

Follow-up to issue 36282 and the cmsTalk thread https://cms-talk.web.cern.ch/t/high-memory-usage-in-promptreco-jobs-for-run-352516/11040

The issue is that the workflow (wf) chosen in GitHub issue 36282 is based on the MET dataset, so AlCaHcalHBHEMuonProducer is not run on it. (It is attached to the MinBias and SingleMuon PDs.)

This is a general issue for testing: certain ALCARECOs belong to certain PDs (as defined in the AlCaRECO matrix). So we either run the tests on several wfs, or pick the one wf that has the most ALCARECOs connected to it. That would be SingleMuon.
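To illustrate the "pick the PD with the most ALCARECOs" option, here is a minimal sketch. The dictionary is a toy stand-in for the AlCaRECO matrix: the SingleMuon row is the list quoted later in this thread, while the MinBias and MET rows are incomplete placeholders and the helper names (`best_pd`, `uncovered`) are hypothetical.

```python
# Toy AlCaRECO matrix: primary dataset (PD) -> ALCARECOs attached to it.
# Only the SingleMuon row is taken from this thread; the MinBias and MET
# rows are placeholders (assumptions), NOT the real matrix contents.
alcareco_matrix = {
    "SingleMuon": {
        "TkAlMuonIsolated",
        "HcalCalIterativePhiSym",
        "MuAlCalIsolatedMu",
        "HcalCalHO",
        "HcalCalHBHEMuonProducerFilter",
        "SiPixelCalSingleMuonLoose",
        "SiPixelCalSingleMuonTight",
    },
    "MinBias": {"HcalCalHBHEMuonProducerFilter"},  # placeholder row
    "MET": set(),                                   # placeholder row
}


def best_pd(matrix):
    """Return the PD with the largest number of attached ALCARECOs."""
    return max(matrix, key=lambda pd: len(matrix[pd]))


def uncovered(matrix, pd):
    """Return the ALCARECOs that a wf based on `pd` alone would not exercise."""
    everything = set().union(*matrix.values())
    return everything - matrix[pd]


chosen = best_pd(alcareco_matrix)
print(chosen, sorted(uncovered(alcareco_matrix, chosen)))
```

With the real matrix, a non-empty `uncovered` result would list the ALCARECOs that still need a second wf.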

If that's the preferred solution, we can set up a new wf after the Run3 single muon PD is done (next week?)

cmsbuild commented 2 years ago

A new Issue was created by @tvami Tamas Vami.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

tvami commented 2 years ago

assign alca,reconstruction

cmsbuild commented 2 years ago

New categories assigned: dqm,alca

@jfernan2,@ahmad3213,@yuanchao,@micsucmed,@rvenditti,@emanueleusai,@francescobrivio,@malbouis,@tvami,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 2 years ago

New categories assigned: reconstruction

@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 2 years ago

How many events are processed for the plots in http://cms-reco-profiling.web.cern.ch/cms-reco-profiling/results/summary_plot_html/CMSSW_12_4_step3_136.889.html ? (I'm confused by the x axis.)

jpata commented 2 years ago

It's 5k events: https://github.com/cms-sw/cms-bot/blob/master/reco_profiling/profileRunner.py#L95.

The plot has event IDs on the x axis. We need to change that (cc @xoqhdgh1002) to be just event counts.

makortel commented 2 years ago

> It's 5k events:

Thanks, good. That should be sufficient for this particular leak (or even a smaller one) to be visible: it would have reached ~1.2 GB after 5k events.
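As a rough check of that estimate, the sketch below converts the quoted total into a per-event growth rate. It assumes the leak grows linearly with the number of events and uses decimal units (1 GB = 1e9 bytes); the function name and the 100 MB visibility threshold are hypothetical, chosen only for illustration.

```python
import math


def leak_per_event(total_leak_bytes, n_events):
    """Average memory growth per event, assuming a linear leak."""
    return total_leak_bytes / n_events


total_leak = 1.2e9  # ~1.2 GB after the full job, as quoted in this thread
n_events = 5000     # events processed by the profiling job

per_event = leak_per_event(total_leak, n_events)
print(f"{per_event / 1e3:.0f} kB/event")

# Hypothetical threshold: events needed before the growth exceeds 100 MB.
threshold = 100e6
events_needed = math.ceil(threshold / per_event)
print(events_needed)
```

Under these assumptions the growth is about 240 kB per event, so the leak would stand out well before the end of a 5k-event job.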

jpata commented 2 years ago

@tvami, should we take any action here? Is there a new workflow we should switch to that would be more representative?

tvami commented 2 years ago

Hi @jpata, we could use a run from this Monday. It will likely have limited stats, although if the tests only go up to 5000 events, that could probably be reached.

jpata commented 2 years ago

We test about 5k now, so it wouldn't be a big change. Is there a workflow we can try, so you can see whether the results are useful for ALCA?

tvami commented 2 years ago

@tocheng is going to create a new wf that you can use. He promised to look at this tomorrow.

tvami commented 2 years ago

> He promised to look at this tomorrow.

@tocheng do you have any updates?

tocheng commented 2 years ago

@tvami Please let me know if this is what is needed. https://github.com/cms-sw/cmssw/compare/master...tocheng:ALCA_PCL_Run3_CMSSW_12_5_X?expand=1

tvami commented 2 years ago

Hi @tocheng these are the ALCARECOs in the alcareco matrix connected to the single muon:

SingleMuon: TkAlMuonIsolated, HcalCalIterativePhiSym, MuAlCalIsolatedMu, HcalCalHO, HcalCalHBHEMuonProducerFilter, SiPixelCalSingleMuonLoose, SiPixelCalSingleMuonTight

I think you missed some of them, please add those! Thanks!
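A quick way to spot the missing entries is a set difference against the list above. In this sketch the `expected` set is the SingleMuon row quoted in this comment, while `configured` is a hypothetical example of what a wf might currently enable; it is not the actual content of the linked branch.

```python
# ALCARECOs attached to SingleMuon in the alcareco matrix (from this comment).
expected = {
    "TkAlMuonIsolated",
    "HcalCalIterativePhiSym",
    "MuAlCalIsolatedMu",
    "HcalCalHO",
    "HcalCalHBHEMuonProducerFilter",
    "SiPixelCalSingleMuonLoose",
    "SiPixelCalSingleMuonTight",
}

# Hypothetical list enabled in a wf under review (illustration only).
configured = {
    "TkAlMuonIsolated",
    "MuAlCalIsolatedMu",
    "HcalCalHO",
}

missing = sorted(expected - configured)      # in the matrix but not in the wf
unexpected = sorted(configured - expected)   # in the wf but not in the matrix

print("missing:", ", ".join(missing))
print("unexpected:", ", ".join(unexpected) or "none")
```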

tvami commented 2 years ago

And maybe we could add another one, a purely technical wf that adds all the ALCARECOs to the MinBias PD... this would of course be physically incorrect, but it would test everything under one wf...

tvami commented 2 years ago

@tocheng please submit the PR. At this point we have good Run-3, 13.6 TeV input data.

tvami commented 2 years ago

@tocheng ?

francescobrivio commented 2 years ago

Being addressed in https://github.com/cms-sw/cmssw/pull/38681

tvami commented 2 years ago

+alca

tvami commented 2 years ago

@jpata can you please take over from that? Thanks!

jpata commented 2 years ago

Thanks! Which of the two new workflows should we use instead of 136.889? From the reco point of view they are all equivalent, so the question is which one has the most representative ALCA configuration.

Note that on the reco side, we basically submit and analyze this 8-threaded profiling job "by hand" for each prerelease - so we don't have the personpower to study a large number of workflows per release at this time.

tvami commented 2 years ago

I think you can go ahead with 1001.3, thanks!

tvami commented 2 years ago

hi @jpata do you have any update on this? thanks!

tvami commented 1 year ago

hi @cms-sw/reconstruction-l2 did this happen in the end?

tvami commented 1 year ago

@clacaputo @mandrenguyen hi guys, do you know if 1001.3 is being profiled after all?