High memory usage in DQM harvesting job for Express

rvenditti commented 2 years ago

DQMHarvesting is exceeding maxPSS in Express reconstruction at T0 in runs 356381, 356615, 356719 link to cmsTalk.

We re-run job on lxplus and they have been completed successfully even if with some warnings (see below)
Looking inside the tarball and local running jobs, it seems that the output of cmsRun shows some warnings in HLTConfigProvider for HLT-EGamma client and at merging step for CTTPS. In order to investigate these warnings two GitHub issues have been opened: https://github.com/cms-sw/cmssw/issues/38969 https://github.com/cms-sw/cmssw/issues/38970
However, looking at a similar issue observed in May, it seems that the warnings are there since long time and are not the root of the problem.
Running the IgProf tool for the memory profiling, it seems that the main memory consumer is the DQMFileSaver DQMFileSaver::saveForOffline function (see /afs/cern.ch/work/r/rosma/public/Run356381/igreport_total.res and /afs/cern.ch/work/r/rosma/public/Run356615/).... but we don't know how much this check is reliable given that Harvesting is not running on events. Can software expert have a look and give us some suggestions? @makortel

cmsbuild commented 2 years ago

A new Issue was created by @rvenditti .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

davidlange6 commented 2 years ago

could you post the igprof results please?

davidlange6 commented 2 years ago

(eg, the html version is more easily understandable)

makortel commented 2 years ago

assign dqm

cmsbuild commented 2 years ago

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

rvenditti commented 2 years ago

Hi @davidlange6 , unfortunately i am not able to forward the browsable version to my cern site. However the related sql3 file is here: /eos/user/r/rosma/www/cgi-bin/data/igreport_perf.sql3

makortel commented 2 years ago

I took a quick look on /afs/cern.ch/work/r/rosma/public/Run356381/igreport_total.res and I see the profile reports MEM_TOTAL, i.e. total memory allocations. MEM_LIVE (that shows the memory usage at a given time) would be more useful. It seems to me that IgProfService does not have dump points for framework callbacks that would be most useful for harvesting job. I'll add some and profile myself too.

On the other hand, I'm not sure how useful the IgProf profile is for a harvesting job. We know that the job processes histograms, but IgProf profile won't tell much about what histograms exactly take a lot of memory (beyond the type). Have you checked the histogram sizes from a DQMIO file? I think that would be a useful number, even if DQMRootSource should merge the histograms on each file open, and e.g. in https://cms-talk.web.cern.ch/t/re-2018-replay-for-pps-pcl-test-dqm-expresmergewrite-memory-too-high/6114/7 the memory problem was clearly in the output.

makortel commented 2 years ago

One thing I noticed is the job is awfully slow under igprof -mp ... cmsRunGlibC. While restarting my memory profile after copying the input files for run 356615 (for the run 356381 the files were already gone) I profiled the CPU time consumption. Full profile is here https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf Of the 1623 seconds of total

TTree::GetEntry() takes 526 s (32 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/18
Histogram merging takes 290 s (18 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/35
- Main contribution is from TProfile, 252 s (16 %)
edm::one::EDProducerBase::doEndLuminosityBlock() takes 612 s (38 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/31, largest contributions
- GEMDQMHarvester 287 s (18 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/36
- SiStripQualityChecker 68 s (4 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/75
- QualityTester 54 s (3 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/82
DQMEDHarvester::endProcessBlockProduce() takes 20 s (1 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/134
DQMFileSaver::saveForOffline() takes 17 s (1 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/147

Of these the TProfile merging and GEMDQMHarvester could be something worth of taking a look (even if strictly speaking outside of the scope of this issue).

makortel commented 2 years ago

A MEM_LIVE profile (dumped at postGlobalEndRun) points to GEMDQMHarvester::dqmEndLuminosityBlock() taking 610 MB https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/preEndRun.mem.356615/25 . Nearly all of that appears to be in std::maps in GEMDQMHarvester::createTableWatchingSummary() https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/preEndRun.mem.356615/27.

The maps are initialized in https://github.com/cms-sw/cmssw/blob/bbffe914531dce497ea0f4576ab7509bee4a7384/DQM/GEM/plugins/GEMDQMHarvester.cc#L282-L298

The profile indicates

mapStatusChambersSummary_ and mapNumStatusChambersSummary_ have 160,380 elements
mapStatusVFATsSummary_ and mapNumStatusVFATsSummary_ have 3,849,120 elements

I'm afraid the GEMDQMHarvester::drawSummaryHistogram() needs to be redesigned.

makortel commented 2 years ago

FYI @cms-sw/gem-dpg-l2

jshlee commented 2 years ago

@slowmoyang @quark2 can you take a look at this.

mmusich commented 2 years ago

@jshlee @slowmoyang @quark2 any update on this issue?

quark2 commented 2 years ago

Hi @mmusich,

Sorry for my late reply, I'm working on it, but I was sick, so the task was delayed... I'll make a fix soon.

Best regards, Byeonghak Ko

mmusich commented 2 years ago

Hi @quark2, there have been already 5 instances of the issue so far - if I count correctly, can you clarify the timeline for the fix? In case it cannot arrive by today, such that a patch release can be built tomorrow after the ORP meeting (FYI @perrotta @qliphy), can we consider disabling in production the offending module until a more proper fix is found out? I think we can do that by commenting these lines:

https://github.com/cms-sw/cmssw/blob/4c7b44b77f2ae84e06b6eb717f5822e1f19b6461/DQMOffline/Configuration/python/DQMOffline_SecondStep_cff.py#L34-L37

please clarify. Thanks,

Marco (ORM)

quark2 commented 2 years ago

Hi @mmusich,

I made a fix, but I have no idea how to measure the memory consumption with the fix. Could you provide the way? Once the fix works fine, I'll make a PR asap.

Best regards, Byeonghak Ko

mmusich commented 2 years ago

@quark2 the best way would be to run the igProf profiler and there are some instructions here: https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof. Another (quicker) way, is to just print the RSS used by cmsRun when executing the harvesting and check that it should be substantially lower than before.

quark2 commented 2 years ago

Thanks, I'll use it!

quark2 commented 2 years ago

Hi,

I've made a fix and its backport to 12_4_X. I've seen the reduction of the RSS. (Although it was still 1.9 GB; still too large?)

mmusich commented 2 years ago

I've seen the reduction of the RSS.

How much is it reduced? Matti pointed above that GEMDQMHarvester::dqmEndLuminosityBlock() is taking 610 MB. Do you have a profiling before / after the change? Is 1.9GB only due to GEM or overall by the harvesting job?

quark2 commented 2 years ago

The amount 1.9 GB is overall by the harvesting job. The RSS before the fix is 2.8 GB, so about 900 MB is saved.

mmusich commented 2 years ago

The RSS before the fix is 2.8 GB, so about 900 MB is saved.

Thanks, can you also provide what would the number in case the harvesting job is executed without GEMDQMHarvester at all? This would hep to set the scale of what is reasonable.

quark2 commented 2 years ago

I executed it without GEMDQMHarvester, and the maximum RSS was also 1.9 GB. I'm not sure if it is reasonable, but at least I can say that GEMDQMHarvester consumes basically small memory (after the issue is fixed).

quark2 commented 2 years ago

Btw, does anyone know how to reproduce DQMIO files in local, especially root://cms-xrd-global.cern.ch//store/express/Run2022C/StreamExpress/DQMIO/Express-v1/000/356/381/00000/0114A6C8-C901-4386-94D4-F0386E636CA0.root?

mmusich commented 2 years ago

Btw, does anyone know how to reproduce DQMIO files in local, especially root://cms-xrd-global.cern.ch//store/express/Run2022C/StreamExpress/DQMIO/Express-v1/000/356/381/00000/0114A6C8-C901-4386-94D4-F0386E636CA0.root?

you would need to run the express reco from the streamer files of 356381, which I strongly suspect are already gone.

quark2 commented 2 years ago

I see... I need to look at the geometry part more, and reproducing that file should give a good hint.

mmusich commented 2 years ago

I see... I need to look at the geometry part more, and reproducing that file should give a good hint.

There have been recently failures of the same type e.g. for run 357479.

I think the streamers for that file are still available:

$ eos ls /store/t0streamer/Data/Express/000/357/479 | grep .dat | wc -l
1044

you would need to generate the configuration via:

python RunExpressProcessing.py --scenario ppEra_Run3 --lfn /store/t0streamer/Data/Express/000/357/479/run357479_ls1012_streamExpress_StorageManager.dat --global-tag 101X_dataRun2_Express_v7 --fevt --dqmio

manually modifying the output configuration

process.maxEvents = cms.untracked.PSet(
    input = cms.untracked.int32(-1)
)
process.source = cms.Source("NewEventStreamFileReader",
    fileNames = cms.untracked.vstring('<put here all your input streamer files>')
)
process.options = cms.untracked.PSet(
    numberOfStreams = cms.untracked.uint32(0),
    numberOfThreads = cms.untracked.uint32(8)
)

to get the correct source. Alternatively I think you can try to use a cmsDriver command mimicking what Tier-0 does for express, starting with RAW data as input, but that would require me more time to cook. Perhaps @cms-sw/dqm-l2 can help here as well. HTH

mmusich commented 2 years ago

@makortel we tried to run the igprof profiler with @sarafiorendi and we got these:

MEM_TOTAL:
- https://fiorendi.web.cern.ch/fiorendi/cgi-bin/igprof-navigator/original/igreport_total/ (vanilla)
- https://fiorendi.web.cern.ch/fiorendi/cgi-bin/igprof-navigator/PR39061/igreport_total/ (PR #39061)
MEM_LIVE:
- https://fiorendi.web.cern.ch/fiorendi/cgi-bin/igprof-navigator/original/igreport_memlive/ (vanilla)
- https://fiorendi.web.cern.ch/fiorendi/cgi-bin/igprof-navigator/PR39061/igreport_live/ (PR #39061)
but we don't observe a substantial reduction from the GEM DQM code. This seems at odds with the observation from @quark2 https://github.com/cms-sw/cmssw/issues/38976#issuecomment-1214889830 which is somewhat in agreement with my profile posted at https://github.com/cms-sw/cmssw/pull/39061#issuecomment-1216447394. Are we missing something in the recipe? We followed https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof

sarafiorendi commented 2 years ago

@makortel @mmusich sorry my bad, I updated the results with the PR integrated (same link as above), they show (as far as I understand) some effective reduction

mmusich commented 2 years ago

sorry my bad, I updated the results with the PR integrated (same link as above), they show (as far as I understand) some effective reduction

thanks. Still I am not sure to understand why there is no contribution from GEMDQMHarvester in MEM_LIVE.

makortel commented 2 years ago

Are we missing something in the recipe? We followed https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof

@mmusich @sarafiorendi Was this the part of the recipe you followed for the MEM_LIVE profile?

igprof -d -t cmsRunGlibC -mp cmsRunGlibC a.py >& a.log

igprof-analyse --sqlite -v --demangle --gdb -r MEM_LIVE IgProf.1.gz > ig.1.txt

sarafiorendi commented 2 years ago

this are the commands I ran

igprof -d -t cmsRunGlibC -mp cmsRunGlibC dump_cfg.py  > & PR_30files.log
igprof-analyse --sqlite -d -v -g -r MEM_LIVE igprof.cmsRunGlibC.5556.1660650465.673794.gz | sqlite3 igreport_live.sql3

as from https://igprof.org/analysis.html#:~:text=Memory%20profiling%20reports

so maybe the second one is not correct (?)

makortel commented 2 years ago

Thanks. I'm a bit confused what dump exactly igprof.cmsRunGlibC.5556.1660650465.673794.gz corresponds to, since the Validation.Performance.IgProfInfo.customise https://github.com/cms-sw/cmssw/blob/7654b671ce6f103ca900252d8879e1a9858f3f18/Validation/Performance/python/IgProfInfo.py#L5-L9 causes a profile to be dumped after some events with the event entry number in the file name. Maybe it's the "end of process" dump, given that the recipe does not specify an output file (-o). In that case the MEM_LIVE would report memory leaks (which is not particularly relevant for this problem).

To see the memory usage of GEMDQMHarvester as in https://github.com/cms-sw/cmssw/issues/38976#issuecomment-1208610112 you can add a profile dump point to globalEndRun transition with

process.IgProfService.reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz")

(where %R gets substituted with the run number) and convert that into the sqlite3 file.

mmusich commented 1 year ago

Just for the record, this is keeping happening also in 2023, despite an increase of the memory allowance for harvesting jobs at Tier-0. A recent paused job occurred in Express_Run367696_StreamExpress (logs available here)

makortel commented 1 year ago

Is there a recipe to reproduce? (the link wasn't clear for me)

gpetruc commented 1 year ago

Hi,

tar xzvf /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023C/MemoryExpress/7c897736-9d50-4d3a-8e7b-da1e3c8d796b-Periodic-Harvest-1-3-logArchive.tar.gz
cd job/WMTaskSpace/cmsRun1
cmsRun PSet.py

The job has 598 files to process, when runnign the memory slowly increasing as the files are opened, and then increasing much faster at the end. It starts off at ~1.4GB, it's 1.7GB after 500 file but then jumps up. Logging every few seconds the number of opened files and the memory usage in kB and got

files 572 mem 1718344
files 575 mem 1718344
files 577 mem 1718344
files 580 mem 1718344
files 580 mem 1718344
files 582 mem 1833032
files 585 mem 1833032
files 587 mem 1833032
files 589 mem 1833032
files 592 mem 1833032
files 594 mem 1833032
files 597 mem 1833032
files 598 mem 1968200
files 598 mem 2586696
files 598 mem 2586696
...

And then there's lots of LogErrors in the form

%MSG-e MergeFailure:  source 20-May-2023 12:58:16 CEST PostBeginProcessBlock
Found histograms with different axis limits or different labels 'ROCs hits multiplicity per event vs LS' not merged.
%MSG

and the memory usage grows furtherat least up to 2859080 (the job is still running...)

I'm testing also igprof with the command lines and reportToFileAtPostEndRun suggested above

gpetruc commented 1 year ago

If the job is let to run, the memory usage climbs all the way to 5 GB.

I didn't manage to get anything out of igprof, despite trying to set in process.IgProfService

     reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz"),
     reportToFileAtPostOpenFile = cms.untracked.string("| gzip -c > IgProf.fileOpen.%F.gz"),
     reportToFileAtPostCloseFile = cms.untracked.string("| gzip -c > IgProf.fileClose.%C.gz")

I don't get any of those dumps. I also tried to get some dumps from jemalloc but with little success

I got some dumps from valgrind massif, which seem to imply that it's mostly histograms and associated root stuff (TLists, THashLists, ...), maybe as a consequence of the failed merging the job saves a lot of duplicate histograms?

If anyone has more up-to-date instructions to profile the heap I can try them.

For those interested in reproducing the problem more quickly, limiting the job to 5-10 files it runs much more quickly (few minutes) but still exhibits a large memory growth at the end.

makortel commented 1 year ago

The IgProf memory profiling setup has, unfortunately, become broken.

@gartung Could you remind us the recipe to use jemalloc's profiling that works in 13_0_X (i.e. without https://github.com/cms-sw/cmssw/pull/40899)?

I'll try out VTune.

In the meantime, could @cms-sw/dqm-l2 check that the set of histograms in the merged files looks reasonable?

davidlange6 commented 1 year ago

From a log it seems the input files are all held open until the end. Is that expected?

makortel commented 1 year ago

From a log it seems the input files are all held open until the end. Is that expected?

That is how DQMRootSource has been coded https://github.com/cms-sw/cmssw/blob/2334fc966c47ee08c48487fce4eaa80384a32904/DQMServices/FwkIO/plugins/DQMRootSource.cc#L474-L480 (since https://github.com/cms-sw/cmssw/pull/28622, IIUC; another PR related to DQMRootSource memory usage https://github.com/cms-sw/cmssw/pull/30889)

gartung commented 1 year ago

The IgProf memory profiling setup has, unfortunately, become broken.

@gartung Could you remind us the recipe to use jemalloc's profiling that works in 13_0_X (i.e. without #40899)?

I'll try out VTune.

In the meantime, could @cms-sw/dqm-l2 check that the set of histograms in the merged files looks reasonable?

Besides setting up the version of the jemalloc with profiling enabled scram setup jemalloc-prof scram b ToolUpdated you need to set an environment variable MALLOC_CONF=prof_leak:true,lg_prof_sample:10,prof_final:true cmsRun config.py

davidlange6 commented 1 year ago

Hum. So some growth with run length is expected... Seems we (Chris) fixed this already once long ago

On May 22, 2023 5:13 PM, Matti Kortelainen @.***> wrote:

From a log it seems the input files are all held open until the end. Is that expected?

That is how DQMRootSource has been coded https://github.com/cms-sw/cmssw/blob/2334fc966c47ee08c48487fce4eaa80384a32904/DQMServices/FwkIO/plugins/DQMRootSource.cc#L474-L480 (since #28622https://github.com/cms-sw/cmssw/pull/28622, IIUC)

— Reply to this email directly, view it on GitHubhttps://github.com/cms-sw/cmssw/issues/38976#issuecomment-1557402829, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABGPFQ56GKOPMS3N7S46RILXHN7ATANCNFSM55WFIVFQ. You are receiving this because you were mentioned.Message ID: @.***>

makortel commented 1 year ago

VTune wasn't very useful, but I managed to get the following total sizes by type

TH1F: 134.2 MB
TH2F: 369.1 MB
TProfile2D: 234.9 MB

In addition, TBuffer constructor (via TBufferFile constructor) 377.5 MB. These don't sound outrageously high to me (even if TH2F and TProfile2D are high-ish).

mmusich commented 1 month ago

For the record, a new slimmed DQM sequence was defined in https://github.com/cms-sw/cmssw/pull/41944 (backported to https://github.com/cms-sw/cmssw/pull/42018 for data-taking operations). This was deployed in era Run2023D via https://github.com/dmwm/T0/pull/4844 (Tier0 replay: https://github.com/dmwm/T0/pull/4847).

Despite this improvement, we are now experiencing again high memory usage in DQM harvesting jobs for express in Tier0. Two recent cases [*]:

Tarballs can be found at /eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/vocms0314.cern.ch-3498656-3-log.tar.gz

@cms-sw/dqm-l2 can you have a look?

[*]
Incidentally both of them contain this concurrent cmssw exception in the logs:

 ----- Begin Fatal Exception 01-Sep-2024 14:13:09 UTC-----------------------
An exception of category 'InvalidCall' occurred while
   [0] Processing end ProcessBlock
   [1] Calling method for module EcalMEFormatter/'ecalMEFormatter'
Exception Message:
Electronics Mapping not initialized
----- End Fatal Exception -------------------------------------------------

which I find somewhat peculiar.

cmsbuild commented 1 month ago

cms-bot internal usage

makortel commented 1 month ago

Incidentally both of them contain this concurrent cmssw exception in the logs:

 ----- Begin Fatal Exception 01-Sep-2024 14:13:09 UTC-----------------------
An exception of category 'InvalidCall' occurred while
   [0] Processing end ProcessBlock
   [1] Calling method for module EcalMEFormatter/'ecalMEFormatter'
Exception Message:
Electronics Mapping not initialized
----- End Fatal Exception -------------------------------------------------

Assuming the exception being thrown originates from https://github.com/cms-sw/cmssw/blob/35487070be8fb31af15d56fd6c4d0aa8d2205ef0/DQM/EcalCommon/src/DQWorker.cc#L118-L124 I see the edso_.electronicsMap is set in https://github.com/cms-sw/cmssw/blob/35487070be8fb31af15d56fd6c4d0aa8d2205ef0/DQM/EcalCommon/src/DQWorker.cc#L111-L116 (note that this pattern of taking an address of an EventSetup data product in endLumi and de-referencing the pointer in endProcessBlock is not supported and not guaranteed to work @cms-sw/dqm-l2 @cms-sw/ecal-dpg-l2)

In the log I see a printout

%MSG-s ShutdownSignal:  AfterSource 01-Sep-2024 14:12:54 UTC  PostBeginProcessBlock
an external signal was sent to shutdown the job early.
%MSG

I guess the job didn't even start processing LuminosityBlocks (although I see begin/end Run messages in the log), and then the EcalMEFormatter threw the exception in endProcessBlock. Throwing an exception because no events/lumis/runs were processed is generally bad behavior.

davidlange6 commented 1 month ago

Is there any monitoring of memory requirements per release (slimmed things tend to grow…)?

On Sep 2, 2024, at 7:54 AM, Marco Musich @.***> wrote:

For the record, a new slimmed DQM sequence was defined in #41944https://github.com/cms-sw/cmssw/pull/41944 (backported to #42018https://github.com/cms-sw/cmssw/pull/42018 for data-taking operations). This was deployed in era Run2023D via dmwm/T0#4844https://github.com/dmwm/T0/pull/4844 (Tier0 replay: dmwm/T0#4847https://github.com/dmwm/T0/pull/4847).

Despite this improvement, we are now experiencing again high memory usage in DQM harvesting jobs for express in Tier0. Two recent cases [*]:

Tarballs can be found at /eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/vocms0314.cern.ch-3498656-3-log.tar.gz

@cms-sw/dqm-l2https://github.com/orgs/cms-sw/teams/dqm-l2 can you have a look?

[*] Incidentally both of them contain this concurrent cmssw exception in the logs:

----- Begin Fatal Exception 01-Sep-2024 14:13:09 UTC----------------------- An exception of category 'InvalidCall' occurred while [0] Processing end ProcessBlock [1] Calling method for module EcalMEFormatter/'ecalMEFormatter' Exception Message: Electronics Mapping not initialized ----- End Fatal Exception -------------------------------------------------

which I find somewhat peculiar.

— Reply to this email directly, view it on GitHubhttps://github.com/cms-sw/cmssw/issues/38976#issuecomment-2324928876, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABGPFQ27FRVM7PCRHBXK6UDZUR33PAVCNFSM6AAAAABNQPYEAGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRUHEZDQOBXGY. You are receiving this because you were mentioned.Message ID: @.***>

mmusich commented 1 month ago

I guess the job didn't even start processing LuminosityBlocks (although I see begin/end Run messages in the log), and then the EcalMEFormatter threw the exception in endProcessBlock. Throwing an exception because no events/lumis/runs were processed is generally bad behavior.

Thanks for this analysis!

Indeed adding this basic test:

diff --git a/DQM/EcalCommon/test/BuildFile.xml b/DQM/EcalCommon/test/BuildFile.xml
new file mode 100644
index 00000000000..67518e7f759
--- /dev/null
+++ b/DQM/EcalCommon/test/BuildFile.xml
@@ -0,0 +1,4 @@
+<bin file="testEcalCommon.cc" name="testEcalCommon">
+    <use name="FWCore/TestProcessor"/>
+    <use name="catch2"/>
+  </bin>
diff --git a/DQM/EcalCommon/test/testEcalCommon.cc b/DQM/EcalCommon/test/testEcalCommon.cc
new file mode 100644
index 00000000000..d4eb9db6a08
--- /dev/null
+++ b/DQM/EcalCommon/test/testEcalCommon.cc
@@ -0,0 +1,43 @@
+#include "FWCore/TestProcessor/interface/TestProcessor.h"
+#include "FWCore/Utilities/interface/Exception.h"
+#include "FWCore/ServiceRegistry/interface/Service.h"
+
+#define CATCH_CONFIG_MAIN
+#include "catch.hpp"
+
+// Function to run the catch2 tests
+//___________________________________________________________________________________________
+void runTestForAnalyzer(const std::string& baseConfig, const std::string& analyzerName) {
+  edm::test::TestProcessor::Config config{baseConfig};
+
+  SECTION(analyzerName + " base configuration is OK") { REQUIRE_NOTHROW(edm::test::TestProcessor(config)); }
+
+  SECTION("Run with no LuminosityBlocks") {
+    edm::test::TestProcessor tester(config);
+    REQUIRE_NOTHROW(tester.testRunWithNoLuminosityBlocks());
+  }
+}
+
+// Function to generate base configuration string
+//___________________________________________________________________________________________
+std::string generateBaseConfig(const std::string& cfiName, const std::string& analyzerName) {
+  // Define a raw string literal
+  constexpr const char* rawString = R"_(from FWCore.TestProcessor.TestProcess import *
+from DQM.EcalCommon.{}_cfi import {}
+process = TestProcess()
+process.harvester = {}
+process.moduleToTest(process.harvester)
+process.add_(cms.Service('MessageLogger'))
+process.add_(cms.Service('JobReportService'))
+process.add_(cms.Service('DQMStore'))
+    )_";
+
+  // Format the raw string literal using fmt::format
+  return fmt::format(rawString, cfiName, analyzerName, analyzerName);
+}
+
+//___________________________________________________________________________________________
+TEST_CASE("EcalMEFormatter tests", "[EcalMEFormatter]") {
+  const std::string baseConfig = generateBaseConfig("EcalMEFormatter", "ecalMEFormatter");
+  runTestForAnalyzer(baseConfig, "EcalMEFormatter");
+}

results in:

===== Test "testEcalCommon" ====
----- Begin Fatal Exception 04-Sep-2024 03:46:28 CEST-----------------------
An exception of category 'InvalidCall' occurred while
   [0] Processing end ProcessBlock
   [1] Calling method for module EcalMEFormatter/'harvester'
Exception Message:
Electronics Mapping not initialized
----- End Fatal Exception -------------------------------------------------

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
testEcalCommon is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
EcalMEFormatter tests
  Run with no LuminosityBlocks
-------------------------------------------------------------------------------
src/DQM/EcalCommon/test/testEcalCommon.cc:31
...............................................................................

src/DQM/EcalCommon/test/testEcalCommon.cc:33: FAILED:
  REQUIRE_NOTHROW( tester.testRunWithNoLuminosityBlocks() )
due to unexpected exception with message:
  An exception of category 'InvalidCall' occurred while
     [0] Processing end ProcessBlock
     [1] Calling method for module EcalMEFormatter/'harvester'
  Exception Message:
  Electronics Mapping not initialized

===============================================================================
test cases: 1 | 1 failed
assertions: 2 | 1 passed | 1 failed

---> test testEcalCommon had ERRORS
TestTime:2
^^^^ End Test testEcalCommon ^^^^

@cms-sw/ecal-dpg-l2 FYI

mmusich commented 1 month ago

Is there any monitoring of memory requirements per release (slimmed things tend to grow…)?

not that I am aware of.

malbouis commented 2 weeks ago

There seems to be another failure of this kind in run number 385889. Here's the log: https://cmst0.web.cern.ch/CMST0/tier0/pausedJobs/data/vocms0313/wmagentJob_jobid988223.txt

kyungminparkdrums commented 1 week ago

Hi, we'll have a look at this issue -- Kyungmin on behalf of ECAL DQM team.

cms-sw / cmssw

High memory usage in DQM harvesting job for Express #38976