Open rvenditti opened 2 years ago
A new Issue was created by @rvenditti .
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
could you post the igprof results please?
(e.g., the HTML version is more easily understandable)
assign dqm
New categories assigned: dqm
@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks
Hi @davidlange6, unfortunately I am not able to forward the browsable version to my CERN site. However, the related sql3 file is here: /eos/user/r/rosma/www/cgi-bin/data/igreport_perf.sql3
I took a quick look at /afs/cern.ch/work/r/rosma/public/Run356381/igreport_total.res and I see the profile reports `MEM_TOTAL`, i.e. total memory allocations. `MEM_LIVE` (which shows the memory usage at a given time) would be more useful. It seems to me that `IgProfService` does not have dump points for the framework callbacks that would be most useful for a harvesting job. I'll add some and profile myself too.
On the other hand, I'm not sure how useful the IgProf profile is for a harvesting job. We know that the job processes histograms, but the IgProf profile won't tell much about which histograms exactly take a lot of memory (beyond the type). Have you checked the histogram sizes from a DQMIO file? I think that would be a useful number, even if `DQMRootSource` should merge the histograms on each file open, and e.g. in https://cms-talk.web.cern.ch/t/re-2018-replay-for-pps-pcl-test-dqm-expresmergewrite-memory-too-high/6114/7 the memory problem was clearly in the output.
One thing I noticed is that the job is awfully slow under `igprof -mp ... cmsRunGlibC`. While restarting my memory profile after copying the input files for run 356615 (for run 356381 the files were already gone), I profiled the CPU time consumption. The full profile is here
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf
Of the 1623 seconds of total:
- `TTree::GetEntry()` takes 526 s (32 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/18, `TProfile` 252 s (16 %)
- `edm::one::EDProducerBase::doEndLuminosityBlock()` takes 612 s (38 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/31, largest contributions:
  - `GEMDQMHarvester` 287 s (18 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/36
  - `SiStripQualityChecker` 68 s (4 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/75
  - `QualityTester` 54 s (3 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/82
- `DQMEDHarvester::endProcessBlockProduce()` takes 20 s (1 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/134
- `DQMFileSaver::saveForOffline()` takes 17 s (1 %) https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/perf/147
Of these, the `TProfile` merging and `GEMDQMHarvester` could be worth taking a look at (even if strictly speaking outside the scope of this issue).
A `MEM_LIVE` profile (dumped at `postGlobalEndRun`) points to `GEMDQMHarvester::dqmEndLuminosityBlock()` taking 610 MB https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/preEndRun.mem.356615/25 . Nearly all of that appears to be in `std::map`s in `GEMDQMHarvester::createTableWatchingSummary()` https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue38976/run356615/preEndRun.mem.356615/27.
The maps are initialized in https://github.com/cms-sw/cmssw/blob/bbffe914531dce497ea0f4576ab7509bee4a7384/DQM/GEM/plugins/GEMDQMHarvester.cc#L282-L298
The profile indicates
- `mapStatusChambersSummary_` and `mapNumStatusChambersSummary_` have 160,380 elements
- `mapStatusVFATsSummary_` and `mapNumStatusVFATsSummary_` have 3,849,120 elements
I'm afraid the `GEMDQMHarvester::drawSummaryHistogram()` needs to be redesigned.
FYI @cms-sw/gem-dpg-l2
@slowmoyang @quark2 can you take a look at this.
@jshlee @slowmoyang @quark2 any update on this issue?
Hi @mmusich,
Sorry for my late reply, I'm working on it, but I was sick, so the task was delayed... I'll make a fix soon.
Best regards, Byeonghak Ko
Hi @quark2, there have already been 5 instances of the issue so far (if I count correctly). Can you clarify the timeline for the fix? In case it cannot arrive by today, such that a patch release can be built tomorrow after the ORP meeting (FYI @perrotta @qliphy), can we consider disabling the offending module in production until a more proper fix is found? I think we can do that by commenting these lines:
please clarify. Thanks,
Marco (ORM)
Hi @mmusich,
I made a fix, but I have no idea how to measure the memory consumption with it. Could you suggest a way? Once the fix works fine, I'll make a PR asap.
Best regards, Byeonghak Ko
@quark2 the best way would be to run the IgProf profiler; there are some instructions here: https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof. Another (quicker) way is to just print the RSS used by cmsRun when executing the harvesting, and check that it is substantially lower than before.
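For the quicker RSS check, a small polling script can be used. The sketch below is an assumption on my part (not part of any official recipe): it reads the `VmRSS` field from `/proc/<pid>/status` on Linux while the `cmsRun` job is running.

```python
#!/usr/bin/env python3
"""Poll the resident set size (VmRSS) of a running process on Linux.

Hypothetical helper for the quick check suggested above: pass it the PID
of the cmsRun process executing the harvesting configuration.
"""
import sys
import time


def read_rss_kb(pid):
    """Return the VmRSS of `pid` in kB, parsed from /proc/<pid>/status,
    or None if the process has exited or the field is absent."""
    try:
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    # Line format: "VmRSS:   1718344 kB"
                    return int(line.split()[1])
    except OSError:
        return None
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    target = int(sys.argv[1])
    while True:
        rss = read_rss_kb(target)
        if rss is None:
            break  # process has exited
        print("mem %d kB" % rss)
        time.sleep(5)
```

Running it alongside the harvesting job gives a rough RSS-versus-time trace without any profiler overhead.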
Thanks, I'll use it!
Hi,
I've made a fix and its backport to 12_4_X. I've seen a reduction of the RSS. (Although it was still 1.9 GB; is that still too large?)
> I've seen the reduction of the RSS.
How much is it reduced?
Matti pointed out above that `GEMDQMHarvester::dqmEndLuminosityBlock()` is taking 610 MB. Do you have a profile from before / after the change? Is the 1.9 GB only due to GEM or overall from the harvesting job?
The amount 1.9 GB is overall by the harvesting job. The RSS before the fix is 2.8 GB, so about 900 MB is saved.
> The RSS before the fix is 2.8 GB, so about 900 MB is saved.
Thanks, can you also provide what the number would be in case the harvesting job is executed without `GEMDQMHarvester` at all?
This would help to set the scale of what is reasonable.
I executed it without `GEMDQMHarvester`, and the maximum RSS was also 1.9 GB. I'm not sure if that is reasonable, but at least I can say that `GEMDQMHarvester` consumes basically little memory (after the issue is fixed).
Btw, does anyone know how to reproduce `DQMIO` files locally, especially root://cms-xrd-global.cern.ch//store/express/Run2022C/StreamExpress/DQMIO/Express-v1/000/356/381/00000/0114A6C8-C901-4386-94D4-F0386E636CA0.root?
> Btw, does anyone know how to reproduce DQMIO files locally, especially root://cms-xrd-global.cern.ch//store/express/Run2022C/StreamExpress/DQMIO/Express-v1/000/356/381/00000/0114A6C8-C901-4386-94D4-F0386E636CA0.root?
you would need to run the express reco from the streamer files of 356381, which I strongly suspect are already gone.
I see... I need to look at the geometry part more, and reproducing that file should give a good hint.
> I see... I need to look at the geometry part more, and reproducing that file should give a good hint.
There have been recently failures of the same type e.g. for run 357479.
I think the streamers for that file are still available:
$ eos ls /store/t0streamer/Data/Express/000/357/479 | grep .dat | wc -l
1044
you would need to generate the configuration via:
python RunExpressProcessing.py --scenario ppEra_Run3 --lfn /store/t0streamer/Data/Express/000/357/479/run357479_ls1012_streamExpress_StorageManager.dat --global-tag 101X_dataRun2_Express_v7 --fevt --dqmio
manually modifying the output configuration
process.maxEvents = cms.untracked.PSet(
input = cms.untracked.int32(-1)
)
process.source = cms.Source("NewEventStreamFileReader",
fileNames = cms.untracked.vstring('<put here all your input streamer files>')
)
process.options = cms.untracked.PSet(
numberOfStreams = cms.untracked.uint32(0),
numberOfThreads = cms.untracked.uint32(8)
)
to get the correct source.
Alternatively I think you can try to use a `cmsDriver` command mimicking what Tier-0 does for express, starting with RAW data as input, but that would take me more time to cook up.
Perhaps @cms-sw/dqm-l2 can help here as well.
HTH
@makortel we tried to run the igprof profiler with @sarafiorendi and we got these:
`MEM_TOTAL`:

`MEM_LIVE`:
but we don't observe a substantial reduction from the GEM DQM code. This seems at odds with the observation from @quark2 https://github.com/cms-sw/cmssw/issues/38976#issuecomment-1214889830 which is somewhat in agreement with my profile posted at https://github.com/cms-sw/cmssw/pull/39061#issuecomment-1216447394. Are we missing something in the recipe? We followed https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof
@makortel @mmusich sorry my bad, I updated the results with the PR integrated (same link as above), they show (as far as I understand) some effective reduction
> sorry my bad, I updated the results with the PR integrated (same link as above), they show (as far as I understand) some effective reduction
thanks. Still, I am not sure I understand why there is no contribution from `GEMDQMHarvester` in `MEM_LIVE`.
> Are we missing something in the recipe? We followed https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof
@mmusich @sarafiorendi Was this the part of the recipe you followed for the MEM_LIVE profile?
igprof -d -t cmsRunGlibC -mp cmsRunGlibC a.py >& a.log
igprof-analyse --sqlite -v --demangle --gdb -r MEM_LIVE IgProf.1.gz > ig.1.txt
these are the commands I ran
igprof -d -t cmsRunGlibC -mp cmsRunGlibC dump_cfg.py >& PR_30files.log
igprof-analyse --sqlite -d -v -g -r MEM_LIVE igprof.cmsRunGlibC.5556.1660650465.673794.gz | sqlite3 igreport_live.sql3
as from https://igprof.org/analysis.html#:~:text=Memory%20profiling%20reports
so maybe the second one is not correct (?)
Thanks. I'm a bit confused as to what dump exactly igprof.cmsRunGlibC.5556.1660650465.673794.gz corresponds to, since the `Validation.Performance.IgProfInfo.customise`
https://github.com/cms-sw/cmssw/blob/7654b671ce6f103ca900252d8879e1a9858f3f18/Validation/Performance/python/IgProfInfo.py#L5-L9
causes a profile to be dumped after some events, with the event entry number in the file name. Maybe it's the "end of process" dump, given that the recipe does not specify an output file (`-o`). In that case the `MEM_LIVE` would report memory leaks (which is not particularly relevant for this problem).
To see the memory usage of `GEMDQMHarvester` as in https://github.com/cms-sw/cmssw/issues/38976#issuecomment-1208610112 you can add a profile dump point to the `globalEndRun` transition with
process.IgProfService.reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz")
(where `%R` gets substituted with the run number) and convert that into the sqlite3 file.
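Collecting the dump points mentioned in this thread, the configuration fragment would look roughly like this. This is only a sketch: it assumes a `process` that already has `IgProfService` configured (e.g. via the `Validation.Performance.IgProfInfo` customise referenced above), and the file-name patterns are the ones quoted elsewhere in the thread.

```python
import FWCore.ParameterSet.Config as cms

# Sketch (assumes `process.IgProfService` is already set up in the job):
# dump a memory profile at the end-of-run transition and around input-file
# boundaries; %R (run number), %F and %C (file counts) are substituted by
# IgProfService in the output file names.
process.IgProfService.reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz")
process.IgProfService.reportToFileAtPostOpenFile = cms.untracked.string("| gzip -c > IgProf.fileOpen.%F.gz")
process.IgProfService.reportToFileAtPostCloseFile = cms.untracked.string("| gzip -c > IgProf.fileClose.%C.gz")
```

Each resulting `IgProf.*.gz` can then be converted to a sqlite3 report with `igprof-analyse` as shown earlier in the thread.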
Just for the record, this keeps happening also in 2023, despite an increase of the memory allowance for harvesting jobs at Tier-0.
A recent paused job occurred in `Express_Run367696_StreamExpress` (logs available here)
Hi,
tar xzvf /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2023C/MemoryExpress/7c897736-9d50-4d3a-8e7b-da1e3c8d796b-Periodic-Harvest-1-3-logArchive.tar.gz
cd job/WMTaskSpace/cmsRun1
cmsRun PSet.py
The job has 598 files to process; while running, the memory slowly increases as the files are opened, and then increases much faster at the end. It starts off at ~1.4 GB and is at 1.7 GB after 500 files, but then jumps up. Logging every few seconds the number of opened files and the memory usage in kB, I got
files 572 mem 1718344
files 575 mem 1718344
files 577 mem 1718344
files 580 mem 1718344
files 580 mem 1718344
files 582 mem 1833032
files 585 mem 1833032
files 587 mem 1833032
files 589 mem 1833032
files 592 mem 1833032
files 594 mem 1833032
files 597 mem 1833032
files 598 mem 1968200
files 598 mem 2586696
files 598 mem 2586696
...
And then there are lots of LogErrors of the form
%MSG-e MergeFailure: source 20-May-2023 12:58:16 CEST PostBeginProcessBlock
Found histograms with different axis limits or different labels 'ROCs hits multiplicity per event vs LS' not merged.
%MSG
and the memory usage grows further, at least up to 2859080 kB (the job is still running...)
I'm also testing igprof with the command lines and `reportToFileAtPostEndRun` suggested above
If the job is let to run, the memory usage climbs all the way to 5 GB.
I didn't manage to get anything out of igprof, despite trying to set in process.IgProfService
reportToFileAtPostEndRun = cms.untracked.string("| gzip -c > IgProf.endRun.%R.gz"),
reportToFileAtPostOpenFile = cms.untracked.string("| gzip -c > IgProf.fileOpen.%F.gz"),
reportToFileAtPostCloseFile = cms.untracked.string("| gzip -c > IgProf.fileClose.%C.gz")
I don't get any of those dumps. I also tried to get some dumps from jemalloc but with little success
I got some dumps from valgrind massif, which seem to imply that it's mostly histograms and associated root stuff (TLists, THashLists, ...), maybe as a consequence of the failed merging the job saves a lot of duplicate histograms?
If anyone has more up-to-date instructions to profile the heap I can try them.
For those interested in reproducing the problem more quickly: limiting the job to 5-10 files makes it run much faster (a few minutes), but it still exhibits a large memory growth at the end.
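One way to do that trimming, assuming the paused-job `PSet.py` ends with a `process.source` whose `fileNames` is the usual vstring of DQMIO inputs (as in the tarball jobs above), is a small edit appended to the config; this is a sketch, not part of the official job:

```python
# Sketch: append at the end of PSet.py to limit the harvesting job to its
# first 10 input files (assumes process.source has a `fileNames` vstring,
# as the DQMRootSource configurations in this thread do).
process.source.fileNames = cms.untracked.vstring(list(process.source.fileNames)[:10])
```

With 5-10 files the pathological growth at the end of the job is still visible, so this makes profiling iterations much cheaper.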
The IgProf memory profiling setup has, unfortunately, become broken.
@gartung Could you remind us the recipe to use jemalloc's profiling that works in 13_0_X (i.e. without https://github.com/cms-sw/cmssw/pull/40899)?
I'll try out VTune.
In the meantime, could @cms-sw/dqm-l2 check that the set of histograms in the merged files looks reasonable?
From a log it seems the input files are all held open until the end. Is that expected?
> From a log it seems the input files are all held open until the end. Is that expected?
That is how `DQMRootSource` has been coded
https://github.com/cms-sw/cmssw/blob/2334fc966c47ee08c48487fce4eaa80384a32904/DQMServices/FwkIO/plugins/DQMRootSource.cc#L474-L480
(since https://github.com/cms-sw/cmssw/pull/28622, IIUC; another PR related to `DQMRootSource` memory usage: https://github.com/cms-sw/cmssw/pull/30889)
> The IgProf memory profiling setup has, unfortunately, become broken.
>
> @gartung Could you remind us the recipe to use jemalloc's profiling that works in 13_0_X (i.e. without #40899)?
>
> I'll try out VTune.
>
> In the meantime, could @cms-sw/dqm-l2 check that the set of histograms in the merged files looks reasonable?
Besides setting up the version of jemalloc with profiling enabled
scram setup jemalloc-prof
scram b ToolUpdated
you need to set an environment variable
MALLOC_CONF=prof_leak:true,lg_prof_sample:10,prof_final:true cmsRun config.py
Hum. So some growth with run length is expected... Seems we (Chris) fixed this already once long ago
VTune wasn't very useful, but I managed to get the following total sizes by type:
- `TH1F`: 134.2 MB
- `TH2F`: 369.1 MB
- `TProfile2D`: 234.9 MB

In addition, the `TBuffer` constructor (via the `TBufferFile` constructor) accounts for 377.5 MB. These don't sound outrageously high to me (even if `TH2F` and `TProfile2D` are high-ish).
For the record, a new slimmed DQM sequence was defined in https://github.com/cms-sw/cmssw/pull/41944 (backported to https://github.com/cms-sw/cmssw/pull/42018 for data-taking operations).
This was deployed in era Run2023D
via https://github.com/dmwm/T0/pull/4844 (Tier0 replay: https://github.com/dmwm/T0/pull/4847).
Despite this improvement, we are now experiencing high memory usage again in DQM harvesting jobs for express at Tier0. Two recent cases [*]:
Tarballs can be found at /eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/vocms0314.cern.ch-3498656-3-log.tar.gz
@cms-sw/dqm-l2 can you have a look?
[*]
Incidentally both of them contain this concurrent cmssw exception in the logs:
----- Begin Fatal Exception 01-Sep-2024 14:13:09 UTC-----------------------
An exception of category 'InvalidCall' occurred while
[0] Processing end ProcessBlock
[1] Calling method for module EcalMEFormatter/'ecalMEFormatter'
Exception Message:
Electronics Mapping not initialized
----- End Fatal Exception -------------------------------------------------
which I find somewhat peculiar.
> Incidentally both of them contain this concurrent cmssw exception in the logs:
>
> ----- Begin Fatal Exception 01-Sep-2024 14:13:09 UTC-----------------------
> An exception of category 'InvalidCall' occurred while
> [0] Processing end ProcessBlock
> [1] Calling method for module EcalMEFormatter/'ecalMEFormatter'
> Exception Message:
> Electronics Mapping not initialized
> ----- End Fatal Exception -------------------------------------------------
Assuming the exception being thrown originates from
https://github.com/cms-sw/cmssw/blob/35487070be8fb31af15d56fd6c4d0aa8d2205ef0/DQM/EcalCommon/src/DQWorker.cc#L118-L124
I see `edso_.electronicsMap` is set in
https://github.com/cms-sw/cmssw/blob/35487070be8fb31af15d56fd6c4d0aa8d2205ef0/DQM/EcalCommon/src/DQWorker.cc#L111-L116
(note that this pattern of taking an address of an EventSetup data product in endLumi and de-referencing the pointer in endProcessBlock is not supported and not guaranteed to work @cms-sw/dqm-l2 @cms-sw/ecal-dpg-l2)
In the log I see a printout
%MSG-s ShutdownSignal: AfterSource 01-Sep-2024 14:12:54 UTC PostBeginProcessBlock
an external signal was sent to shutdown the job early.
%MSG
I guess the job didn't even start processing LuminosityBlocks (although I see begin/end Run messages in the log), and then `EcalMEFormatter` threw the exception in `endProcessBlock`. Throwing an exception because no events/lumis/runs were processed is generally bad behavior.
Is there any monitoring of memory requirements per release (slimmed things tend to grow…)?
> I guess the job didn't even start processing LuminosityBlocks (although I see begin/end Run messages in the log), and then `EcalMEFormatter` threw the exception in `endProcessBlock`. Throwing an exception because no events/lumis/runs were processed is generally bad behavior.
Thanks for this analysis!
Indeed adding this basic test:
diff --git a/DQM/EcalCommon/test/BuildFile.xml b/DQM/EcalCommon/test/BuildFile.xml
new file mode 100644
index 00000000000..67518e7f759
--- /dev/null
+++ b/DQM/EcalCommon/test/BuildFile.xml
@@ -0,0 +1,4 @@
+<bin file="testEcalCommon.cc" name="testEcalCommon">
+ <use name="FWCore/TestProcessor"/>
+ <use name="catch2"/>
+ </bin>
diff --git a/DQM/EcalCommon/test/testEcalCommon.cc b/DQM/EcalCommon/test/testEcalCommon.cc
new file mode 100644
index 00000000000..d4eb9db6a08
--- /dev/null
+++ b/DQM/EcalCommon/test/testEcalCommon.cc
@@ -0,0 +1,44 @@
+#include "FWCore/TestProcessor/interface/TestProcessor.h"
+#include "FWCore/Utilities/interface/Exception.h"
+#include "FWCore/ServiceRegistry/interface/Service.h"
+#include <fmt/format.h>
+
+#define CATCH_CONFIG_MAIN
+#include "catch.hpp"
+
+// Function to run the catch2 tests
+//___________________________________________________________________________________________
+void runTestForAnalyzer(const std::string& baseConfig, const std::string& analyzerName) {
+ edm::test::TestProcessor::Config config{baseConfig};
+
+ SECTION(analyzerName + " base configuration is OK") { REQUIRE_NOTHROW(edm::test::TestProcessor(config)); }
+
+ SECTION("Run with no LuminosityBlocks") {
+ edm::test::TestProcessor tester(config);
+ REQUIRE_NOTHROW(tester.testRunWithNoLuminosityBlocks());
+ }
+}
+
+// Function to generate base configuration string
+//___________________________________________________________________________________________
+std::string generateBaseConfig(const std::string& cfiName, const std::string& analyzerName) {
+ // Define a raw string literal
+ constexpr const char* rawString = R"_(from FWCore.TestProcessor.TestProcess import *
+from DQM.EcalCommon.{}_cfi import {}
+process = TestProcess()
+process.harvester = {}
+process.moduleToTest(process.harvester)
+process.add_(cms.Service('MessageLogger'))
+process.add_(cms.Service('JobReportService'))
+process.add_(cms.Service('DQMStore'))
+ )_";
+
+ // Format the raw string literal using fmt::format
+ return fmt::format(rawString, cfiName, analyzerName, analyzerName);
+}
+
+//___________________________________________________________________________________________
+TEST_CASE("EcalMEFormatter tests", "[EcalMEFormatter]") {
+ const std::string baseConfig = generateBaseConfig("EcalMEFormatter", "ecalMEFormatter");
+ runTestForAnalyzer(baseConfig, "EcalMEFormatter");
+}
results in:
===== Test "testEcalCommon" ====
----- Begin Fatal Exception 04-Sep-2024 03:46:28 CEST-----------------------
An exception of category 'InvalidCall' occurred while
[0] Processing end ProcessBlock
[1] Calling method for module EcalMEFormatter/'harvester'
Exception Message:
Electronics Mapping not initialized
----- End Fatal Exception -------------------------------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
testEcalCommon is a Catch v2.13.6 host application.
Run with -? for options
-------------------------------------------------------------------------------
EcalMEFormatter tests
Run with no LuminosityBlocks
-------------------------------------------------------------------------------
src/DQM/EcalCommon/test/testEcalCommon.cc:31
...............................................................................
src/DQM/EcalCommon/test/testEcalCommon.cc:33: FAILED:
REQUIRE_NOTHROW( tester.testRunWithNoLuminosityBlocks() )
due to unexpected exception with message:
An exception of category 'InvalidCall' occurred while
[0] Processing end ProcessBlock
[1] Calling method for module EcalMEFormatter/'harvester'
Exception Message:
Electronics Mapping not initialized
===============================================================================
test cases: 1 | 1 failed
assertions: 2 | 1 passed | 1 failed
---> test testEcalCommon had ERRORS
TestTime:2
^^^^ End Test testEcalCommon ^^^^
@cms-sw/ecal-dpg-l2 FYI
Is there any monitoring of memory requirements per release (slimmed things tend to grow…)?
not that I am aware of.
There seems to be another failure of this kind in run number 385889. Here's the log: https://cmst0.web.cern.ch/CMST0/tier0/pausedJobs/data/vocms0313/wmagentJob_jobid988223.txt
Hi, we'll have a look at this issue -- Kyungmin on behalf of ECAL DQM team.
DQMHarvesting is exceeding maxPSS in Express reconstruction at T0 in runs 356381, 356615, 356719 (link to cmsTalk).
We re-ran the jobs on lxplus and they completed successfully, even if with some warnings (see below).
Looking inside the tarball and at the locally running jobs, it seems that the output of cmsRun shows some warnings in HLTConfigProvider for the HLT-EGamma client and at the merging step for CTPPS. In order to investigate these warnings, two GitHub issues have been opened: https://github.com/cms-sw/cmssw/issues/38969 https://github.com/cms-sw/cmssw/issues/38970
However, looking at a similar issue observed in May, it seems that the warnings have been there for a long time and are not the root of the problem.
Running the IgProf tool for memory profiling, it seems that the main memory consumer is the `DQMFileSaver::saveForOffline` function (see /afs/cern.ch/work/r/rosma/public/Run356381/igreport_total.res and /afs/cern.ch/work/r/rosma/public/Run356615/)... but we don't know how reliable this check is, given that harvesting is not running on events. Can a software expert have a look and give us some suggestions? @makortel