Dr15Jones opened this issue 2 weeks ago
A new Issue was created by @Dr15Jones.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction, dqm
New categories assigned: reconstruction,dqm
@antoniovagnerini, @jfernan2, @mandrenguyen, @rseidita you have been requested to review this Pull request/Issue and eventually sign. Thanks
From the RSS and presently allocated memory measurements I get the following time profile of the job (8 threads and 8 streams)
If I exclude the startup memory growth and plot allocated memory versus number of events started in the job we get
which shows roughly 330 bytes of memory growth per event processed.
Trying to make a similar plot using the RSS information shows more variance
Without the PeriodicAllocMonitor (and without preloading libPerfToolsAllocMonitorPreload.so) the job runs at a poor 24 events/second. The modules taking the most time are
The events are empty, so taking ~20 ms down to ~5 ms to process nothing seems pretty high.
Thread scaling in the job is hampered by the following "one" type modules, which the framework must work around. [These showed up while running the StallMonitor service]
| Module label | # of stalls | Total stalled time | Max stalled time |
|---|---|---|---|
| ecalMonitorTask | 3 | 0.347438 s | 0.129493 s |
| NANOEDMAODSIMoutput | 10 | 2.86258 s | 0.4244 s |
| SKIMStreamLogError | 7 | 1.30872 s | 0.266045 s |
| particleLevel | 6 | 1.08233 s | 0.18974 s |
| rivetProducerHTXS | 6 | 1.08204 s | 0.189733 s |
NOTE: the NANOEDMAODSIMoutput and SKIMStreamLogError had not yet been replaced with AsciiOutputModule at the time I made the measurement.
My intent is to write a customise function which could be used to create a standard test for memory leaking/hoarding.
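For concreteness, a minimal sketch of what such a customise function might look like, using the module replacements described elsewhere in this issue (RepeatingCachedRootSource, AsciiOutputModule, PeriodicAllocMonitor). The parameter names marked as assumed are my guesses, not taken from this thread, and should be checked against the actual module definitions.

```python
# Minimal sketch, not a tested implementation; "assumed" parameter names are guesses.
import FWCore.ParameterSet.Config as cms

def customiseForMemoryTest(process):
    # Replace the PoolSource with a RepeatingCachedRootSource so the job can
    # process an arbitrary number of events with no memory growth from the source.
    process.source = cms.Source("RepeatingCachedRootSource",
        fileName = cms.untracked.string(process.source.fileNames[0]),  # assumed parameter name
        repeatNEvents = cms.untracked.uint32(10)                       # assumed parameter name
    )
    # Swap every PoolOutputModule for an AsciiOutputModule, keeping the same
    # outputCommands but turning verbosity off so nothing large is written out.
    for name, out in list(process.outputModules_().items()):
        setattr(process, name, cms.OutputModule("AsciiOutputModule",
            outputCommands = out.outputCommands,
            verbosity = cms.untracked.uint32(0)
        ))
    # Periodically sample the allocator state, as done in the jobs above.
    process.add_(cms.Service("PeriodicAllocMonitor",
        fileName = cms.untracked.string("periodicAlloc.log")           # assumed parameter name
    ))
    return process
```

In a cmsDriver workflow this could then be picked up via `--customise <Package>/<Module>.customiseForMemoryTest`, in the same way the TimeMemorySummary customise is used in the cmsDriver command later in this thread.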
From the PeriodicAllocMonitor information, we do 1.1 million allocation requests each event. Some of that will be from copying the cached input data but the rest is just churn.
Revised: I ran a job with just the RepeatingCachedRootSource and AsciiOutputModules (I removed the DQM output module as no DQM modules are in the job) where the job did not read the data products from the cache, just asked for which event to run next. It took less than 80 seconds to process 1M events using 8 threads and 8 streams, and the average number of allocations per event was less than 200.
Then if I did ask for each data product from each event it took 265 seconds to process 100k events (380 events/s) and it averaged 3400 allocations per event.
I ran the PeriodicAllocMonitor for both of those jobs (which does slow them down some) and there is no sign of any memory increase.
See #46603
So I dropped the DQM parts of the configuration and re-ran, processing 20,000 events (just to make things faster). The allocation monitoring does not show any memory increase, so it appears that the 300+ byte/event memory increase is coming from DQM-related code. In addition, the average live allocation is only 3.2 GB and the number of per-event allocations is 450k. Therefore DQM is contributing 3+ GB of memory (at 8 threads/8 streams) and more than doubles the number of allocations per event.
So I used a script to generate a configuration that matches step3 in its module and path dependencies but with each producer replaced by the trivial AddIntsProducer producer. This lets us see the effect of the framework infrastructure alone on the memory behavior.
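As a rough illustration only (this is not the actual script, it skips the dependency matching described above, and the AddIntsProducer parameter name is an assumption), the core replacement could look like:

```python
# Sketch only: the real generator script also reproduces the module/path dependencies.
import FWCore.ParameterSet.Config as cms

def replaceProducersWithAddInts(process):
    # Swap every EDProducer for the trivial test producer, which just puts a
    # couple of small int products into the Event.
    for name in list(process.producers_().keys()):
        setattr(process, name, cms.EDProducer("AddIntsProducer",
            labels = cms.VInputTag()   # assumed parameter name; no inputs wired up here
        ))
    return process
```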
Running over 70,000 events using 1 thread, I see that the job was doing 9,000 allocations per event with an average amount allocated per event of 270 kB (we had 2000 modules, with each module doing at least 2 new calls, one for each of the two products it puts into the Event), and the live allocation stayed at a near constant 22.6 MB.
Running over 140,000 events using 8 threads gave the same allocation and memory per event and a live allocation of 83.3MB.
So the framework itself only accounts for about 1% of the allocations done per event (9,000 of the ~1.1 million total).
I used the fixed ModuleAllocMonitor service to look through the job for modules that do lots of allocations during event processing. Here are the largest offenders (see the configuration sketch after this list for how to point the monitor at specific modules):
valEmtfStage2Digis: L1TMuonEndCapTrackProducer
uses 1.3M allocations on the first event of each stream and then 20k for all subsequent events. The startup is from here
https://github.com/cms-sw/cmssw/blob/aca061d3ab55040afa3278e946bfd4b1b0573ed4/L1Trigger/L1TMuonEndCap/plugins/L1TMuonEndCapTrackProducer.cc#L19
slimmedPatTrigger:PATTriggerObjectStandAloneSlimmer
uses 1.3M allocations on the first event of 6 of the 8 streams in the job and then just 100 on all others. The startup cost comes from here
https://github.com/cms-sw/cmssw/blob/aca061d3ab55040afa3278e946bfd4b1b0573ed4/DataFormats/PatCandidates/src/TriggerObjectStandAlone.cc#L389-L394
and the reason only 6 of the 8 streams are affected is that those 6 hit that code section at about the same time, so they all try to fill in that thread-safe global.
patTrigger:PATTriggerProducer
always allocates 330k times each event. This is doing lots of stuff on the HLT trigger information which is still filled in for this workflow even though the events are empty.
dqmHLTFiltersDQMonitor:HLTFiltersDQMonitor
always allocates 158k times each event.
valBmtfDigis:L1TMuonBarrelTrackProducer
always allocates 85k times each event.
HLTSiStripMonitorTrack: SiStripMonitorTrack
always allocates 65k times each event.
All subsequent modules allocate less than 50k times each event.
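For anyone wanting to look at individual modules this way, a configuration fragment along these lines restricts the ModuleAllocMonitor output to chosen modules; the fileName and moduleNames parameters are the same ones used in the recipe quoted later in this thread, and the module labels here are just examples taken from the list above.

```python
# Fragment to append to the step3 configuration (assumes 'process' and 'cms'
# already exist there); parameters as used in the recipe below.
process.add_(cms.Service("ModuleAllocMonitor",
    fileName = cms.untracked.string("moduleAlloc.log"),
    moduleNames = cms.untracked.vstring("valEmtfStage2Digis", "patTrigger")
))
```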
> valEmtfStage2Digis: L1TMuonEndCapTrackProducer
> uses 1.3M allocations on the first event of each stream and then 20k for all subsequent events. The startup is from here
This one already has an open issue https://github.com/cms-sw/cmssw/issues/42526
> HLTSiStripMonitorTrack: SiStripMonitorTrack
> always allocates 65k times each event.
This one has an open issue as well https://github.com/cms-sw/cmssw/issues/46498
See #46628, which decreases patTrigger:PATTriggerProducer allocations from 330k to 48k.
> HLTSiStripMonitorTrack: SiStripMonitorTrack
> always allocates 65k times each event.
For my own education, can you explain the mechanics behind obtaining these numbers? I've been trying the following recipe:
cmsrel CMSSW_14_2_X_2024-11-07-2300
cd CMSSW_14_2_X_2024-11-07-2300/src/
cmsenv
runTheMatrix.py -l 12861.0 --maxSteps=2
cd 12861.0_NuGun+2024
cmsDriver.py step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,SKIM:LogError+LogErrorMonitor,PAT,NANO,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2024_realistic --datatier RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3_2024 --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --filein file:step2.root --fileout file:step3.root --no_exec
echo 'process.add_(cms.Service("ModuleAllocMonitor", fileName = cms.untracked.string("moduleAlloc.log")))' >> step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
echo 'process.ModuleAllocMonitor.moduleNames = cms.untracked.vstring(["HLTSiStripMonitorTrack"])' >> step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
setenv LD_PRELOAD "libPerfToolsAllocMonitorPreload.so libPerfToolsMaxMemoryPreload.so"
cmsRun step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
wget https://raw.githubusercontent.com/cms-sw/cmssw/refs/heads/master/PerfTools/AllocMonitor/scripts/edmModuleAllocMonitorAnalyze.py
python3 edmModuleAllocMonitorAnalyze.py moduleAlloc.log
(notice that edmModuleAllocMonitorAnalyze.py was not in the path; I tried to amend that at https://github.com/cms-sw/cmssw/pull/46633)
That gives me:
Memory Report: total memory requested: 95750866
Memory Report: max memory used: 9708024
Memory Report: presently used: 643496
Memory Report: # allocations calls: 17393
Memory Report: # deallocations calls: 17024
How should I read these numbers?
@mmusich you actually load two different monitors: one a Service (ModuleAllocMonitor) and the other started because you added it to LD_PRELOAD (libPerfToolsMaxMemoryPreload.so). The printout you refer to is from the LD_PRELOAD library and gives statistics about the entire program.
The data obtained by the ModuleAllocMonitor was written to the file you specified in the Service's configuration, i.e. moduleAlloc.log. That is where you want to look for information specific to HLTSiStripMonitorTrack. In principle you should be able to point edmModuleAllocMonitorAnalyze.py at the file moduleAlloc.log to get useful information.
The 85k from L1TMuonBarrelTrackProducer is brought down to 19 for all but the first event of a stream in #46640.
I started from workflow 12861.0 (which is a neutrino gun) and then modified step3 (reconstruction) by
- replacing the PoolSource with a RepeatingCachedRootSource. This allows the job to run an infinite number of events without any memory growth from the source (as all events are already cached in memory)
- replacing each PoolOutputModule with an AsciiOutputModule, using identical outputCommands and with verbosity turned off. This avoids problems with the output file sizes growing too large and makes the jobs run faster
- adding the Service SimpleMemoryCheck with periodic (every 10 seconds) sampling turned on and/or the Service PeriodicAllocMonitor (sampling every 10 seconds)

The full step3 cmsDriver.py command I started from was