cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Using neutrino gun events to probe memory growth #46598

Open Dr15Jones opened 2 weeks ago

Dr15Jones commented 2 weeks ago

I started from workflow 12861.0 (which is a neutrino gun) and then modified step3 (reconstruction) by

  1. Replaced PoolSource with RepeatingCachedRootSource. This allows the job to run an effectively infinite number of events without any memory growth from the source (as all events are already cached in memory).
  2. Replaced each PoolOutputModule with an AsciiOutputModule, keeping identical outputCommands and turning verbosity off. This avoids problems with the output file sizes growing too large and makes the jobs run faster.
  3. Depending on the exact measurement being done, added the Service SimpleMemoryCheck with periodic (every 10 seconds) sampling turned on and/or the Service PeriodicAllocMonitor (sampling every 10 seconds).
  4. Set the framework's has-started-processing-event message to only print once every 100 events (makes the job run faster and avoids clogging the output file).

The full step3 cmsDriver.py command I started from was

step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,SKIM:LogError+LogErrorMonitor,PAT,NANO,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2024_realistic --datatier RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3_2024 --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --filein file:step2.root --fileout file:step3.root
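
For reference, a minimal sketch of a customise function implementing the four changes listed above (a sketch only: parameter names such as fileName and verbosity, and the 10-second sampling switches of SimpleMemoryCheck and PeriodicAllocMonitor, are assumptions not spelled out in this issue):

import FWCore.ParameterSet.Config as cms

def customiseForMemoryProbe(process):
    # 1. replay the cached step2 events indefinitely (parameter name assumed)
    process.source = cms.Source("RepeatingCachedRootSource",
        fileName = cms.untracked.string("file:step2.root"))
    # 2. swap each PoolOutputModule for an AsciiOutputModule with the same
    #    outputCommands and verbosity turned off (verbosity name assumed)
    for name, module in list(process.outputModules_().items()):
        if module.type_() == "PoolOutputModule":
            setattr(process, name, cms.OutputModule("AsciiOutputModule",
                outputCommands = module.outputCommands,
                verbosity = cms.untracked.uint32(0)))
    # 3. memory monitoring services; the periodic-sampling options are left at
    #    their defaults here since their parameter names are not given above
    process.add_(cms.Service("SimpleMemoryCheck"))
    process.add_(cms.Service("PeriodicAllocMonitor"))
    # 4. print the "Begin processing the Nth record" message only every 100 events
    process.MessageLogger.cerr.FwkReport.reportEvery = 100
    return process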
cmsbuild commented 2 weeks ago

cms-bot internal usage

cmsbuild commented 2 weeks ago

A new Issue was created by @Dr15Jones.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 2 weeks ago

assign reconstruction, dqm

cmsbuild commented 2 weeks ago

New categories assigned: reconstruction,dqm

@antoniovagnerini,@jfernan2,@mandrenguyen,@rseidita you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 2 weeks ago

From the RSS and presently allocated memory measurements I get the following time profile of the job (8 threads and 8 streams)

[image: RSS and presently allocated memory vs. time in the job]

If I exclude the startup memory growth and plot allocated memory versus number of events started in the job we get

[image: allocated memory vs. number of events started]

which shows a memory growth of approximately 330 bytes per event processed.

Trying to make a similar plot using the RSS information shows more variance

[image: RSS vs. number of events started]
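
For what it's worth, the ~330 bytes/event figure is just the slope of the allocated-memory-versus-events-started line; a minimal sketch of extracting it from the periodic samples follows (the sample values are illustrative, not the actual measurements, and the parsing of the monitor output is not shown):

import numpy as np

# (events started, allocated bytes) pairs, assumed parsed from the periodic samples
events_started  = np.array([1_000, 5_000, 10_000, 20_000, 40_000])
allocated_bytes = np.array([3.2000e9, 3.2013e9, 3.2030e9, 3.2063e9, 3.2129e9])

# linear fit: the slope is the per-event memory growth
slope, intercept = np.polyfit(events_started, allocated_bytes, 1)
print(f"memory growth per event: {slope:.0f} bytes")  # ~330 bytes/event for these illustrative numbers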
Dr15Jones commented 2 weeks ago

Without the PeriodicAllocMonitor (and without preloading libPerfToolsAllocMonitorPreload.so) the job runs at a poor 24 events/second. The modules taking the most time are

[image: per-module timing summary]

The events are empty, so taking ~5 to ~20 ms per module to process nothing seems pretty high.

Dr15Jones commented 2 weeks ago

Thread scaling in the job is hampered by the following "one"-type modules, which the framework must work around. [These showed up while running the StallMonitor service.]

StallMonitor>         Module label  # of stalls  Total stalled time  Max stalled time
StallMonitor>  -------------------  -----------  ------------------  ----------------
StallMonitor>  ecalMonitorTask                3          0.347438 s        0.129493 s
StallMonitor>  NANOEDMAODSIMoutput           10           2.86258 s          0.4244 s
StallMonitor>  SKIMStreamLogError             7           1.30872 s        0.266045 s
StallMonitor>  particleLevel                  6           1.08233 s         0.18974 s
StallMonitor>  rivetProducerHTXS              6           1.08204 s        0.189733 s

NOTE: the NANOEDMAODSIMoutput and SKIMStreamLogError had not yet been replaced with AsciiOutputModule at the time I made the measurement.
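
For completeness, the stall summary above comes from adding the StallMonitor service to the job, e.g. by appending something like the following to the step3 configuration (the optional fileName parameter, assumed here, additionally writes a detailed per-transition trace that can be analysed offline):

process.add_(cms.Service("StallMonitor",
    fileName = cms.untracked.string("stall.log")))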

Dr15Jones commented 2 weeks ago

My intent is to write a customize function which could be used to create a standard test for memory leaking/hoarding.

Dr15Jones commented 2 weeks ago

From the PeriodicAllocMonitor information, we do 1.1 million allocation requests each event. Some of that will be from copying the cached input data but the rest is just churn.

Revised: I ran a job with just the RepeatingCachedRootSource and the AsciiOutputModules (I removed the DQM output module since no DQM modules are in the job) in which the job did not read the data products from the cache but only asked which event to run next. It took less than 80 seconds to process 1M events using 8 threads and 8 streams, and the average number of allocations per event was less than 200.

Then if I did ask for each data product from each event it took 265 seconds to process 100k events (380 events/s) and it averaged 3400 allocations per event.

I ran the PeriodicAllocMonitor for both of those jobs (which does slow them down somewhat) and there is no sign of any memory increase.

Dr15Jones commented 2 weeks ago

See #46603

Dr15Jones commented 2 weeks ago

So I dropped the DQM parts of the configuration and re-ran, processing 20,000 events (just to make things faster). The allocation monitoring does not show any memory increase, so it appears that the 300+ bytes/event memory increase is coming from DQM-related code. In addition, the average live allocation is only 3.2 GB, with the number of per-event allocations being 450k. Therefore DQM is contributing 3+ GB of memory (at 8 threads / 8 streams) and more than doubles the number of allocations per event.

[image: allocation monitoring result without the DQM sequences]
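
A hypothetical sketch of how the DQM parts could be dropped from the configuration for such a test (the name-based matching below is only an illustration; it is not necessarily how the configuration was actually pruned):

import FWCore.ParameterSet.Config as cms

def dropDQM(process):
    # remove any Path/EndPath whose name suggests DQM content or DQMIO output,
    # together with its entry in the schedule (heuristic, for illustration only)
    for pathName in list(process.paths_()) + list(process.endpaths_()):
        if "DQM" in pathName or "dqm" in pathName:
            path = getattr(process, pathName)
            if process.schedule is not None and path in process.schedule:
                process.schedule.remove(path)
            delattr(process, pathName)
    return process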
Dr15Jones commented 2 weeks ago

So I used a script to generate a configuration that matches step3 in its module and path dependencies but with each producer replaced by the trivial AddIntsProducer. This allows us to see the effect of the framework infrastructure on the memory behavior.

Running over 70,000 events using 1 thread, I see that the job was doing 9,000 allocations per event with an average amount allocated per event of 270 kB (the job has 2,000 modules, each doing at least 2 new calls, one for each of the two products it puts into the Event), and the live allocation stayed at a near-constant 22.6 MB.

Running over 140,000 events using 8 threads gave the same allocation and memory per event and a live allocation of 83.3MB.

So the framework accounts for only about 1% (9,000 of the 1.1 million) of the allocations being done per event.

Dr15Jones commented 2 weeks ago

I used the fixed ModuleAllocMonitor service to look for modules in the job that do lots of allocations during event processing. Here are the largest:

valEmtfStage2Digis: L1TMuonEndCapTrackProducer uses 1.3M allocations on the first event of each stream and then 20k for all subsequent events. The startup is from here https://github.com/cms-sw/cmssw/blob/aca061d3ab55040afa3278e946bfd4b1b0573ed4/L1Trigger/L1TMuonEndCap/plugins/L1TMuonEndCapTrackProducer.cc#L19

slimmedPatTrigger:PATTriggerObjectStandAloneSlimmer uses 1.3M allocations on the first event of 6 of the 8 streams in the job and then just 100 on all others. The startup cost comes from here https://github.com/cms-sw/cmssw/blob/aca061d3ab55040afa3278e946bfd4b1b0573ed4/DataFormats/PatCandidates/src/TriggerObjectStandAlone.cc#L389-L394 and the reason only 6 of the 8 are affected is that those 6 streams hit that code section at about the same time, so they are all trying to fill in that thread-safe global.

patTrigger:PATTriggerProducer always allocates 330k times each event. This is doing lots of stuff on the HLT trigger information which is still filled in for this workflow even though the events are empty.

dqmHLTFiltersDQMonitor:HLTFiltersDQMonitor always allocates 158k times each event.

valBmtfDigis:L1TMuonBarrelTrackProducer always allocates 85k times each event.

HLTSiStripMonitorTrack: SiStripMonitorTrack always allocates 65k times each event.

All other modules allocate fewer than 50k times per event.

makortel commented 2 weeks ago

valEmtfStage2Digis: L1TMuonEndCapTrackProducer uses 1.3M allocations on the first event of each stream and then 20k for all subsequent events. The startup is from here

https://github.com/cms-sw/cmssw/blob/aca061d3ab55040afa3278e946bfd4b1b0573ed4/L1Trigger/L1TMuonEndCap/plugins/L1TMuonEndCapTrackProducer.cc#L19

This one already has an open issue https://github.com/cms-sw/cmssw/issues/42526

HLTSiStripMonitorTrack: SiStripMonitorTrack always allocates 65k times each event.

This one has an open issue as well https://github.com/cms-sw/cmssw/issues/46498

Dr15Jones commented 2 weeks ago

See #46628 which decreases patTrigger:PATTriggerProducer allocations from 330k to 48k.

mmusich commented 2 weeks ago

HLTSiStripMonitorTrack: SiStripMonitorTrack always allocates 65k times each event.

for my own education, can you explain the mechanics behind obtaining these numbers? I've been trying the following recipe:

cmsrel CMSSW_14_2_X_2024-11-07-2300
cd CMSSW_14_2_X_2024-11-07-2300/src/
cmsenv
runTheMatrix.py -l 12861.0 --maxSteps=2
cd 12861.0_NuGun+2024
cmsDriver.py step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,SKIM:LogError+LogErrorMonitor,PAT,NANO,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions auto:phase1_2024_realistic --datatier RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3_2024 --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --filein file:step2.root --fileout file:step3.root --no_exec
echo 'process.add_(cms.Service("ModuleAllocMonitor", fileName = cms.untracked.string("moduleAlloc.log")))' >> step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
echo 'process.ModuleAllocMonitor.moduleNames = cms.untracked.vstring(["HLTSiStripMonitorTrack"])' >> step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
setenv LD_PRELOAD "libPerfToolsAllocMonitorPreload.so libPerfToolsMaxMemoryPreload.so"
cmsRun step3_RAW2DIGI_L1Reco_RECO_RECOSIM_SKIM_PAT_NANO_DQM.py
wget https://raw.githubusercontent.com/cms-sw/cmssw/refs/heads/master/PerfTools/AllocMonitor/scripts/edmModuleAllocMonitorAnalyze.py
python3 edmModuleAllocMonitorAnalyze.py moduleAlloc.log

(notice that edmModuleAllocMonitorAnalyze.py was not in the path, I tried to amend that at https://github.com/cms-sw/cmssw/pull/46633)

That gives me:

Memory Report: total memory requested: 95750866
Memory Report:  max memory used: 9708024
Memory Report:  presently used: 643496
Memory Report:  # allocations calls:   17393
Memory Report:  # deallocations calls: 17024

How should I read these numbers?

Dr15Jones commented 2 weeks ago

@mmusich you are actually loading two different monitors: one is a Service (ModuleAllocMonitor) and the other starts because you added it to LD_PRELOAD (libPerfToolsMaxMemoryPreload.so). The printout you refer to comes from the LD_PRELOAD library and gives statistics about the entire program.

The data obtained by the ModuleAllocMonitor were written to the file you specified in the Service's configuration (i.e. moduleAlloc.log). That is where you want to look for information specific to HLTSiStripMonitorTrack. In principle you should be able to point edmModuleAllocMonitorAnalyze.py at the file moduleAlloc.log to get useful information.

Dr15Jones commented 2 weeks ago

The 85k from L1TMuonBarrelTrackProducer is brought down to 19 for all but the first event of a stream in #46640.