cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.06k stars 4.26k forks source link

excessing memory usage in CMSSW_13_0_11 #42281

Open sunilUIET opened 1 year ago

sunilUIET commented 1 year ago

We observe excessing memory even crossing 16G leading MC relval failures[1] in CMSSW_13_0_11. Corresponding cmsdriver commands can be found here [2]. It may or may not be related to the memory issue reported with data re-processing https://github.com/cms-sw/cmssw/issues/40437

[1] https://cms-unified.web.cern.ch/cms-unified/showlog/?search=CMSSW_13_0_11__fullSim_PU_14TeV_RV204-TTbar_14TeV-00002 [2] https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_13_0_11__fullSim_PU_14TeV_RV204-TTbar_14TeV-00002

cmsbuild commented 1 year ago

A new Issue was created by @sunilUIET sunil bansal.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

VinInn commented 1 year ago

a log file would be useful

sunilUIET commented 1 year ago

@z4027163 is it possible to retrieve the detail log for the WF?

VinInn commented 1 year ago

no need. easy to reproduce (are just 10 events) step3 starts directly at 13GB!

[innocent@lxplus803 relCrash]$ grep RSS step3.log | head MemoryCheck: module PoolOutputModule:NANOEDMAODSIMoutput VSIZE 19215.7 0.0078125 RSS 13361.6 -38.1055 MemoryCheck: module SimpleCandidateFlatTableProducer:isoTrackTable VSIZE 19215.7 0.0078125 RSS 13361.6 0 MemoryCheck: module NanoAODDQM:nanoDQMMC VSIZE 19215.7 0 RSS 13361.9 0.398438

VinInn commented 1 year ago

Th eonly thing I can say is that it is normal for a job with 8 threads. Nothing special (we are supposed to stay <2GB per thread). igprof in https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/step_3_cfg__CMSSW_13_0_11_el8_amd64_gcc11_memTot/self https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/step_3_cfg__CMSSW_13_0_11_el8_amd64_gcc11_memPeak/self

single threaded stays in 7GB [innocent@lxplus803 relCrash]$ grep RSS step3_1T.log | head MemoryCheck: module OutsideInMuonSeeder:muonSeededSeedsOutIn VSIZE 8774.16 0.0078125 RSS 6067.46 -121.195 MemoryCheck: module TrackCollectionMerger:preDuplicateMergingGeneralTracks VSIZE 8774.16 0 RSS 6068.55 1.08984 MemoryCheck: module PrimaryVertexProducer:unsortedOfflinePrimaryVertices VSIZE 8774.16 0 RSS 6070.61 2.0625 MemoryCheck: module PFTrackProducer:lowPtGsfElePfTracks VSIZE 8774.16 0 RSS 6071.12 0.515625 MemoryCheck: module PFDisplacedVertexProducer:particleFlowDisplacedVertex VSIZE 8774.16 0 RSS 6071.38 0.257812 MemoryCheck: module TrackVertexArbitrator:trackVertexArbitrator VSIZE 8774.16 0 RSS 6073.45 2.0625 MemoryCheck: module TrackRefitter:refittedForPixelDQM VSIZE 8774.16 0 RSS 6092.27 18.8203 MemoryCheck: module MonitorTrackResiduals:MonitorTrackResiduals VSIZE 8774.16 0 RSS 6092.52 0.257812 MemoryCheck: module MuonAssociatorEDProducer:tpToTkMuonAssociation VSIZE 8774.16 0 RSS 6093.04 0.515625 MemoryCheck: module SiStripTrackingRecHitsValid:StripTrackingRecHitsValid VSIZE 8774.16 0 RSS 6105.41 12.3711 [innocent@lxplus803 relCrash]$ grep RSS step3_1T.log | tail MemoryCheck: module PFCandidateFwdPtrCollectionPdgIdFilter:pfPileUpAllChargedParticles VSIZE 9687.91 0 RSS 6613.33 0.328125 MemoryCheck: module MuonSimClassifier:muonSimClassifier VSIZE 9687.91 0 RSS 6645.05 31.7266 MemoryCheck: module MuonAssociatorEDProducer:tpToPFMuonAssociation VSIZE 9687.91 0 RSS 6646.26 1.20703 MemoryCheck: module BDHadronTrackMonitoringAnalyzer:BDHadronTrackMonitoringAnalyze VSIZE 9687.91 0 RSS 6647.79 1.53125 MemoryCheck: module MultiTrackValidator:trackValidator VSIZE 9687.91 0 RSS 6647.9 0.105469 MemoryCheck: module GlobalRecHitsAnalyzer:globalrechitsanalyze VSIZE 9687.91 0 RSS 6655.64 7.74219 MemoryCheck: module DiDispStaMuonMonitor:hltDiDispStaMuon10CosmicVetoL3Mu0DxyMax1cmMonitoring VSIZE 9687.92 0.0078125 RSS 6655.19 -0.453125 MemoryCheck: module L1TStage2Layer2Producer:valCaloStage2Layer2DigisOffline VSIZE 9687.92 0 RSS 6655.73 0.539062 MemoryCheck: module PoolOutputModule:RECOSIMoutput VSIZE 9687.92 0 RSS 6660.89 5.16016 MemoryCheck: module PFCandidatePrimaryVertexSorter:primaryVertexWithBSAssociation VSIZE 9687.92 0 RSS 6661.4 0.511719

VinInn commented 1 year ago

Since when we run RelVal multithreaded?

Dr15Jones commented 1 year ago

Since when we run RelVal multithreaded?

I think since the beginning of Run 2 (since we've run multi-threaded production that long).

VinInn commented 1 year ago

Anyhow 13GB for 8 threads is inside the limits.

VinInn commented 1 year ago

This is the GB used for string cut parsing... (in each single thread) https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/step_3_cfg__CMSSW_13_0_11_el8_amd64_gcc11_memTot/107

most of it in HLTMuonMatchAndPlot

sunilUIET commented 11 months ago

Hi, do we have any news on this topic? If we need to provide any other information please let us know.

VinInn commented 11 months ago

My understanding is that it is expected. An 8 thread job may need more than 12GB (up to 16GB).

sunilUIET commented 11 months ago

Jobs are crossing even more than 16GB leading to crash. I don't think we have possibility to use more 16GB

VinInn commented 11 months ago

Can you share a log file of the crash?

sunilUIET commented 11 months ago

With relval submission we have limited log as in [1] and this is produced with cmsdriver commands[2]

[1] https://cms-unified.web.cern.ch/cms-unified/showlog/?search=CMSSW_13_0_11__fullSim_PU_RV209-ZEE_14-00001 [2] https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_13_0_11__fullSim_PU_RV209-ZEE_14-00001

VinInn commented 11 months ago

Sorry w/o the log produced by the job itself I cannot make any analysis.

VinInn commented 11 months ago

found this https://cms-unified.web.cern.ch/cms-unified/condorlogs/pdmvserv_RVCMSSW_13_0_11ZEE_14__RV209_230727_063833_4566/50660/RecoNano_2023/2269704a-3334-4c76-9bca-44190dd5587d-1-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

Yes, memory grows beyond 16GB in 100 events. nothing particular. Either we increase the limit or we need to work and reduce memory footprint of the various modules...

VinInn commented 11 months ago

on lxplus it swaps 2.GB so it stays below 16GB... (no it is not a good news)

[innocent@lxplus801 relval]$ cat /proc/462367/smaps_rollup
00400000-fffffbec0000 ---p 00000000 00:00 0                              [rollup]
Rss:            14235392 kB
Pss:            14229478 kB
Shared_Clean:       6400 kB
Shared_Dirty:         64 kB
Private_Clean:    449600 kB
Private_Dirty:  13779328 kB
Referenced:     14235392 kB
Anonymous:      13807872 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:            2153024 kB
SwapPss:         2153024 kB
Locked:                0 kB
VinInn commented 11 months ago

runnning the same config with CMSSW_13_0_10 makes no difference. (config not cmsdriver). Which was the latest version that this relval was successful?

VinInn commented 11 months ago

CMSSW_12 does not work with this config as

 RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx anywhere in the search path.
VinInn commented 11 months ago

the config contains all those models

[innocent@lxplus801 relval]$ grep onnx config.py
    weightFile = cms.FileInPath('PhysicsTools/NanoAOD/data/breg_training_2018.onnx')
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/full/resnet.onnx'),
    weightFile = cms.FileInPath('PhysicsTools/NanoAOD/data/creg_training_2018.onnx')
    model_path = cms.FileInPath('RecoParticleFlow/PFProducer/data/mlpf/mlpf_2021_11_16__no_einsum__all_data_cms-best-of-asha-scikit_20211026_042043_178263.workergpu010.onnx'),
    weightFile = cms.FileInPath('RecoMuon/MuonIdentification/data/mvaID.onnx')
    mvaIDTrainingFile = cms.FileInPath('RecoMuon/MuonIdentification/data/mvaID.onnx'),
    mvaIDTrainingFile = cms.FileInPath('RecoMuon/MuonIdentification/data/mvaID.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/full/resnet.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepVertex/phase1_deepvertexcombined.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDB.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDCvB.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDC.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepVertex/phase1_deepvertex.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/HiggsInteractionNet/V00/IN.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/HiggsInteractionNet/V00/IN.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/decorrelated/resnet.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/MD-2prong/V01/modelfile/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDB_mass_independent.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/BvL.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/BvL.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDCvB_mass_independent.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvB.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvB.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDC_mass_independent.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvL.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvL.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/modelfile/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/modelfile/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/modelfile/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Central/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Central/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Forward/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Forward/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Central/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Central/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Forward/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Forward/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK8/particle-net.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/General/V01/modelfile/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/General/V01/modelfile/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/MassRegression/V01/modelfile/model.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/RobustParTAK4/PUPPI/V00/RobustParTAK4.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/RobustParTAK4/PUPPI/V00/RobustParTAK4.onnx'),
    model_path = cms.FileInPath('RecoBTag/Combined/data/RobustParTAK4/PUPPI/V00/RobustParTAK4.onnx'),

Question is: WHY the very same model is opened more than ones?

sunilUIET commented 11 months ago

Previous high stat relval was in 13_0_0 and we haven't faced any issue there. For other higher version, we ran only standard relval where usual event/job is less than 100 and we faced no issue even in 13_0_10. Another information if relevant, For no pileup production, there is no such issue in 13_0_11.

VinInn commented 11 months ago

This job exceed 16GB immediately. 13_0_0 does not run either because RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx is not found

So I would say that NEW DeepJet model is a bit too large...

slava77 commented 11 months ago

Previous high stat relval was in 13_0_0 and we haven't faced any issue there.

please share a link to this previous relval configuration

sunilUIET commented 11 months ago

https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_13_0_0__fullsim_PU_2022_14TeV_HS-TTbar_14TeV-00003

VinInn commented 11 months ago

I made a simple change to the ONNXRuntime to make each model really global (using a global map) https://github.com/cms-sw/cmssw/compare/master...VinInn:cmssw:onnxRuntime?expand=1

The effect on the memory is essentially negligible as many of the duplicates are actually not instantiated...

creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoMuon/MuonIdentification/data/mvaID.onnx
reusing runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoMuon/MuonIdentification/data/mvaID.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx
reusing runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/HiggsInteractionNet/V00/IN.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetAK8/MD-2prong/V01/modelfile/model.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/DeepDoubleX/102X/V02/BvL.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvB.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvL.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/modelfile/model.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Central/particle-net.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Forward/particle-net.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Central/particle-net.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Forward/particle-net.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetFromMiniAODAK8/particle-net.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetAK8/General/V01/modelfile/model.onnx
reusing runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetAK8/General/V01/modelfile/model.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetAK8/MassRegression/V01/modelfile/model.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/RobustParTAK4/PUPPI/V00/RobustParTAK4.onnx
reusing runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/RobustParTAK4/PUPPI/V00/RobustParTAK4.onnx
slava77 commented 11 months ago

https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_13_0_0__fullsim_PU_2022_14TeV_HS-TTbar_14TeV-00003

thanks. Notably, this one had --nStreams 2 --nThreads 8 while the 13_0_11 variant has just --nThreads 8 the latter would definitely need more memory.

sunilUIET commented 11 months ago

Thanks @slava77 for the pointer! I see for standard relval it is usually with nStreams 2 which we do not change manually. Need to say how it came in. Anyway, we are retesting with nStream 2

VinInn commented 11 months ago

looking at igprof I do not see anything peculiar: most of the memory is allocated by the tracking validation... http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_13_0_11_slc7_amd64_gcc11_memTot/self

http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_13_0_11_slc7_amd64_gcc11_memTot/16 http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_13_0_11_slc7_amd64_gcc11_memTot/21

VinInn commented 11 months ago

What PU is this?

sunilUIET commented 11 months ago

@slava77 @VinInn with --nStream = 2 memory usage seems to be under control. As far as PU is concerned, we are using 2023_Fills_8807_8901_ProjectedPileup_PoissonOOTPU . From the relval side, this ticket can be closed.

VinInn commented 11 months ago

Maybe Tracking POG should check if there is not to much OOTPU in this samples (at least for what strip and pixels are concerned) and eventually restrict the number of crossing used in the hit associator.

In my opinion RECO should understand if 2GB per thread is still an achievable goal or one needs to negotiate a higher memory allocation.

VinInn commented 11 months ago

btw: here igprof for just one-event. again nothing special: yes it seems dominated by string manipulation and there is memory churn that can be avoided in ParameterSet and in the tokenizer, still...

slava77 commented 11 months ago

looking at igprof I do not see anything peculiar: most of the memory is allocated by the tracking validation...

there are all MEM_TOTAL. This is informative for mem churn, but not the current use.\ Do you have a MEM_LIVE?

VinInn commented 11 months ago

this is mem live for 10 events https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_13_0_11_slc7_amd64_gcc11_memPeak/cumulative and for one event https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config_1_CMSSW_13_0_11_slc7_amd64_gcc11_memPeak

VinInn commented 11 months ago

Wherever I look it seems to me that there are simply way too many PSimHit to fit in 2GB per thread.

VinInn commented 11 months ago

with 6 streams is still marginal: can reach 16GB. with 4 streams (and 8 threads) should be ok. Throughput at most few % less.

makortel commented 11 months ago

assign reconstruction, simulation, dqm

FYI @cms-sw/tracking-pog-l2

cmsbuild commented 11 months ago

New categories assigned: dqm,reconstruction,simulation

@mdhildreth,@tjavaid,@micsucmed,@nothingface0,@rvenditti,@mandrenguyen,@emanueleusai,@syuvivida,@clacaputo,@civanch,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

slava77 commented 11 months ago

Maybe Tracking POG should check if there is not to much OOTPU in this samples (at least for what strip and pixels are concerned) and eventually restrict the number of crossing used in the hit associator.

I'm curious though which parts require the simhits. AFAIK, trackingOnly parts of the validation do not need pileup replay.

makortel commented 11 months ago

Maybe Tracking POG should check if there is not to much OOTPU in this samples (at least for what strip and pixels are concerned) and eventually restrict the number of crossing used in the hit associator.

I'm curious though which parts require the simhits. AFAIK, trackingOnly parts of the validation do not need pileup replay.

My vague recollection is that while the tracking validation indeed does not need the pileup replay, the tracker validation has components that use the "more expensive" hit associator(s). Vincenzo's profile (https://github.com/cms-sw/cmssw/issues/42281#issuecomment-1657090649) confirms http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_13_0_11_slc7_amd64_gcc11_memTot/37 and points to GlobalRecHitsAnalyzer, SiStripRecHitsValid, SiStripTrackingRecHitsValid, and SiPixelPhase1RecHitsV.

VinInn commented 11 months ago

Is the Associator running on the full OOT or is limited only to the crossing meaningful for the detector in question? (-12,+5 is huge for the tracker!)

slava77 commented 11 months ago

I doubt that more than [-3,+1] is needed. But doesn't mixing module load all crossing frames with all hits anyways? But I guess the point is to make it configurable.

VinInn commented 11 months ago

Indeed there are two levels where actions can be taken to avoid considering SimHits outside the "digitization" time-frame. 1) in the mixing module 2) specifically for the tracker in https://cmssdt.cern.ch/dxr/CMSSW/source/SimTracker/TrackerHitAssociation/src/TrackerHitAssociator.cc#137