sunilUIET (sunil bansal) opened this issue 1 year ago
A new Issue was created by @sunilUIET sunil bansal.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
a log file would be useful
@z4027163 is it possible to retrieve the detail log for the WF?
no need. easy to reproduce (it's just 10 events): step3 starts directly at 13GB!
[innocent@lxplus803 relCrash]$ grep RSS step3.log | head
MemoryCheck: module PoolOutputModule:NANOEDMAODSIMoutput VSIZE 19215.7 0.0078125 RSS 13361.6 -38.1055
MemoryCheck: module SimpleCandidateFlatTableProducer:isoTrackTable VSIZE 19215.7 0.0078125 RSS 13361.6 0
MemoryCheck: module NanoAODDQM:nanoDQMMC VSIZE 19215.7 0 RSS 13361.9 0.398438
The only thing I can say is that it is normal for a job with 8 threads. Nothing special (we are supposed to stay <2GB per thread). igprof in
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/step_3_cfg__CMSSW_13_0_11_el8_amd64_gcc11_memTot/self
https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/step_3_cfg__CMSSW_13_0_11_el8_amd64_gcc11_memPeak/self
single-threaded stays within 7GB
[innocent@lxplus803 relCrash]$ grep RSS step3_1T.log | head
MemoryCheck: module OutsideInMuonSeeder:muonSeededSeedsOutIn VSIZE 8774.16 0.0078125 RSS 6067.46 -121.195
MemoryCheck: module TrackCollectionMerger:preDuplicateMergingGeneralTracks VSIZE 8774.16 0 RSS 6068.55 1.08984
MemoryCheck: module PrimaryVertexProducer:unsortedOfflinePrimaryVertices VSIZE 8774.16 0 RSS 6070.61 2.0625
MemoryCheck: module PFTrackProducer:lowPtGsfElePfTracks VSIZE 8774.16 0 RSS 6071.12 0.515625
MemoryCheck: module PFDisplacedVertexProducer:particleFlowDisplacedVertex VSIZE 8774.16 0 RSS 6071.38 0.257812
MemoryCheck: module TrackVertexArbitrator:trackVertexArbitrator VSIZE 8774.16 0 RSS 6073.45 2.0625
MemoryCheck: module TrackRefitter:refittedForPixelDQM VSIZE 8774.16 0 RSS 6092.27 18.8203
MemoryCheck: module MonitorTrackResiduals:MonitorTrackResiduals VSIZE 8774.16 0 RSS 6092.52 0.257812
MemoryCheck: module MuonAssociatorEDProducer:tpToTkMuonAssociation VSIZE 8774.16 0 RSS 6093.04 0.515625
MemoryCheck: module SiStripTrackingRecHitsValid:StripTrackingRecHitsValid VSIZE 8774.16 0 RSS 6105.41 12.3711
[innocent@lxplus803 relCrash]$ grep RSS step3_1T.log | tail
MemoryCheck: module PFCandidateFwdPtrCollectionPdgIdFilter:pfPileUpAllChargedParticles VSIZE 9687.91 0 RSS 6613.33 0.328125
MemoryCheck: module MuonSimClassifier:muonSimClassifier VSIZE 9687.91 0 RSS 6645.05 31.7266
MemoryCheck: module MuonAssociatorEDProducer:tpToPFMuonAssociation VSIZE 9687.91 0 RSS 6646.26 1.20703
MemoryCheck: module BDHadronTrackMonitoringAnalyzer:BDHadronTrackMonitoringAnalyze VSIZE 9687.91 0 RSS 6647.79 1.53125
MemoryCheck: module MultiTrackValidator:trackValidator VSIZE 9687.91 0 RSS 6647.9 0.105469
MemoryCheck: module GlobalRecHitsAnalyzer:globalrechitsanalyze VSIZE 9687.91 0 RSS 6655.64 7.74219
MemoryCheck: module DiDispStaMuonMonitor:hltDiDispStaMuon10CosmicVetoL3Mu0DxyMax1cmMonitoring VSIZE 9687.92 0.0078125 RSS 6655.19 -0.453125
MemoryCheck: module L1TStage2Layer2Producer:valCaloStage2Layer2DigisOffline VSIZE 9687.92 0 RSS 6655.73 0.539062
MemoryCheck: module PoolOutputModule:RECOSIMoutput VSIZE 9687.92 0 RSS 6660.89 5.16016
MemoryCheck: module PFCandidatePrimaryVertexSorter:primaryVertexWithBSAssociation VSIZE 9687.92 0 RSS 6661.4 0.511719
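As an aside, the peak RSS in such logs can be pulled out with a short script. This is only a convenience sketch; it assumes nothing beyond the SimpleMemoryCheck line format visible above ("... VSIZE <v> <dv> RSS <r> <dr>"):

```python
import re

# First number after "RSS" on a MemoryCheck line is the resident size in MB.
RSS_RE = re.compile(r"RSS\s+([0-9.]+)")

def peak_rss_mb(lines):
    """Return the largest RSS value (MB) seen in the given log lines."""
    values = [float(m.group(1)) for line in lines if (m := RSS_RE.search(line))]
    return max(values) if values else None

sample = [
    "MemoryCheck: module PoolOutputModule:RECOSIMoutput VSIZE 9687.92 0 RSS 6660.89 5.16016",
    "MemoryCheck: module NanoAODDQM:nanoDQMMC VSIZE 19215.7 0 RSS 13361.9 0.398438",
]
print(peak_rss_mb(sample))  # 13361.9
```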
Since when do we run RelVal multithreaded?
Since when do we run RelVal multithreaded?
I think since the beginning of Run 2 (since we've run multi-threaded production that long).
Anyhow, 13GB for 8 threads is within the limits.
This is the GB used for string cut parsing... (in each single thread) https://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/step_3_cfg__CMSSW_13_0_11_el8_amd64_gcc11_memTot/107
most of it in HLTMuonMatchAndPlot
Hi, do we have any news on this topic? If we need to provide any other information, please let us know.
My understanding is that it is expected. An 8-thread job may need more than 12GB (up to 16GB).
Jobs are crossing even more than 16GB, leading to crashes. I don't think we have the possibility to use more than 16GB.
Can you share a log file of the crash?
With relval submission we only have a limited log, as in [1]; it is produced with the cmsDriver commands in [2]
[1] https://cms-unified.web.cern.ch/cms-unified/showlog/?search=CMSSW_13_0_11__fullSim_PU_RV209-ZEE_14-00001 [2] https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_13_0_11__fullSim_PU_RV209-ZEE_14-00001
Sorry, without the log produced by the job itself I cannot make any analysis.
Yes, memory grows beyond 16GB within 100 events. Nothing particular. Either we increase the limit or we need to work on reducing the memory footprint of the various modules...
on lxplus it swaps 2 GB, so it stays below 16GB... (no, this is not good news)
[innocent@lxplus801 relval]$ cat /proc/462367/smaps_rollup
00400000-fffffbec0000 ---p 00000000 00:00 0 [rollup]
Rss: 14235392 kB
Pss: 14229478 kB
Shared_Clean: 6400 kB
Shared_Dirty: 64 kB
Private_Clean: 449600 kB
Private_Dirty: 13779328 kB
Referenced: 14235392 kB
Anonymous: 13807872 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
FilePmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 2153024 kB
SwapPss: 2153024 kB
Locked: 0 kB
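The rollup above makes the point: resident memory alone understates the footprint once swap is involved. Summing the Rss and Swap fields (values copied from the dump):

```python
# Values from the smaps_rollup dump above, in kB.
rss_kb = 14235392
swap_kb = 2153024

# Resident alone is ~13.6 GiB, but resident + swapped is the real footprint.
total_gib = (rss_kb + swap_kb) / 1024**2
print(round(total_gib, 2))  # 15.63
```

So the process is effectively right at the 16 GB limit, kept "below" it only by swapping.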
running the same config with CMSSW_13_0_10 makes no difference (the config, not cmsDriver). What was the latest version in which this relval was successful?
CMSSW_12 does not work with this config, as
RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx is not found anywhere in the search path.
the config contains all those models
[innocent@lxplus801 relval]$ grep onnx config.py
weightFile = cms.FileInPath('PhysicsTools/NanoAOD/data/breg_training_2018.onnx')
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/full/resnet.onnx'),
weightFile = cms.FileInPath('PhysicsTools/NanoAOD/data/creg_training_2018.onnx')
model_path = cms.FileInPath('RecoParticleFlow/PFProducer/data/mlpf/mlpf_2021_11_16__no_einsum__all_data_cms-best-of-asha-scikit_20211026_042043_178263.workergpu010.onnx'),
weightFile = cms.FileInPath('RecoMuon/MuonIdentification/data/mvaID.onnx')
mvaIDTrainingFile = cms.FileInPath('RecoMuon/MuonIdentification/data/mvaID.onnx'),
mvaIDTrainingFile = cms.FileInPath('RecoMuon/MuonIdentification/data/mvaID.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/full/resnet.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepVertex/phase1_deepvertexcombined.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDB.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDCvB.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDC.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepVertex/phase1_deepvertex.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/HiggsInteractionNet/V00/IN.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/HiggsInteractionNet/V00/IN.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepBoostedJet/V02/decorrelated/resnet.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/MD-2prong/V01/modelfile/model.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDB_mass_independent.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/BvL.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/BvL.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDCvB_mass_independent.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvB.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvB.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/94X/V01/DDC_mass_independent.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvL.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvL.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/modelfile/model.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/modelfile/model.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/modelfile/model.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Central/particle-net.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Central/particle-net.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Forward/particle-net.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Forward/particle-net.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Central/particle-net.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Central/particle-net.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Forward/particle-net.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Forward/particle-net.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetFromMiniAODAK8/particle-net.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/General/V01/modelfile/model.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/General/V01/modelfile/model.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/ParticleNetAK8/MassRegression/V01/modelfile/model.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/RobustParTAK4/PUPPI/V00/RobustParTAK4.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/RobustParTAK4/PUPPI/V00/RobustParTAK4.onnx'),
model_path = cms.FileInPath('RecoBTag/Combined/data/RobustParTAK4/PUPPI/V00/RobustParTAK4.onnx'),
The question is: WHY is the very same model opened more than once?
The previous high-stat relval was in 13_0_0 and we didn't face any issue there. For other, higher versions we ran only standard relvals, where the usual events/job is less than 100, and we saw no issue even in 13_0_10. Another piece of information, if relevant: for the no-pileup production there is no such issue in 13_0_11.
This job exceeds 16GB immediately. 13_0_0 does not run either, because RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx is not found.
So I would say that the NEW DeepJet model is a bit too large...
The previous high-stat relval was in 13_0_0 and we didn't face any issue there.
please share a link to this previous relval configuration
I made a simple change to the ONNXRuntime to make each model really global (using a global map) https://github.com/cms-sw/cmssw/compare/master...VinInn:cmssw:onnxRuntime?expand=1
The effect on the memory is essentially negligible, as many of the duplicates are actually not instantiated...
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoMuon/MuonIdentification/data/mvaID.onnx
reusing runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoMuon/MuonIdentification/data/mvaID.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx
reusing runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/DeepFlavourV04_12X_training/DeepJet_Run3_122X.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/HiggsInteractionNet/V00/IN.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetAK8/MD-2prong/V01/modelfile/model.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/DeepDoubleX/102X/V02/BvL.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvB.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/DeepDoubleX/102X/V02/CvL.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetAK4/CHS/V00/modelfile/model.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Central/particle-net.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/CHS/Forward/particle-net.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Central/particle-net.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetFromMiniAODAK4/PUPPI/Forward/particle-net.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetFromMiniAODAK8/particle-net.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetAK8/General/V01/modelfile/model.onnx
reusing runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetAK8/General/V01/modelfile/model.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/ParticleNetAK8/MassRegression/V01/modelfile/model.onnx
creating new runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/RobustParTAK4/PUPPI/V00/RobustParTAK4.onnx
reusing runtime for /cvmfs/cms.cern.ch/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_0_11/external/el8_aarch64_gcc11/data/RecoBTag/Combined/data/RobustParTAK4/PUPPI/V00/RobustParTAK4.onnx
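The "creating new / reusing" messages above come from a cache keyed by model path. As a minimal illustrative sketch of that idea (in Python; the actual change is the C++ branch linked above, and the names here are hypothetical):

```python
# Process-wide cache: one shared runtime per model path, whatever the
# number of modules that reference the same model file.
_runtime_cache = {}

def get_runtime(model_path, loader):
    """Return a shared runtime for model_path, creating it on first use."""
    rt = _runtime_cache.get(model_path)
    if rt is None:
        print(f"creating new runtime for {model_path}")
        rt = loader(model_path)
        _runtime_cache[model_path] = rt
    else:
        print(f"reusing runtime for {model_path}")
    return rt
```

In the real multithreaded framework the map lookup would of course need locking around it.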
thanks.
Notably, this one had --nStreams 2 --nThreads 8
while the 13_0_11 variant has just --nThreads 8
the latter would definitely need more memory.
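The distinction matters because streams, not threads, determine how many events are in flight at once. In a cmsRun configuration the corresponding knobs are the process options sketched below (the cmsDriver `--nThreads`/`--nStreams` flags map onto these; requires the CMSSW environment to actually run):

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")
# --nThreads 8 --nStreams 2: 8 worker threads, but only 2 concurrent
# events, so event-sized allocations are held for 2 streams, not 8.
process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(8),
    numberOfStreams = cms.untracked.uint32(2),
)
```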
Thanks @slava77 for the pointer! I see that standard relvals usually run with nStreams 2, which we do not change manually. Need to check how this one came in. Anyway, we are retesting with nStreams 2.
looking at igprof I do not see anything peculiar: most of the memory is allocated by the tracking validation... http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_13_0_11_slc7_amd64_gcc11_memTot/self
http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_13_0_11_slc7_amd64_gcc11_memTot/16 http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_13_0_11_slc7_amd64_gcc11_memTot/21
What PU is this?
@slava77 @VinInn with --nStreams 2 memory usage seems to be under control. As far as PU is concerned, we are using 2023_Fills_8807_8901_ProjectedPileup_PoissonOOTPU. From the relval side, this ticket can be closed.
Maybe Tracking POG should check whether there is not too much OOT PU in these samples (at least as far as strips and pixels are concerned) and eventually restrict the number of crossings used in the hit associator.
In my opinion RECO should understand whether 2GB per thread is still an achievable goal, or whether one needs to negotiate a higher memory allocation.
btw: here is the igprof for just one event. Again nothing special: yes, it seems dominated by string manipulation, and there is memory churn that can be avoided in ParameterSet and in the tokenizer, still...
looking at igprof I do not see anything peculiar: most of the memory is allocated by the tracking validation...
these are all MEM_TOTAL. That is informative for memory churn, but not for current use. Do you have a MEM_LIVE?
Wherever I look it seems to me that there are simply way too many PSimHit to fit in 2GB per thread.
With 6 streams it is still marginal: it can reach 16GB. With 4 streams (and 8 threads) it should be OK. Throughput at most a few % less.
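These stream counts follow from a simple model: total RSS is roughly a fixed shared part plus a per-stream event cost. The numbers below are illustrative assumptions, not measured fits, chosen only to be consistent with the observations in this thread (~6.6GB single-threaded, >16GB at 8 streams):

```python
def estimated_rss_gb(n_streams, shared_gb=4.5, per_stream_gb=2.0):
    """Toy memory model: fixed shared overhead plus per-stream event data."""
    return shared_gb + n_streams * per_stream_gb

for s in (1, 4, 6, 8):
    print(f"{s} streams -> ~{estimated_rss_gb(s):.1f} GB")
```

With these placeholder numbers, 4 streams sits comfortably below 16 GB while 6 streams is already marginal, matching the comment above.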
assign reconstruction, simulation, dqm
FYI @cms-sw/tracking-pog-l2
New categories assigned: dqm,reconstruction,simulation
@mdhildreth,@tjavaid,@micsucmed,@nothingface0,@rvenditti,@mandrenguyen,@emanueleusai,@syuvivida,@clacaputo,@civanch,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks
Maybe Tracking POG should check whether there is not too much OOT PU in these samples (at least as far as strips and pixels are concerned) and eventually restrict the number of crossings used in the hit associator.
I'm curious though which parts require the simhits. AFAIK, trackingOnly parts of the validation do not need pileup replay.
My vague recollection is that while the tracking validation indeed does not need the pileup replay, the tracker validation has components that use the "more expensive" hit associator(s). Vincenzo's profile (https://github.com/cms-sw/cmssw/issues/42281#issuecomment-1657090649) confirms this:
http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_13_0_11_slc7_amd64_gcc11_memTot/37
points to GlobalRecHitsAnalyzer, SiStripRecHitsValid, SiStripTrackingRecHitsValid, and SiPixelPhase1RecHitsV.
Is the Associator running on the full OOT range, or is it limited only to the crossings meaningful for the detector in question? (-12,+5 is huge for the tracker!)
I doubt that more than [-3,+1] is needed. But doesn't the mixing module load all crossing frames with all hits anyway? I guess the point is to make it configurable.
Indeed, there are two levels where action can be taken to avoid considering SimHits outside the "digitization" time frame: 1) in the mixing module; 2) specifically for the tracker, in https://cmssdt.cern.ch/dxr/CMSSW/source/SimTracker/TrackerHitAssociation/src/TrackerHitAssociator.cc#137
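Option 2) amounts to dropping SimHits whose bunch crossing lies outside a configurable window. A hedged sketch of that selection (Python with hypothetical field names; the real code is C++ in TrackerHitAssociator):

```python
def select_in_time(simhits, bx_min=-3, bx_max=1):
    """Keep SimHits whose bunch crossing falls within [bx_min, bx_max].

    The tracker readout window quoted in this thread is [-12, +5]; the
    suggestion above is that [-3, +1] may suffice for hit association.
    """
    return [h for h in simhits if bx_min <= h["bx"] <= bx_max]

hits = [{"bx": b} for b in (-12, -3, 0, 1, 5)]
print([h["bx"] for h in select_in_time(hits)])  # [-3, 0, 1]
```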
We observe excessive memory usage, even crossing 16GB, leading to MC relval failures [1] in CMSSW_13_0_11. The corresponding cmsDriver commands can be found in [2]. It may or may not be related to the memory issue reported with data re-processing: https://github.com/cms-sw/cmssw/issues/40437
[1] https://cms-unified.web.cern.ch/cms-unified/showlog/?search=CMSSW_13_0_11__fullSim_PU_14TeV_RV204-TTbar_14TeV-00002 [2] https://cms-pdmv.cern.ch/relval/api/relvals/get_cmsdriver/CMSSW_13_0_11__fullSim_PU_14TeV_RV204-TTbar_14TeV-00002