cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

Failures in Run 3 data reprocessing #40437

Open kskovpen opened 1 year ago

kskovpen commented 1 year ago

We are seeing failures in the ongoing Run 3 data reprocessing, presumably related to the DeepTau implementation. Here is just one example of the failure: https://cms-unified.web.cern.ch/cms-unified/report/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693

The crash message is:

Exception Message: invalid prediction = nan for tau_index = 0, pred_index = 0


valsdav commented 1 year ago

@valsdav Now that we know that DeepMET is taking the most memory by far, I'd like to ask: how is the backport of #40284 you mentioned going? Dima reported that due to an error much of the work already done on this workflow has to be redone.

Hi @sextonkennedy, I tested the backport but didn't see a large memory difference in the profiling. In the end, the DeepMET model graph itself doesn't use much memory, so I don't think this backport is crucial.

Instead, we are in contact with the Tau POG to add a couple of safeguards for the DeepTauId.
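For context, the crash message above comes from an explicit validity check on the network outputs. A minimal sketch, in Python/NumPy rather than the actual C++ of deep_tau::DeepTauBase, of the kind of safeguard under discussion (the helper name and its behaviour are hypothetical):

import numpy as np

def check_predictions(pred, tau_index):
    # Hypothetical guard: reject non-finite scores before they propagate.
    # Mirrors the check behind "invalid prediction = nan for tau_index = 0,
    # pred_index = 0": scan every output and fail explicitly instead of
    # silently storing NaN discriminators.
    for pred_index, value in enumerate(np.ravel(pred)):
        if not np.isfinite(value):
            raise ValueError(
                f"invalid prediction = {value} for tau_index = {tau_index}, "
                f"pred_index = {pred_index}")
    return pred

# Usage sketch: scores = check_predictions(model_output, tau_index=0)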

VinInn commented 1 year ago

I ran single-threaded on the "usual" two files.

It starts:

MemoryCheck: module source:source VSIZE 7062.66 0 RSS 5512.03 0.109375
MemoryCheck: module SiPixelClusterProducer:siPixelClustersPreSplitting@cpu VSIZE 7062.66 0 RSS 5512.52 0.488281
MemoryCheck: module HitPairEDProducer:initialStepHitDoubletsPreSplitting VSIZE 7062.66 0 RSS 5514.52 2
MemoryCheck: module SiStripRecHitConverter:siStripMatchedRecHits VSIZE 7062.66 0 RSS 5518.53 4.01172
MemoryCheck: module MkFitSiStripHitConverter:mkFitSiStripHits VSIZE 7062.66 0 RSS 5524.53 6
MemoryCheck: module TrackProducer:initialStepTracksPreSplitting VSIZE 7062.66 0 RSS 5524.54 0.0078125
MemoryCheck: module TrackTfClassifier:initialStep VSIZE 7062.66 0 RSS 5524.57 0.03125
MemoryCheck: module CAHitTripletEDProducer:highPtTripletStepHitTriplets VSIZE 7062.66 0 RSS 5532.57 8
MemoryCheck: module CAHitTripletEDProducer:lowPtTripletStepHitTriplets VSIZE 7062.66 0 RSS 5532.57 0.00390625
MemoryCheck: module CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets VSIZE 7062.66 0 RSS 5532.58 0.00390625

and ends with:

[innocent@olspx-01 tauCrash]$ grep 'RSS' fullJob1T.log | tail
MemoryCheck: module ShiftedParticleMETcorrInputProducer:shiftedPatMETCorrUnclusteredEnUpPuppi VSIZE 10222.8 0 RSS 7051.5 0.0625
MemoryCheck: module ShiftedParticleProducer:shiftedPatUnclusteredEnDownPuppi VSIZE 10222.8 0 RSS 7051.56 0.0625
MemoryCheck: module PuppiProducer:packedpuppi VSIZE 10222.8 0 RSS 7051.59 0.0351562
MemoryCheck: module CandPtrSelector:pfCHS VSIZE 10222.8 0 RSS 7051.6 0.00390625
MemoryCheck: module DeepMETProducer:deepMETsResolutionTune VSIZE 10222.8 0 RSS 7053.73 2.13672
MemoryCheck: module PATLostTracks:lostTracks VSIZE 10222.8 0 RSS 7053.96 0.230469
MemoryCheck: module BoostedJetONNXJetTagsProducer:pfParticleNetAK4JetTagsSlimmedPuppiWithDeepTags VSIZE 10222.8 0 RSS 7053.99 0.0234375
MemoryCheck: module DeepBoostedJetTagInfoProducer:pfHiggsInteractionNetTagInfosSlimmedAK8DeepTags VSIZE 10222.8 0 RSS 7054.01 0.0234375
MemoryCheck: module DisplacedMuonFilterProducer:filteredDisplacedMuons VSIZE 10222.8 0 RSS 7054.02 0.0078125
MemoryCheck: module PoolOutputModule:MINIAODoutput VSIZE 10222.8 0 RSS 7054.02 0.00390625


The largest steps are:

[innocent@olspx-01 tauCrash]$ grep 'RSS' fullJob1T.log | awk '{print $9, $_}' | sort -n | tail -n 20
6.75 MemoryCheck: module SiStripRecHitConverter:siStripMatchedRecHits VSIZE 8726.72 0 RSS 6278.86 6.75
7.52344 MemoryCheck: module SiStripRecHitConverter:siStripMatchedRecHits VSIZE 8726.72 0 RSS 5950.88 7.52344
8 MemoryCheck: module CAHitQuadrupletEDProducer:initialStepHitQuadruplets VSIZE 10222.8 0 RSS 6773.98 8
8 MemoryCheck: module CAHitQuadrupletEDProducer:initialStepHitQuadruplets VSIZE 8726.72 0 RSS 5958.88 8
8 MemoryCheck: module CAHitQuadrupletEDProducer:initialStepHitQuadruplets VSIZE 8726.72 0 RSS 6019.27 8
8 MemoryCheck: module CAHitTripletEDProducer:highPtTripletStepHitTriplets VSIZE 7062.66 0 RSS 5532.57 8
8.22266 MemoryCheck: module CAHitQuadrupletEDProducer:initialStepHitQuadruplets VSIZE 7830.71 0 RSS 5699.56 8.22266
8.43359 MemoryCheck: module MkFitProducer:initialStepTrackCandidatesMkFitPreSplitting VSIZE 7830.71 0 RSS 5624.93 8.43359
8.80078 MemoryCheck: module MuonIdProducer:earlyMuons VSIZE 10222.8 0 RSS 6787.71 8.80078
8.82422 MemoryCheck: module MkFitSiStripHitConverter:mkFitSiStripHits VSIZE 7830.71 0 RSS 5327.18 8.82422
9.14844 MemoryCheck: module PATPackedCandidateProducer:packedPFCandidates VSIZE 7830.71 0 RSS 5567.87 9.14844
9.41016 MemoryCheck: module PoolOutputModule:SKIMStreamLogError VSIZE 8726.72 0 RSS 6058.22 9.41016
9.48047 MemoryCheck: module HitPairEDProducer:detachedQuadStepHitDoublets VSIZE 8726.72 0 RSS 6555.3 9.48047
9.53516 MemoryCheck: module HitPairEDProducer:detachedQuadStepHitDoublets VSIZE 7830.72 0 RSS 5688.38 9.53516
12.1641 MemoryCheck: module HitPairEDProducer:detachedQuadStepHitDoublets VSIZE 8726.72 0 RSS 6302.84 12.1641
12.3477 MemoryCheck: module PoolOutputModule:MINIAODoutput VSIZE 8726.72 0 RSS 5757.42 12.3477
15.6133 MemoryCheck: module SiStripRecHitConverter:siStripMatchedRecHits VSIZE 7830.72 0 RSS 5667.5 15.6133
17.7539 MemoryCheck: module SiStripRecHitConverter:siStripMatchedRecHits VSIZE 7830.67 0 RSS 5546.21 17.7539
25.1328 MemoryCheck: module SiStripRecHitConverter:siStripMatchedRecHits VSIZE 9174.81 0 RSS 6644.27 25.1328
153.078 MemoryCheck: module PoolOutputModule:MINIAODoutput VSIZE 8726.72 0 RSS 5941.36 153.078


There are, of course, also negative steps:

[innocent@olspx-01 tauCrash]$ grep 'RSS' fullJob1T.log | awk '{print $9, $_}' | sort -n | head -n 20
-242.188 MemoryCheck: module PhotonMonitor:DiPhoton10sminlt0p1_monitoring VSIZE 7830.71 0.0078125 RSS 5315.45 -242.188
-116.082 MemoryCheck: module CAHitQuadrupletEDProducer:initialStepHitQuadruplets VSIZE 10222.8 1048 RSS 6700.23 -116.082
-72.3047 MemoryCheck: module CkfTrackCandidateMaker:pixelLessStepTrackCandidates VSIZE 8726.72 896 RSS 5728.98 -72.3047
-60.5078 MemoryCheck: module TopMonitor:diMu9Ele9CaloIdLTrackIdL_muleg VSIZE 7830.72 0.0078125 RSS 5644.09 -60.5078
-22.0352 MemoryCheck: module CandPtrSelector:pfElectrons VSIZE 7830.7 0.0234375 RSS 5528.34 -22.0352
-14.4844 MemoryCheck: module OutsideInMuonSeeder:muonSeededSeedsOutIn VSIZE 7062.67 0.0078125 RSS 5518.09 -14.4844
-2.03906 MemoryCheck: module CAHitQuadrupletEDProducer:initialStepHitQuadrupletsPreSplitting VSIZE 7830.67 768 RSS 5528.46 -2.03906
-0.738281 MemoryCheck: module ObjMonitor:HMesonGammamonitoring VSIZE 7830.71 0.0117188 RSS 5538.82 -0.738281
0.00390625 MemoryCheck: module BoostedJetONNXJetTagsProducer:pfParticleNetJetTagsSlimmedAK8DeepTags VSIZE 10222.8 0 RSS 6702.37 0.00390625
0.00390625 MemoryCheck: module BoostedJetONNXJetTagsProducer:pfParticleNetJetTagsSlimmedAK8DeepTags VSIZE 7830.71 0 RSS 5374.23 0.00390625
0.00390625 MemoryCheck: module BoostedJetONNXJetTagsProducer:pfParticleNetJetTagsSlimmedAK8DeepTags VSIZE 7830.71 0 RSS 5487.02 0.00390625
0.00390625 MemoryCheck: module BoostedJetONNXJetTagsProducer:pfParticleNetMassRegressionJetTagsSlimmedAK8DeepTags VSIZE 10222.8 0 RSS 6912.33 0.00390625
0.00390625 MemoryCheck: module BoostedJetONNXJetTagsProducer:pfParticleNetMassRegressionJetTagsSlimmedAK8DeepTags VSIZE 7830.71 0 RSS 5688.64 0.00390625
0.00390625 MemoryCheck: module BoostedJetONNXJetTagsProducer:pfParticleNetMassRegressionJetTagsSlimmedAK8DeepTags VSIZE 8726.72 0 RSS 6389.01 0.00390625
0.00390625 MemoryCheck: module CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets VSIZE 7062.66 0 RSS 5532.58 0.00390625
0.00390625 MemoryCheck: module CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets VSIZE 8726.72 0 RSS 6133.32 0.00390625
0.00390625 MemoryCheck: module CAHitQuadrupletEDProducer:initialStepHitQuadruplets VSIZE 7830.71 0 RSS 5577.62 0.00390625
0.00390625 MemoryCheck: module CAHitQuadrupletEDProducer:initialStepHitQuadruplets VSIZE 8726.72 0 RSS 6575.53 0.00390625
0.00390625 MemoryCheck: module CAHitQuadrupletEDProducer:lowPtQuadStepHitQuadruplets VSIZE 10222.8 0 RSS 6778.4 0.00390625
0.00390625 MemoryCheck: module CAHitQuadrupletEDProducer:lowPtQuadStepHitQuadruplets VSIZE 8726.72 0 RSS 6113.58 0.00390625

VinInn commented 1 year ago

and this is the plot for one thread: [image]

One would say that there is a tiny leak somewhere... (an increase of ~1.5 GB over ~16K events, i.e. roughly 100 kB/event).

VinInn commented 1 year ago

I ran the job (4 threads) with cmsRunTC; the "leak" is still there. It just uses less memory overall, apparently.

[innocent@olspx-01 tauCrash]$ grep RSS fullJobTC.log | head
MemoryCheck: module PFPileUp:pfPileUpJMENoHF VSIZE 7608.57 0 RSS 6280.66 0.00390625
MemoryCheck: module MultShiftMETcorrInputProducer:patPFMetTxyCorrNoHF VSIZE 7608.57 0 RSS 6280.69 0.03125
MemoryCheck: module SeedingLayersEDProducer:stripPairElectronSeedLayers VSIZE 7608.57 0 RSS 6280.93 0.246094
MemoryCheck: module HLTFiltersDQMonitor:hltFiltersDQM VSIZE 7608.57 0 RSS 6280.94 0.00390625
MemoryCheck: module SiPixelClusterProducer:siPixelClustersPreSplitting@cpu VSIZE 7608.57 0 RSS 6281.1 0.160156
MemoryCheck: module GsfTrackProducer:electronGsfTracks VSIZE 7618.16 9.59375 RSS 6286.13 5.03125
MemoryCheck: module PFTrackProducer:pfTrack VSIZE 7618.16 0 RSS 6290.13 4
MemoryCheck: module CAHitQuadrupletEDProducer:initialStepHitQuadrupletsPreSplitting VSIZE 7618.16 0 RSS 6290.63 0.5
MemoryCheck: module PFElecTkProducer:lowPtGsfElePfGsfTracks VSIZE 7618.16 0 RSS 6290.88 0.253906
MemoryCheck: module SiStripRecHitConverter:siStripMatchedRecHits VSIZE 7644.27 26.1094 RSS 6311.76 20.875
[innocent@olspx-01 tauCrash]$ grep RSS fullJobTC.log | tail
MemoryCheck: module TrackProducer:lowPtTripletStepTracks VSIZE 9512.59 0 RSS 8183.92 2
MemoryCheck: module SeedCombiner:pixelPairStepSeeds VSIZE 9512.59 0 RSS 8184.02 0.101562
MemoryCheck: module MkFitSeedConverter:detachedTripletStepTrackCandidatesMkFitSeeds VSIZE 9512.59 0 RSS 8184.05 0.0273438
MemoryCheck: module PATPackedCandidateProducer:packedPFCandidates VSIZE 9512.59 0 RSS 8184.05 0.00390625
MemoryCheck: module SiPixelPhase1Clusters:SiPixelPhase1ClustersAnalyzer VSIZE 9512.59 0 RSS 8184.05 0.00390625
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:stripPairElectronSeeds VSIZE 9512.59 0 RSS 8184.07 0.0195312
MemoryCheck: module MuonIdProducer:earlyMuons VSIZE 9512.59 0 RSS 8184.09 0.015625
MemoryCheck: module HitPairEDProducer:detachedQuadStepHitDoublets VSIZE 9512.59 0 RSS 8184.19 0.0976562
MemoryCheck: module CandSecondaryVertexProducer:pfInclusiveSecondaryVertexFinderTagInfosPuppi VSIZE 9512.59 0 RSS 8184.21 0.0195312
MemoryCheck: module PFProducer:particleFlowTmp VSIZE 9512.59 0 RSS 8184.21 0.00390625
[innocent@olspx-01 tauCrash]$
VinInn commented 1 year ago

plot: [image]

The very large drop is in the middle of nowhere (VSIZE does not change; I cannot exclude that the memory was moved to swap):

Begin processing the 1925th record. Run 357899, Event 201822059, LumiSection 149 on stream 3 at 16-Jul-2023 10:41:04.300 CEST
%MSG-w MemoryCheck:  CkfTrackCandidateMaker:pixelLessStepTrackCandidates  16-Jul-2023 10:41:04 CEST Run: 357899 Event: 202644708
MemoryCheck: module CkfTrackCandidateMaker:pixelLessStepTrackCandidates VSIZE 8872.13 0 RSS 7605.2 0.0390625
%MSG

.........

Begin processing the 2226th record. Run 357899, Event 201928663, LumiSection 149 on stream 1 at 16-Jul-2023 10:46:54.908 CEST
Begin processing the 2227th record. Run 357899, Event 201965148, LumiSection 149 on stream 0 at 16-Jul-2023 10:46:55.090 CEST
%MSG-w L1TTauOffline:  L1TTauOffline:l1tTauOfflineDQMEmu  16-Jul-2023 10:46:55 CEST Run: 357899 Event: 201695216
invalid collection: reco::PFTauDiscriminator
%MSG
%MSG-w L1TTauOffline:  L1TTauOffline:l1tTauOfflineDQM 16-Jul-2023 10:46:55 CEST  Run: 357899 Event: 201695216
invalid collection: reco::PFTauDiscriminator
%MSG
Begin processing the 2228th record. Run 357899, Event 201968796, LumiSection 149 on stream 3 at 16-Jul-2023 10:46:55.448 CEST
%MSG-w MemoryCheck:  CkfTrackCandidateMaker:lowPtQuadStepTrackCandidates  16-Jul-2023 10:46:56 CEST Run: 357899 Event: 201928663
MemoryCheck: module CkfTrackCandidateMaker:lowPtQuadStepTrackCandidates VSIZE 8891.51 19.375 RSS 7070.07 -535.129
%MSG
%MSG-w MemoryCheck:  CAHitTripletEDProducer:detachedTripletStepHitTriplets  16-Jul-2023 10:46:56 CEST Run: 357899 Event: 201925040
MemoryCheck: module CAHitTripletEDProducer:detachedTripletStepHitTriplets VSIZE 8891.51 0 RSS 7070.58 0.515625
VinInn commented 1 year ago

cmsRun (jemalloc) report

TimeReport> Time report complete in 19861.3 seconds
 Time Summary:
 - Min event:   0.352757
 - Max event:   20.3359
 - Avg event:   4.84248
 - Total loop:  19821.7
 - Total init:  39.5779
 - Total job:   19861.3
 - EventSetup Lock: 0
 - EventSetup Get:  0
 Event Throughput: 0.820413 ev/s
 CPU Summary:
 - Total loop:     78646.4
 - Total init:     32.9849
 - Total extra:    0
 - Total children: 0.041066
 - Total job:      78679.4
 Processing Summary:
 - Number of Events:  16262
 - Number of Global Begin Lumi Calls:  2
 - Number of Global Begin Run Calls: 1

MemoryReport> Peak virtual size 13701.3 Mbytes
 Key events increasing vsize:
[281] run: 357899 lumi: 149 event: 201572759  vsize = 11767.2 deltaVsize = 384 rss = 7341.03 delta = -18.6836
[2] run: 357899 lumi: 149 event: 202107090  vsize = 10915.1 deltaVsize = 416.035 rss = 7359.71 delta = 100.586
[1078] run: 357899 lumi: 149 event: 201586305  vsize = 12407.2 deltaVsize = 512 rss = 7915.91 delta = 556.203
[12904] run: 357899 lumi: 150 event: 203999990  vsize = 13701.3 deltaVsize = 448 rss = 8751.31 delta = 835.395
[3051] run: 357899 lumi: 149 event: 202024379  vsize = 13061.2 deltaVsize = 256 rss = 8192.34 delta = 276.422
[12906] run: 357899 lumi: 150 event: 202814411  vsize = 13701.3 deltaVsize = 0 rss = 8686.52 delta = -64.793
[12905] run: 357899 lumi: 150 event: 202694028  vsize = 13701.3 deltaVsize = 0 rss = 8705.16 delta = -46.1445
[12904] run: 357899 lumi: 150 event: 203999990  vsize = 13701.3 deltaVsize = 448 rss = 8751.31 delta = -14.6133

cmsRunTC report

TimeReport> Time report complete in 19843.7 seconds
 Time Summary:
 - Min event:   0.377807
 - Max event:   23.179
 - Avg event:   4.826
 - Total loop:  19809.3
 - Total init:  34.3873
 - Total job:   19843.7
 - EventSetup Lock: 0
 - EventSetup Get:  0
 Event Throughput: 0.820927 ev/s
 CPU Summary:
 - Total loop:     78809.1
 - Total init:     32.4721
 - Total extra:    0
 - Total children: 0.039804
 - Total job:      78841.6
 Processing Summary:
 - Number of Events:  16262
 - Number of Global Begin Lumi Calls:  2
 - Number of Global Begin Run Calls: 1

MemoryReport> Peak virtual size 9512.59 Mbytes
 Key events increasing vsize:
[2] run: 357899 lumi: 149 event: 202107090  vsize = 7659.48 deltaVsize = 50.918 rss = 6337.13 delta = 56.4766
[190] run: 357899 lumi: 149 event: 202176992  vsize = 8125.55 deltaVsize = 61.0391 rss = 6819.08 delta = 481.949
[900] run: 357899 lumi: 149 event: 202115248  vsize = 8284.76 deltaVsize = 121.602 rss = 7001.26 delta = 182.18
[7974] run: 357899 lumi: 150 event: 202699090  vsize = 9468.71 deltaVsize = 113.117 rss = 8179.31 delta = 1178.05
[3674] run: 357899 lumi: 149 event: 201692312  vsize = 9247.53 deltaVsize = 95.75 rss = 7670.17 delta = 668.91
[14528] run: 357899 lumi: 150 event: 203606119  vsize = 9512.59 deltaVsize = 0 rss = 8184.05 delta = 0.207031
[14527] run: 357899 lumi: 150 event: 203649894  vsize = 9512.59 deltaVsize = 0 rss = 8183.7 delta = -0.148438
[14526] run: 357899 lumi: 150 event: 203646232  vsize = 9512.59 deltaVsize = 43.875 rss = 8183.84 delta = 4.53125
VinInn commented 1 year ago

So maybe it is better to switch to TCMalloc?

VinInn commented 1 year ago

igprof results for 1000 events:

http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_12_4_14_patch1_el8_amd64_gcc10_memTot/self
http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_12_4_14_patch1_el8_amd64_gcc10_memLive/self
http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_12_4_14_patch1_el8_amd64_gcc10_memPeak/self

Nothing particularly obvious (ROOT, LZMA, TensorFlow).

VinInn commented 1 year ago

most of the "cling" allocation comes from

[HLTMuonOfflineAnalyzer::dqmBeginRun(edm::Run const&, edm::EventSetup const&)](http://innocent.web.cern.ch/innocent/perfResults/igprof-navigator/config__CMSSW_12_4_14_patch1_el8_amd64_gcc10_memLive/75)
VinInn commented 1 year ago

The cmsRunTC job is fully reproducible, including the -500 MB drop.

Dr15Jones commented 1 year ago

@VinInn for your igprof values, are you getting one report for the entire job or are you using IgProfService to dump the state of the job periodically? I'm presently running a job on cmsdev31 which is dumping igprof memory logs every 100 events (running single threaded).
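A sketch of how such periodic dumps are configured, assuming the IgProfService parameter names I recall (reportFirstEvent, reportEventInterval, reportToFileAtPostEvent) are still current; the process name is a stand-in:

import FWCore.ParameterSet.Config as cms

process = cms.Process("REPRO")  # stand-in for the actual job's process

# Dump an igprof statistics file after every 100th event; %I is expanded
# by the service to the report's sequence number (assumed behaviour).
process.IgProfService = cms.Service("IgProfService",
    reportFirstEvent        = cms.untracked.int32(1),
    reportEventInterval     = cms.untracked.int32(100),
    reportToFileAtPostEvent = cms.untracked.string("| gzip -c > igprof.%I.gz"),
)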

Dr15Jones commented 1 year ago

So running the job with the IgProfService dumping MEM_LIVE every 100 events, it looks like xrootd has a ~500 byte/event memory leak coming from its use of CRYPTO_malloc. I don't know if that rate of leak is large enough to account for what is being seen (at 500 bytes/event, even the full 16262-event job would accumulate only ~8 MB).

Dr15Jones commented 1 year ago

So running the job on cmsdev31, I see from the log that the RSS increased by 100MB over 300 events, so averaging 300kB/event memory increase. This is much larger than the amount igprof MEM_LIVE is reporting.

VinInn commented 1 year ago

My igprof runs were for 1000 events w/o the service (and I was running on local files).

VinInn commented 1 year ago

The xrootd leak may explain why the memory kept increasing while xrootd was trying to open the second file in the original job reported by @drkovalskyi.

Dr15Jones commented 1 year ago

Then out of curiosity, did anyone try to download the files locally and run without xrootd?

VinInn commented 1 year ago

All my tests are with local files (I'm running on a machine w/o EOS, grid, AFS).

VinInn commented 1 year ago

The "culprit" is sc8 that is more aggressive in creating THP. more evidence to come...

VinInn commented 1 year ago

BTW, out of ~15 GB of raw data we write all this:

[innocent@olspx-01 tauCrash]$ edmFileUtil SKIMStreamLogError.root
SKIMStreamLogError.root
SKIMStreamLogError.root (1 runs, 2 lumis, 248 events, 1155649412 bytes)
[innocent@olspx-01 tauCrash]$ edmFileUtil AODoutput.root
AODoutput.root
AODoutput.root (1 runs, 2 lumis, 16262 events, 6340100988 bytes)
[innocent@olspx-01 tauCrash]$ edmFileUtil MINIAODoutput_zstd.root
MINIAODoutput_zstd.root
MINIAODoutput_zstd.root (1 runs, 2 lumis, 16262 events, 898274867 bytes)

what happens if we get 100 times more errors?
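For scale, a quick back-of-the-envelope on the per-event sizes implied by the edmFileUtil numbers above (pure arithmetic, no new data):

# Bytes and event counts taken from the edmFileUtil output above.
outputs = {
    "SKIMStreamLogError": (1_155_649_412, 248),
    "AOD":                (6_340_100_988, 16_262),
    "MINIAOD (zstd)":     (  898_274_867, 16_262),
}
for name, (size, events) in outputs.items():
    print(f"{name}: {size / events / 1024:.0f} kB/event")
# -> SKIMStreamLogError ~4551 kB/event, AOD ~381 kB/event, MINIAOD ~54 kB/event.
# 100x more errors would scale the ~1.16 GB error skim to >100 GB,
# dwarfing the AOD output itself.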

Dr15Jones commented 1 year ago

So cmsdev31 (which is what I've been using) is an sl7 machine. After 2800 events using 1 thread (my job failed after that due to an xrootd permission failure), the memory had increased by 1.2 GB.

Doing a difference on igprof MEM_LIVE output from event 110 to event 2810 I see

Flat profile (self different entries only)

delta %         Self        Calls  Function
  10.13    8'388'608            1  onnxruntime::utils::DefaultAlloc(unsigned long) [9]
  34.09      230'612        1'848  CRYPTO_malloc [34]

The onnxruntime call appeared somewhere after event 1000. The amount of extra memory reported by igprof (~8.6 MB in total here) is tiny compared to what the OS is reporting.

VinInn commented 1 year ago

It seems to me that there are at least two different issues: 1) the baseline, which is large and can be very large on some machines (and apparently scales up with the number of threads); 2) some sort of memory growth, in particular at the beginning of the run (in the first few thousand events).

This is what I collected so far using this "logfile parsing script" (a plotting sketch for its output follows below):

# For each "Begin processing" line, grab the RSS value (field $8) of the
# immediately preceding MemoryCheck line and the event ordinal (field $4,
# letters stripped), printed as two numpy-style arrays.
echo $1
echo 'rss = np.array(['
egrep 'RSS|processing' $1 | grep -B1 'processing' | grep 'RSS' -A1 | grep 'RSS' | awk '{print $8}' | tr '\n' ','
echo '])'
echo 'event = np.array(['
egrep 'RSS|processing' $1 | grep -B1 'processing' | grep 'RSS' -A1 | grep 'processing' | awk '{print $4}' | tr '\n' ',' | sed 's/[a-z]//g'
echo '])'
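The printed arrays can be pasted into a short Python session to make the RSS-vs-event plot; a minimal sketch (placeholder values stand in for the real arrays):

import numpy as np
import matplotlib.pyplot as plt

# Paste the two arrays emitted by the parsing script here.
rss = np.array([5512.03, 5512.52])   # MB, placeholder values
event = np.array([1, 2])             # event numbers, placeholder values

plt.plot(event, rss, ".", markersize=2)
plt.xlabel("event number")
plt.ylabel("RSS [MB]")
plt.show()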

[image]

I will later run an OS memory monitor asynchronously:

#!/usr/bin/csh
# Every 10 s: find the cmsRun pid, dump the anonymous-memory counters
# (AnonPages, AnonHugePages) from /proc/meminfo, then the summed Rss and
# Pss of the process from its smaps, then a ps -v snapshot.
while (1)
set pid = `ps -af | grep $USER | grep cmsRun | grep -v grep | awk '{printf $2" "}'`
grep Anon /proc/meminfo;
awk '/^Rss/ {pss += $2} END {print pss}' /proc/$pid/smaps ; awk '/^Pss/ {pss += $2} END {print pss}' /proc/$pid/smaps ; ps -v $pid
sleep 10
end
Dr15Jones commented 1 year ago

@VinInn nice plot!

some sort of memory grow in particular at the beginning of the run (in the first few thousands events)

As a guess, this is probably from ROOT as it is filling baskets. For some output types we tell ROOT to flush after 900 events. So after 900 events it would flush small buffers (and therefore set its upper limit for those basket sizes for the rest of the job).

Dr15Jones commented 1 year ago

As for the amount of memory at startup being proportional to the number of concurrent events, that would most likely be from the fact that any edm::stream modules in the job get N copies, where N is the number of concurrent events. For my single-threaded job, the MEM_LIVE values for the edm::stream::EDProducers that hold the most memory are

[189]       9.8   63'069'413            0 / 63'069'413         2'262                edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*)
            5.8  ...........   37'748'736 / 37'748'736             6 / 6              BoostedJetONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) [308]
            2.6  ...........   16'777'216 / 16'777'216             3 / 3              DeepFlavourONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) [483]
            0.9  ...........    6'091'960 / 6'091'960            260 / 260            deep_tau::DeepTauBase::produce(edm::Event&, edm::EventSetup const&) [811]
            0.2  ...........    1'048'576 / 1'048'576              1 / 1              pat::PATMuonProducer::produce(edm::Event&, edm::EventSetup const&) [1541]
            0.1  ...........      754'525 / 754'525              146 / 146            TrackMVAClassifierBase::produce(edm::Event&, edm::EventSetup const&) [1695]
            0.0  ...........      157'986 / 157'986              116 / 116            DeepMETProducer::produce(edm::Event&, edm::EventSetup const&) [2659]
            0.0  ...........      147'120 / 147'120               27 / 27             GEDPhotonProducer::produce(edm::Event&, edm::EventSetup const&) [2698]
            0.0  ...........      114'836 / 114'836               45 / 45             GsfElectronProducer::produce(edm::Event&, edm::EventSetup const&) [2809]

An edm::stream::EDAnalyzer using a lot of memory is

           22.4  ...........  145'164'967 / 145'165'055        9'407 / 9'408          edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) [93]
[94]       22.4  145'164'967            0 / 145'164'967        9'407                DQMEDAnalyzer::beginRun(edm::Run const&, edm::EventSetup const&)
           22.4  ...........  145'164'967 / 145'164'967        9'407 / 9'407          HLTMuonOfflineAnalyzer::dqmBeginRun(edm::Run const&, edm::EventSetup const&) [95]

As for modules that allocate lots of memory in their constructors:

[46]       51.7  334'380'432            0 / 334'380'432       65'039                edm::Maker::makeModule(edm::MakeModuleParams const&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&) const
           18.4  ...........  119'287'934 / 119'287'934       29'399 / 29'399         edm::stream::ProducingModuleAdaptorBase<edm::stream::EDFilterBase>::doPreallocate(edm::PreallocationConfiguration const&) [117]
           17.2  ...........  111'331'415 / 111'331'415       17'547 / 17'547         edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::doPreallocate(edm::PreallocationConfiguration const&) [130]
            5.7  ...........   36'711'302 / 36'711'302         3'979 / 3'979          edm::maker::ModuleHolderT<edm::stream::EDProducerAdaptorBase>::registerProductsAndCallbacks(edm::ProductRegistry*) [315]
            4.3  ...........   27'910'727 / 27'910'727         8'730 / 8'730          edm::WorkerMaker<TopSingleLeptonDQM>::makeModule(edm::ParameterSet const&) const [347]
            4.1  ...........   26'215'012 / 26'215'012            80 / 80             edm::WorkerMaker<BoostedJetONNXJetTagsProducer>::makeModule(edm::ParameterSet const&) const [363]
            1.0  ...........    6'291'624 / 6'291'624             18 / 18             edm::WorkerMaker<DeepFlavourONNXJetTagsProducer>::makeModule(edm::ParameterSet const&) const [799]
            0.5  ...........    3'145'848 / 3'145'848             18 / 18             edm::WorkerMaker<DeepDoubleXONNXJetTagsProducer>::makeModule(edm::ParameterSet const&) const [1069]
            0.3  ...........    1'828'712 / 1'828'712             68 / 68             edm::WorkerMaker<(anonymous namespace)::DeepFlavourJetTagsProducer>::makeModule(edm::ParameterSet const&) const [1250]

where the stream 'adaptor' constructors are actually calling

[117]      18.4  119'287'934            0 / 119'287'934       29'399                edm::stream::ProducingModuleAdaptorBase<edm::stream::EDFilterBase>::doPreallocate(edm::PreallocationConfiguration const&)
            9.5  ...........   61'660'282 / 61'660'282        13'540 / 13'540         edm::stream::ProducingModuleAdaptor<ByMultiplicityEventFilter<MultiplicityPair<ClusterSummarySingleMultiplicity, ClusterSummarySingleMultiplicity> >, edm::stream::EDFilterBase, edm::stream::EDFilterAdaptorBase>::setupStreamModules() [190]
            4.1  ...........   26'616'765 / 26'616'765         3'783 / 3'783          edm::stream::ProducingModuleAdaptor<SingleObjectSelectorBase<std::vector<reco::Vertex, std::allocator<reco::Vertex> >, StringCutObjectSelector<reco::Vertex, false>, edm::stream::EDFilter<>, [cut]>, edm::stream::EDFilterBase, edm::stream::EDFilterAdaptorBase>::setupStreamModules() [358]
            3.3  ...........   21'316'024 / 21'316'024         6'907 / 6'907          edm::stream::ProducingModuleAdaptor<SingleObjectSelectorBase<edm::View<reco::Muon>, StringCutObjectSelector<reco::Muon, false>, [cut], edm::stream::EDFilterBase, edm::stream::EDFilterAdaptorBase>::setupStreamModules() [424]
            0.6  ...........    4'092'210 / 4'092'210            970 / 970            edm::stream::ProducingModuleAdaptor<SingleObjectSelectorBase<std::vector<reco::Track, std::allocator<reco::Track> >, StringCutObjectSelector<reco::Track, false>, [cut], edm::stream::EDFilterBase, edm::stream::EDFilterAdaptorBase>::setupStreamModules() [947]
            0.5  ...........    3'470'937 / 3'470'937            652 / 652            edm::stream::ProducingModuleAdaptor<SingleObjectSelectorBase<std::vector<pat::TriggerObjectStandAlone, std::allocator<pat::TriggerObjectStandAlone> >, StringCutObjectSelector<pat::TriggerObjectStandAlone, false>,  [cut], edm::stream::EDFilterBase, edm::stream::EDFilterAdaptorBase>::setupStreamModules() [1002]
            0.3  ...........    1'855'714 / 1'855'714          1'051 / 1'051          edm::stream::ProducingModuleAdaptor<SingleObjectSelectorBase<std::vector<reco::PFJet, std::allocator<reco::PFJet> >, StringCutObjectSelector<reco::PFJet, false>, [cut], edm::stream::EDFilterBase, edm::stream::EDFilterAdaptorBase>::setupStreamModules() [1245]

and

[130]      17.2  111'331'415            0 / 111'331'415       17'547                edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::doPreallocate(edm::PreallocationConfiguration const&)
            9.8  ...........   63'652'294 / 63'652'294         8'104 / 8'104          edm::stream::ProducingModuleAdaptor<pat::PATIsolatedTrackProducer, edm::stream::EDProducerBase, edm::stream::EDProducerAdaptorBase>::setupStreamModules() [181]
            4.0  ...........   25'808'933 / 25'808'933         3'423 / 3'423          edm::stream::ProducingModuleAdaptor<VersionedIdProducer<edm::Ptr<reco::Photon>, VersionedSelector<edm::Ptr<reco::Photon> > >, edm::stream::EDProducerBase, edm::stream::EDProducerAdaptorBase>::setupStreamModules() [368]
            2.3  ...........   15'088'998 / 15'088'998         2'153 / 2'153          edm::stream::ProducingModuleAdaptor<RecoTauPiZeroProducer, edm::stream::EDProducerBase, edm::stream::EDProducerAdaptorBase>::setupStreamModules() [501]
            0.6  ...........    3'898'264 / 3'898'264            344 / 344            edm::stream::ProducingModuleAdaptor<DeepTauId, edm::stream::EDProducerBase, edm::stream::EDProducerAdaptorBase>::setupStreamModules() [956]
            0.2  ...........    1'100'742 / 1'100'742            174 / 174            edm::stream::ProducingModuleAdaptor<RecoTauProducer, edm::stream::EDProducerBase, edm::stream::EDProducerAdaptorBase>::setupStreamModules() [1454]

It looks like the very heavy use of the 'object selector' code is driving the large memory usage. That memory is from cling's (clang-based) JIT system.

VinInn commented 1 year ago

On Jul 18, 2023, at 5:20 PM, Chris Jones @.***> wrote:

It looks like the very large use of the 'object selector' code is driving the large memory usage. That memory is from clang's JIT system.

Yep. I noticed in particular this:

https://cmssdt.cern.ch/dxr/CMSSW/source/DQMOffline/Trigger/plugins/HLTMuonOfflineAnalyzer.cc#105 that is filling a large number of https://cmssdt.cern.ch/dxr/CMSSW/source/DQMOffline/Trigger/interface/HLTMuonMatchAndPlot.h?from=HLTMuonMatchAndPlot&case=true#

in particular StringCutObjectSelector triggerSelector_;

v.

Dr15Jones commented 1 year ago

As a guess, this is probably from ROOT as it is filling baskets. For some output types we tell ROOT to flush after 900 events. So after 900 events it would flush small buffers (and therefore set its upper limit for those basket sizes for the rest of the job).

I looked at the configuration for all the output modules. Almost all are configured to flush after the compressed output buffer reaches 5MB. The AOD file is set to flush after about 300MB of compressed buffer size has been reached.

The MiniAOD output is the only one set to flush after 900 events, rather than based on the size of the compressed buffer.
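This knob lives on the output module itself; a sketch of the two policies described above (eventAutoFlushCompressedSize is a real PoolOutputModule parameter; my recollection is that a negative value means "flush every N events" — check Configuration/EventContent for the actual production values):

import FWCore.ParameterSet.Config as cms

# Hypothetical illustration of the two flushing policies discussed above.
aodOutput = cms.OutputModule("PoolOutputModule",
    fileName = cms.untracked.string("AOD.root"),
    # flush once ~300 MB of compressed basket data has accumulated
    eventAutoFlushCompressedSize = cms.untracked.int32(300 * 1024 * 1024),
)
miniaodOutput = cms.OutputModule("PoolOutputModule",
    fileName = cms.untracked.string("MiniAOD.root"),
    # negative value: flush every 900 events instead of by size
    eventAutoFlushCompressedSize = cms.untracked.int32(-900),
)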

Dr15Jones commented 1 year ago

The uses of the cut parser in the HLTMuonOfflineAnalyzer have the following configuration options. Notice that the cuts are all the same and trivial (a sketch of factoring out the shared PSet follows the dumps below).

cms.EDProducer("HLTMuonOfflineAnalyzer",
[cut]
    probeParams = cms.PSet(
        d0Cut = cms.untracked.double(2.0),
        hltCuts = cms.untracked.string('abs(eta) < 2.4'),
        recoCuts = cms.untracked.string('isGlobalMuon && abs(eta) < 2.4'),
        z0Cut = cms.untracked.double(25.0)
    ),
    requiredTriggers = cms.untracked.vstring('HLT_Mu17_TrkIsoVVL_v'),
    targetParams = cms.PSet(
        d0Cut = cms.untracked.double(2.0),
        hltCuts = cms.untracked.string('abs(eta) < 2.4'),
        recoCuts = cms.untracked.string('isGlobalMuon && abs(eta) < 2.4'),
        z0Cut = cms.untracked.double(25.0)
    )
)

cms.EDProducer("HLTMuonOfflineAnalyzer",
[cut]
    probeParams = cms.PSet(
        d0Cut = cms.untracked.double(2.0),
        hltCuts = cms.untracked.string('abs(eta) < 2.4'),
        recoCuts = cms.untracked.string('isGlobalMuon && abs(eta) < 2.4'),
        z0Cut = cms.untracked.double(25.0)
    ),
    requiredTriggers = cms.untracked.vstring('HLT_Mu19_TrkIsoVVL_v'),
    targetParams = cms.PSet(
        d0Cut = cms.untracked.double(2.0),
        hltCuts = cms.untracked.string('abs(eta) < 2.4'),
        recoCuts = cms.untracked.string('isGlobalMuon && abs(eta) < 2.4'),
        z0Cut = cms.untracked.double(25.0)
    )
)

cms.EDProducer("HLTMuonOfflineAnalyzer",
[cut]
    probeParams = cms.PSet(
        d0Cut = cms.untracked.double(2.0),
        hltCuts = cms.untracked.string('abs(eta) < 2.4'),
        recoCuts = cms.untracked.string('isGlobalMuon && abs(eta) < 2.4'),
        z0Cut = cms.untracked.double(25.0)
    ),
    requiredTriggers = cms.untracked.vstring(),
    targetParams = cms.PSet(
        d0Cut = cms.untracked.double(2.0),
        hltCuts = cms.untracked.string('abs(eta) < 2.4'),
        recoCuts = cms.untracked.string('isGlobalMuon && abs(eta) < 2.4'),
        z0Cut = cms.untracked.double(25.0)
    )
)

cms.EDProducer("HLTMuonOfflineAnalyzer",
[cut]
    probeParams = cms.PSet(
        d0Cut = cms.untracked.double(2.0),
        hltCuts = cms.untracked.string('abs(eta) < 2.4'),
        recoCuts = cms.untracked.string('isGlobalMuon && abs(eta) < 2.4'),
        z0Cut = cms.untracked.double(25.0)
    ),
    requiredTriggers = cms.untracked.vstring(),
    targetParams = cms.PSet(
        d0Cut = cms.untracked.double(2.0),
        hltCuts = cms.untracked.string('abs(eta) < 2.4'),
        recoCuts = cms.untracked.string('isGlobalMuon && abs(eta) < 2.4'),
        z0Cut = cms.untracked.double(25.0)
    )
)
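Since the four instances differ only in requiredTriggers, the duplicated PSets could be built from one shared template in the configuration. A sketch (module and variable names are illustrative; note this only removes config duplication — each module instance would still JIT-compile its own StringCutObjectSelector, so it does not by itself reclaim the cling memory):

import FWCore.ParameterSet.Config as cms

# Shared cut definition used by all four analyzer instances.
commonMuonParams = cms.PSet(
    d0Cut = cms.untracked.double(2.0),
    hltCuts = cms.untracked.string('abs(eta) < 2.4'),
    recoCuts = cms.untracked.string('isGlobalMuon && abs(eta) < 2.4'),
    z0Cut = cms.untracked.double(25.0)
)

def makeAnalyzer(triggers):
    # Clone the shared PSet for both the target and probe legs.
    return cms.EDProducer("HLTMuonOfflineAnalyzer",
        probeParams = commonMuonParams.clone(),
        targetParams = commonMuonParams.clone(),
        requiredTriggers = cms.untracked.vstring(*triggers)
    )

mu17Analyzer = makeAnalyzer(['HLT_Mu17_TrkIsoVVL_v'])
mu19Analyzer = makeAnalyzer(['HLT_Mu19_TrkIsoVVL_v'])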
Dr15Jones commented 1 year ago

So I took a look at the configuration for these jobs. It has [...]

This does not appear to be some simple re-reco.

Dr15Jones commented 1 year ago

@VinInn in the plot with the different memory managers, the orange line is labeled 'jemalloc 4 threads lxplus803 (xrootd)', but is it really 4 threads? It seems more likely to be 1 thread.

VinInn commented 1 year ago

Can somebody point us to a config of a failing rereco?

VinInn commented 1 year ago

@VinInn in the plot with the different memory managers, the orange line is labeled 'jemalloc 4 threads lxplus803 (xrootd)', but is it really 4 threads? It seems more likely to be 1 thread.

This was the very first job I ran. In the log file I read:

%MSG-i ThreadStreamSetup:  (NoModuleName) 13-Jul-2023 10:10:15 CEST pre-events
setting # threads 4
setting # streams 4

and I confirm:

[innocent@lxplus803 tauCrash]$ grep RSS  fullJob.log | head
MemoryCheck: module DigiTask:digiTask VSIZE 9849.67 0 RSS 5689.43 6.92969
[innocent@lxplus803 tauCrash]$ grep RSS fullJob.log | tail
MemoryCheck: module PrimaryVertexProducer:firstStepPrimaryVerticesPreSplitting VSIZE 14097.8 0 RSS 6009.75 2.32031

It is true that I cannot fully reproduce it; I ran again yesterday:

[innocent@lxplus803 tauCrash]$ grep RSS memJobJe7.log | head
MemoryCheck: module FastjetJetProducer:ak4PFJetsCHSNoHF VSIZE 9694.95 0 RSS 6328.03 0.421875
[innocent@lxplus803 tauCrash]$ grep RSS memJobJe7.log | tail
MemoryCheck: module CAHitTripletEDProducer:lowPtTripletStepHitTriplets VSIZE 14023.6 0 RSS 5664.54 5.03906

Somehow, at some point, it sends memory to swap (at event ~4000 it drops by 2 GB!).
drkovalskyi commented 1 year ago

Can somebody point us to a config of a failing rereco?

Could you clarify what you mean? https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/job/WMTaskSpace/cmsRun1/ is a failing one. Do you need the cmsDriver command that configured the workflow or do you need more failing job examples?

VinInn commented 1 year ago

That's OK, thanks @drkovalskyi. Just confirming that what we are using is the rereco config (with its 12 output files, etc.).

VinInn commented 1 year ago

@drkovalskyi It would be useful to have a failing log for a job that does not crash while opening a new file, as this one does, but rather crashes in the middle of processing.

VinInn commented 1 year ago

OS settings:

Here we compare slc7 with el8, with (default) and without THP, for jemalloc (default) and TCMalloc (ZSTD, because at some point I changed from LZMA for a test and never reverted).

noTHP means:
 echo never > /sys/kernel/mm/transparent_hugepage/enabled
 echo never > /sys/kernel/mm/transparent_hugepage/defrag

[image]
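To verify which THP mode a node is actually running, the sysfs knobs can be read directly; a small check (standard Linux paths, no CMSSW assumptions):

from pathlib import Path

# Current THP policy: the bracketed entry is the active one,
# e.g. "always [madvise] never".
for knob in ("enabled", "defrag"):
    path = Path("/sys/kernel/mm/transparent_hugepage") / knob
    print(f"THP {knob}: {path.read_text().strip()}")

# System-wide THP usage, in kB.
for line in Path("/proc/meminfo").read_text().splitlines():
    if line.startswith("AnonHugePages"):
        print(line)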

VinInn commented 1 year ago

Another road that could be useful to pursue is to try to identify memory that is used only at initialization and then never accessed again (I'm sure a lot of it is conditions and geometry).

On lxplus803 it continues to swap:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+  P   SWAP    DATA nMaj nDRT WCHAN      COMMAND
1775702 innocent  20   0   13.9g   3.9g 241160 R 379.7  13.8 978:49.84  9   3.0g   12.8g  59k    0 -          cmsRun

And vmstat shows much more memory swapped out than swapped in. So it is highly probable that of the total ~7 GB of memory that is usually resident, only 4 GB or so is actually used in each event...
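The per-process view of what top shows in the SWAP column can be read from /proc directly; a sketch using smaps_rollup (available on reasonably recent kernels; on older ones, summing /proc/&lt;pid&gt;/smaps by hand, as in the csh monitor above, works the same way):

import sys
from pathlib import Path

def memory_summary(pid):
    # smaps_rollup pre-aggregates the per-mapping smaps fields.
    fields = {}
    for line in Path(f"/proc/{pid}/smaps_rollup").read_text().splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key] = int(parts[0])  # kB
    return {k: fields.get(k, 0) for k in ("Rss", "Pss", "Swap")}

if __name__ == "__main__":
    print(memory_summary(int(sys.argv[1])))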

drkovalskyi commented 1 year ago

@drkovalskyi It would be useful to have a failing log for a job that does not crash while opening a new file as this one (crash in the middle of processing)

@VinInn sorry for the delay. It's getting harder to find these jobs because we are resubmitting with restricted sites. Here are a few examples:

1) https://cms-unified.web.cern.ch/cms-unified/joblogs/wangz_ACDC0_Run2022E_EGamma_27Jun2023_230709_174641_1131/50660/DataProcessing/0a62efc5-1f5c-4cac-b06e-02219576edb6-470-0-logArchive/ — reaches 10 GB and fails processing the first file on the 31st event

2) https://cms-unified.web.cern.ch/cms-unified/joblogs/wangz_ACDC0_Run2022E_EGamma_27Jun2023_230709_174641_1131/50660/DataProcessing/ — similar to 1)

If you do need a multi-file example, let me know.

VinInn commented 1 year ago

The first is pretty clear; some of the second ones show this bizarre PSS > RSS:

INFO:root:RUNNING SCRAM SCRIPTS
INFO:root:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'el8_amd64_gcc10', 'scramv1', 'CMSSW', 'CMSSW_12_4_14_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']
INFO:root:PSS: 2476858; RSS: 2390352; PCPU: 67.0; PMEM: 0.9
INFO:root:PSS: 9878110; RSS: 7297976; PCPU: 164; PMEM: 2.7
INFO:root:PSS: 9586476; RSS: 7347000; PCPU: 236; PMEM: 2.7
INFO:root:PSS: 9883890; RSS: 7388092; PCPU: 273; PMEM: 2.8
INFO:root:PSS: 10355298; RSS: 7431820; PCPU: 294; PMEM: 2.8
ERROR:root:Error in CMSSW step cmsRun1
Number of Cores: 4
Job has exceeded maxPSS: 10000 MB
Job has PSS: 10355 MB

And the machine is a 12-year-old AMD.

VinInn commented 1 year ago

This is the plot compared to the other job (updated with a second similar job):

[image]

The files are:

/store/data/Run2022E/EGamma/RAW/v1/000/360/125/00000/40b714dc-77f0-47d2-865e-ecc38bd53933.root
and
/store/data/Run2022E/EGamma/RAW/v1/000/359/751/00000/e99a04e2-d3f4-400f-b42b-3fb3240761d0.root

We can try to understand whether something really happens at event 3000.

VinInn commented 1 year ago

Do not hold your breath: from a quick look (skipping 3K events) it does not pick up. Will post plots when the jobs finish.

drkovalskyi commented 1 year ago

The re-reco campaign is making good progress with just 4 cores / 10 GB. We have blacklisted sites that had a large rate of failures. So in the end it can still be a site problem.

drkovalskyi commented 1 year ago

The two new examples were likely executed at T2_US_Nebraska.

drkovalskyi commented 1 year ago

Sorry, they both ran at T1_DE_KIT

VinInn commented 1 year ago

I ran on the first file 4 times: 1) as in production, 2) skipping the first 3000 events, 3) writing only AOD, 4) writing only MiniAOD (and, skipping the first 3000 events, writing only Skim and only DQM).

here is the plot: [image]

Notice how starting 3000 events later does not change the shape. Note also that I had changed the configuration to use ZSTD and a 20 MB buffer limit when I was doing some tests. I've now restored the default LZMA and the 314 MB buffer limit for AOD, as in production (a sketch of these settings follows): the baseline is now the same as in the crashing job.
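What the production-like AOD settings would look like, as a sketch (compressionAlgorithm, compressionLevel and eventAutoFlushCompressedSize are real PoolOutputModule parameters; the compression level here is an assumption):

import FWCore.ParameterSet.Config as cms

AODoutput = cms.OutputModule("PoolOutputModule",
    fileName = cms.untracked.string("AOD.root"),
    compressionAlgorithm = cms.untracked.string("LZMA"),  # production default
    compressionLevel = cms.untracked.int32(4),            # assumed level
    # 314572800 bytes = 300 * 1024 * 1024: the "314 MB" and the
    # "about 300 MB" quoted earlier in this thread are the same number.
    eventAutoFlushCompressedSize = cms.untracked.int32(314572800),
)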

VinInn commented 1 year ago

Conclusion: the behaviour is not reproducible in detail, and is not correlated with the event content.

drkovalskyi commented 1 year ago

Should we try running in Nebraska or KIT on hardware similar to what was used for the failed jobs?

VinInn commented 1 year ago

Should we try running in Nebraska or KIT on hardware similar to what was used for the failed jobs?

that would be optimal

drkovalskyi commented 1 year ago

Ok, let me check if we can arrange that

sextonkennedy commented 1 year ago

The re-reco campaign is making good progress with just 4 cores / 10 GB. We have blacklisted sites that had a large rate of failures. So in the end it can still be a site problem.

It is not a site problem if you say you will use 10 GB and then use more than that, as Vincenzo showed these jobs do. Throwing blame on sites will not help solve the problem.

drkovalskyi commented 1 year ago

I thought the behaviour was not reproduced for the latest failures, where only one file is open. I'm not blaming sites. I'm just suggesting to try in the environment where we've seen most of the failures.

VinInn commented 1 year ago

We are marginal: with the 300 MB AOD output buffer full, we need ~8.5 GB (I have updated the plot above). Then at some point memory increases again. In the killed job this is fast and abrupt, and not reproduced. Still, one may notice that in my test memory increases quickly to 9 GB toward the end of the job and then plateaus. This is not understood.