A new Issue was created by @ktht Karl Ehatht.
@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@bbockelm @pcanal @dan131riley FYI
I used IgProf's memory profiler along with IgProfService. I told the service to dump the memory profile after each event. I then told igprof-analyse to look at the MEM_TOTAL difference between the two dumps. It reports:
% total Total Calls Function
...
4.3 658'142'615 19'780 edm::DelayedReader::getProduct(edm::BranchID const&, edm::EDProductGetter const*, edm::ModuleCallingContext const*) [9]
4.3 658'142'615 19'780 edm::RootDelayedReader::getProduct_(edm::BranchID const&, edm::EDProductGetter const*) [10]
4.3 658'135'973 19'618 edm::RootTree::getEntry(TBranch*, long long) const [11]
4.3 658'135'973 19'618 TBranchElement::GetEntry(long long, int) [12]
4.3 658'135'973 19'618 TBranch::GetEntry(long long, int) [13]
4.3 658'135'941 19'617 TBranchElement::GetEntry(long long, int)'2 [14]
4.3 656'512'407 3'874 TBranch::GetBasket(int) [15]
4.3 656'512'407 3'874 TBasket::ReadBasketBuffers(long long, int, TFile*) [16]
3.1 474'466'271 391 TBuffer::Expand(int, bool) [17]
3.1 474'466'271 391 TStorage::ReAllocChar(char*, unsigned long, unsigned long) [18]
3.0 452'366'774 7'101 TBranchElement::GetEntry(long long, int)'3 [19]
I then copied the one event to a new ROOT file and looked at the sizes of the TBranches in the new file. There was absolutely nothing out of the ordinary.
I inserted my own version of TBuffer::Expand() that prints out the old and new sizes every time it is called (a sketch of that instrumentation follows the excerpts below). Comparing the problematic event with the one before it, there isn't any single allocation responsible; instead, lots of buffer allocations expand. The top two, with gDebug set to 2:
ReadBuffer, class:pat::TriggerObjectStandAlone, name=pathNames_, fType[0]=300, TStreamerSTL, bufpos=134, arr=0x7f9cfe1e9040, offset=224
Expanding buffer from 136 to 403138
Expanding buffer from 17192 to 12967119
ReadBuffer, class:pat::Photon, name=superClusterRelinked_, fType[0]=1000, TStreamerArtificial, bufpos=11569, arr=0x7f9d3363dc00, offset=1232
StreamerInfoAction, class:edm::Wrapper<vector<pat::Tau> >, name=present, fType[1]=18, TStreamerBasicType, bufpos=101, arr=0x7f9d33592ec0, offset=8 ()
Expanding buffer from 97 to 778829
Expanding buffer from 22873 to 12963093
If I merely copy that one event (no analysis), I see the same final sizes for the buffer expansions, for example:
ReadBuffer, class:pat::TriggerObjectStandAlone, name=pathNames_, fType[0]=300, TStreamerSTL, bufpos=134, arr=0x7f910327ea80, offset=224
Expanding buffer from 0 to 403138
Expanding buffer from 136 to 12967119
Copying just the previous (normal) event, there's a much smaller allocation:
ReadBuffer, class:pat::TriggerObjectStandAlone, name=pathNames_, fType[0]=300, TStreamerSTL, bufpos=12886, arr=0x7f2ed2905c00, offset=224
Expanding buffer from 136 to 18064
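For reference, a minimal sketch of the kind of instrumentation behind the "Expanding buffer" lines above; this assumes a locally rebuilt ROOT with a print statement added to TBuffer::Expand() and is not the exact patch used here:

// Hedged sketch, not the actual patch: a single print added at the top of
// ROOT's TBuffer::Expand() (core/base/src/TBuffer.cxx) in a local rebuild.
// fBufSize is TBuffer's current buffer size; newsize is the requested size.
void TBuffer::Expand(Int_t newsize, Bool_t copy)
{
   fprintf(stderr, "Expanding buffer from %d to %d\n", fBufSize, newsize);
   // ... the original ROOT implementation continues unchanged ...
}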
Adding up all the expansions gives around 62 MB for the normal event and 569 MB for the abnormal event, which agrees with the reported memory jump. I looked at slimmedPatTrigger and packedPFCandidates in ROOT, and the events looked very similar in number of paths and number of particles, with the normal event having slightly more. File sizes for the individual events are similar, and dumping the branch list doesn't show any dramatic differences.
So there's definitely something weird about the buffer allocations for that event, but it doesn't seem to be isolated to any single branch, and everything else about the event looks pretty normal.
Can you share 2 of the single event files so that I can look at what makes them different?
I just noticed that the single-event file does not show the anomalous memory allocations: a copy job reading the single-event file takes a normal amount of memory. Copying the first 1000 events, the output file also does not show the anomalous behavior. So the single-event output files I have aren't useful for debugging, as the copy operation apparently fixes the problem.
I've been using CMSSW_10_2_10, as that was one of the releases cited in the original report; the file was written with CMSSW_9_4_7, and there is at least one data format change between those releases. I'll try using CMSSW_9_4_7 and see if its behavior is any different.
CMSSW_9_4_7 shows the anomalous memory usage too. If I fast-clone the input file (which isn't possible with CMSSW_10_2_10 due to data format changes), the cloned output file shows the same behavior. If I skim off selected events, disabling fast cloning, then the output file doesn't show the problem. So there doesn't seem to be a way to produce a small output file that replicates the problem.
Do I understand correctly that you have a file (a) that, when read, has a large/high memory peak, but once you slow-copy it to a new file (b), that second file does not exhibit the high memory peak?
Yes, that appears to be the case.
So I am guessing that the input file is 'poorly' clustered and that there is one large basket somewhere ...
I found yet another example (there appear to be many more, as I just started looking into why my CRAB jobs failed): in the file /store/data/Run2017F/SingleElectron/MINIAOD/31Mar2018-v1/30000/68217F4A-1139-E811-B2EE-0025905B85E8.root
the 901st event (again!) causes an abnormal increase in memory consumption:
Begin processing the 900th record. Run 305063, Event 25795031, LumiSection 20 on stream 0 at 03-Jun-2019 15:48:38.958 EEST
MemoryCheck: event : VSIZE 2043.45 0 RSS 1345.11 0 HEAP-ARENA [ SIZE-BYTES 0 N-UNUSED-CHUNKS 1 TOP-FREE-BYTES 0 ] HEAP-MAPPED [ SIZE-BYTES 0 N-CHUNKS 0 ] HEAP-USED-BYTES 0 HEAP-UNUSED-BYTES 0
Begin processing the 901st record. Run 305063, Event 25859031, LumiSection 20 on stream 0 at 03-Jun-2019 15:48:39.218 EEST
MemoryCheck: event : VSIZE 2299.45 0 RSS 1615.54 0 HEAP-ARENA [ SIZE-BYTES 0 N-UNUSED-CHUNKS 1 TOP-FREE-BYTES 0 ] HEAP-MAPPED [ SIZE-BYTES 0 N-CHUNKS 0 ] HEAP-USED-BYTES 0 HEAP-UNUSED-BYTES 0
For MINIAOD we set the PoolOutputModule autoFlushSize to -900:
eventAutoFlushCompressedSize=cms.untracked.int32(-900),
This calls
eventTree_.setAutoFlush(-1*om->eventAutoFlushSize());
That flips the ROOT sign convention, setting the event TTree to flush all branch buffers to disk every 900 events. All the other common data formats are flushing at a size boundary, not a number of events. So if @pcanal is correct then it would appear that this setting is leading to poor clustering?
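For context, a sketch of the sign convention at the plain ROOT level (not CMSSW code; the values are illustrative):

// Hedged sketch of ROOT's TTree::SetAutoFlush convention:
// a positive argument flushes (and clusters) every N entries,
// a negative argument flushes roughly every |N| bytes of compressed data.
#include "TTree.h"

void configureAutoFlush(TTree& events)
{
   // What the MiniAOD setting above amounts to: a cluster every 900 events.
   events.SetAutoFlush(900);

   // A size target instead (what the other data tiers effectively use),
   // e.g. flush at roughly 30 MB of compressed data:
   // events.SetAutoFlush(-30 * 1024 * 1024);
}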
Dear experts, assuming that the issue is indeed due to poor clustering of the 900th event in the input MiniAOD files, can you propose a fix for users who want to run custom NanoAOD production? Are there any special settings or functions the users could try when reading the input MiniAOD files?
@pcanal is there a simple way to determine if the input file is poorly clustered?
Hi Dan,
First compare the result of tree->Print("clusters") in the 'broken' and 'working' cases.
If nothing obvious turns up, then we have to look at the number of baskets per branch and see if any branches are anomalous. (We might also want to check the reported basket size.)
If all of those fail, then we would need to find which branch in particular (if any) is actually provoking the peak.
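A sketch of these checks as a standalone ROOT macro (the "Events" tree name and the restriction to top-level branches are assumptions based on the files discussed here):

// Hedged sketch: print the cluster (flush) layout of the Events tree and a
// per-branch summary of basket counts and reported basket sizes.
#include "TBranch.h"
#include "TFile.h"
#include "TObjArray.h"
#include "TTree.h"
#include <cstdio>

void inspectClusters(const char* fileName)
{
   TFile* file = TFile::Open(fileName);
   if (!file || file->IsZombie()) return;
   auto* events = static_cast<TTree*>(file->Get("Events"));
   if (!events) return;

   // Cluster boundaries (flush points) for the whole tree.
   events->Print("clusters");

   // Basket count and the reported (default or optimized) basket size per branch.
   const TObjArray* branches = events->GetListOfBranches();
   for (Int_t i = 0; i < branches->GetEntriesFast(); ++i) {
      auto* branch = static_cast<TBranch*>(branches->At(i));
      std::printf("%-70s baskets=%d basketsize=%d entries=%lld\n",
                  branch->GetName(), branch->GetWriteBasket(),
                  branch->GetBasketSize(), branch->GetEntries());
   }
   file->Close();
}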
I can't recall why we manually adjusted the number of events for the MiniAOD. I would assume it's chasing some other clustering issue when the compression ratio is inconsistent? For NanoAOD, we had to work around the ROOT logic completely...
I'm not having much success making sense of this.
Looking at the first example input file, according to edmFileUtil -b, the first 4 branches in the Events tree have a basket size of 131072 and everything else supposedly has basket size 16384, like this:
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 65452 : Total Size= 1017297720 bytes File Size = 202031338 *
*Baskets : 2839 : Basket Size= 16384 bytes Compression= 5.04 *
*............................................................................*
which obviously can't be right.
If I make a copy of the entire file with the standard MINIAOD output module config, I get:
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 65452 : Total Size= 1017044484 bytes File Size = 195768122 *
*Baskets : 1043 : Basket Size= 12359248 bytes Compression= 5.20 *
*............................................................................*
There the basket size is more sensible, but the number of baskets still looks way too large.
If I skim off the first 2700 events, again with the standard MINIAOD output config, I get:
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 2700 : Total Size= 42078149 bytes File Size = 11062992 *
*Baskets : 904 : Basket Size= 12359248 bytes Compression= 3.80 *
*............................................................................*
so there appear to be a huge number of baskets in the first buffer flush, and the rest of the file has a reasonable number of baskets. Cutting that down to the first 900 events gives a basket for every entry:
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 900 : Total Size= 14110505 bytes File Size = 5767452 *
*Baskets : 900 : Basket Size= 12359248 bytes Compression= 2.44 *
*............................................................................*
but that pattern isn't repeated for every branch, for example:
*............................................................................*
*Br 248 :patJets_slimmedJetsPuppi__PAT.obj.userFloatLabels_ : *
* | vector<string> userFloatLabels_[patJets_slimmedJetsPuppi__PAT.obj_]*
*Entries : 900 : Total Size= 2687712 bytes File Size = 70360 *
*Baskets : 181 : Basket Size= 2761216 bytes Compression= 38.14 *
*............................................................................*
looks to have 5 entries per basket.
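To quantify that pattern, a hedged sketch that counts how many baskets of one branch start within the first flush interval (the tree name, branch name, and 900-event boundary are taken from the examples above and are assumptions for illustration):

// Hedged sketch: count the baskets of a single branch whose first entry lies
// inside the first flush interval (entries [0, firstClusterEnd)).
#include "TBranch.h"
#include "TFile.h"
#include "TTree.h"
#include <cstdio>

void basketsInFirstCluster(const char* fileName,
                           const char* branchName = "patJets_slimmedJetsPuppi__PAT.obj.userFloatLabels_",
                           Long64_t firstClusterEnd = 900)
{
   TFile* file = TFile::Open(fileName);
   if (!file || file->IsZombie()) return;
   auto* events = static_cast<TTree*>(file->Get("Events"));
   TBranch* branch = events ? events->GetBranch(branchName) : nullptr;
   if (!branch) return;

   const Long64_t* starts = branch->GetBasketEntry(); // first entry of each basket
   const Int_t nBaskets = branch->GetWriteBasket();   // baskets written to the file
   Int_t inFirst = 0;
   for (Int_t i = 0; i < nBaskets; ++i)
      if (starts[i] < firstClusterEnd) ++inFirst;
   std::printf("%s: %d of %d baskets start before entry %lld\n",
               branchName, inFirst, nBaskets, firstClusterEnd);
   file->Close();
}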
If I change the flush interval from a fixed 900 events to a 30 MB target (what we use for AOD), then ROOT picks a flush interval around 400 events, and for that branch and a 2700-event skim I get:
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 2700 : Total Size= 42005816 bytes File Size = 9292026 *
*Baskets : 391 : Basket Size= 5156864 bytes Compression= 4.52 *
*............................................................................*
consistent with an entry per basket and flush around 400 events. The memory used by a job reading the output skim drops by ~500 MB.
This effect isn't limited to this particular branch: many others show similar anomalies, while others look more reasonable.
I'm speculating that the original input file has a large number of small baskets on some branches early in the file, which is causing a lot of memory reallocations while reading the first autoflush interval, and then a bunch of much larger allocations when the job reaches the second buffer with more reasonable basket sizes. This would at least account for the memory jump at the 901st event. This pattern of many small baskets for many branches reproduces qualitatively, but not quantitatively, on copies and skims, and seems to be exacerbated by the relatively large autoflush interval used for MINIAOD.
Does this make any sense?
Note that the reported basket size is the "default" basket size for the branch. Once OptimizeBaskets is run (which is triggered at the end of the first auto-fill interval), these are all changed and the "basket size" becomes meaningless. For example, if the first 900 events for a given branch each hold >16 KB uncompressed, you will see 900 baskets as a result.
I think the only open question here is whether we want to revisit the fixed number of events for this data tier; you might need to start git blame-ing the configuration (let's hope we wrote a good commit message!).
This looks a lot like the NanoAOD case, where we want to optimize the baskets before accumulating enough data to actually create a cluster because, otherwise, the basket layout for that first cluster is horrible. For NanoAOD, we had to include quite a bit of code in CMSSW to work around ROOT "features"; I'd prefer not to have to do this for EDM files as well.
So we expect poor layout in the first auto-fill interval, especially for large objects, and we expect a bunch of buffer expansions at the end of the first interval when the basket size gets adjusted. One obvious question is why we haven't seen this problem with memory consumption before? Is there something unusual about these datasets?
Current compression settings come from PR #19589. The PR and commit comments don't mention MINIAOD, but there's a comment in EventContent_cff.py:
# MiniAOD is a bit special: the files tend to be so small that letting
# ROOT automatically determine when to flush is a surprisingly big overhead.
The files in question are 4 GB, so that may not be so relevant now. Should we make the MINIAODSIMEventContent autoflush 20 MB to match RAWMINIAODEventContent? What about MINIAODEventContent?
these are all changed and the "basket size" becomes meaningless.
I am not sure what you mean. After OptimizeBaskets, the value shown in TTree::Print for each branch is indeed the basket size that OptimizeBaskets picked. [Also, in the current algorithm the initial basket size has some influence on the resulting basket size.]
So we expect poor layout in the first auto-fill interval, especially for large objects
Indeed, if the user-provided size is not 'adequate' it will still be used until the end of the first auto-fill interval (presumably there is not enough information until then to make a good guess at the 'proper' size).
Looking at the first example input file ... everything else supposedly has basket size 16384
It seems that something is preventing the call to OptimizeBaskets during this file's production.
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 65452 : Total Size= 1017297720 bytes File Size = 202031338 *
*Baskets : 2839 : Basket Size= 16384 bytes Compression= 5.04 *
*............................................................................*
This and the rest of Dan's investigation are consistent with each entry containing an LHEEventProduct of size 15538.783 bytes, so indeed only one fits in each 16 KB basket. So for this branch, we will get 1 basket per entry until the end of the first cluster, at which point the baskets are resized to be closer to one basket per cluster (it appears in this case we end up with 1.7ish baskets per cluster).
I'm speculating that the original input file has a large number of small baskets on some branches early in the file, which is causing a lot of memory reallocations while reading the first autoflush interval, and then a bunch of much larger allocations when the job reaches the second buffer with more reasonable basket sizes. This would at least account for the memory jump at the 901st event.
This seems about right.
This pattern of many small baskets for many branches reproduces qualitatively, but not quantitatively, on copies and skims, and seems to be exacerbated by the relatively large autoflush interval used for MINIAOD.
It is also exacerbated by the mismatch between the initial branch basket size and the real data size.
It looks like we have the ability to set the initial basket size for a class as an attribute in the data dictionary; we aren't using that at all. Should we consider doing it for classes that are usually large and practically never split (particularly LHEEventProduct)?
Yes, I would recommend at the very least to try it out :)
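Pending the dictionary-attribute route, the effect can at least be illustrated with plain ROOT by fixing the basket size of the output branch before the first fill; a hedged sketch, using the branch name from the example above (this is TBranch::SetBasketSize, not the CMSSW dictionary mechanism being proposed):

// Hedged sketch: pre-set a larger initial basket size on a known-large,
// unsplit branch so the first flush interval does not end up with one basket
// per entry. The branch name and size are assumptions for illustration.
#include "TBranch.h"
#include "TTree.h"

void presetBasketSize(TTree& events)
{
   if (TBranch* branch = events.GetBranch("LHEEventProduct_externalLHEProducer__SIM.obj")) {
      branch->SetBasketSize(64 * 1024); // room for ~4 entries of ~15.5 kB each
   }
}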
Issue originally described here: https://github.com/cms-nanoAOD/cmssw/issues/373. It doesn't seem to be specific to the NanoAOD framework, though. Literally any module or task causes the memory to jump between the events, and the jump is proportional to the memory consumed by the tasks and modules. It looks like some interruption triggered by the offending event causes the workflow to re-allocate memory for the modules.
Original post: