A new Issue was created by @ktht Karl Ehatht.
@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@bbockelm @pcanal @dan131riley FYI
I used IgProf's memory profiler along with IgProfService. I told the service to dump the memory profile after each event. I then told igprof-analyse to look at the MEM_TOTAL difference between the two dumps. It reports:
% total Total Calls Function
...
4.3 658'142'615 19'780 edm::DelayedReader::getProduct(edm::BranchID const&, edm::EDProductGetter const*, edm::ModuleCallingContext const*) [9]
4.3 658'142'615 19'780 edm::RootDelayedReader::getProduct_(edm::BranchID const&, edm::EDProductGetter const*) [10]
4.3 658'135'973 19'618 edm::RootTree::getEntry(TBranch*, long long) const [11]
4.3 658'135'973 19'618 TBranchElement::GetEntry(long long, int) [12]
4.3 658'135'973 19'618 TBranch::GetEntry(long long, int) [13]
4.3 658'135'941 19'617 TBranchElement::GetEntry(long long, int)'2 [14]
4.3 656'512'407 3'874 TBranch::GetBasket(int) [15]
4.3 656'512'407 3'874 TBasket::ReadBasketBuffers(long long, int, TFile*) [16]
3.1 474'466'271 391 TBuffer::Expand(int, bool) [17]
3.1 474'466'271 391 TStorage::ReAllocChar(char*, unsigned long, unsigned long) [18]
3.0 452'366'774 7'101 TBranchElement::GetEntry(long long, int)'3 [19]
I then copied the one event to a new ROOT file and looked at the sizes of the TBranches in the new file. There was absolutely nothing out of the ordinary.
I inserted my own version of TBuffer::Expand() that prints out the old and new sizes every time it is called (a sketch of that instrumentation follows the excerpts below). Comparing the problematic event with the one before it, there isn't any single allocation responsible; instead, lots of buffer allocations expand. The top two, with gDebug set to 2:
ReadBuffer, class:pat::TriggerObjectStandAlone, name=pathNames_, fType[0]=300, TStreamerSTL, bufpos=134, arr=0x7f9cfe1e9040, offset=224
Expanding buffer from 136 to 403138
Expanding buffer from 17192 to 12967119
ReadBuffer, class:pat::Photon, name=superClusterRelinked_, fType[0]=1000, TStreamerArtificial, bufpos=11569, arr=0x7f9d3363dc00, offset=1232
StreamerInfoAction, class:edm::Wrapper<vector<pat::Tau> >, name=present, fType[1]=18, TStreamerBasicType, bufpos=101, arr=0x7f9d33592ec0, offset=8 ()
Expanding buffer from 97 to 778829
Expanding buffer from 22873 to 12963093
If I merely copy that one event (no analysis), I see the same final sizes for the buffer expansions, for example:
ReadBuffer, class:pat::TriggerObjectStandAlone, name=pathNames_, fType[0]=300, TStreamerSTL, bufpos=134, arr=0x7f910327ea80, offset=224
Expanding buffer from 0 to 403138
Expanding buffer from 136 to 12967119
Copying just the previous (normal) event, there's a much smaller allocation:
ReadBuffer, class:pat::TriggerObjectStandAlone, name=pathNames_, fType[0]=300, TStreamerSTL, bufpos=12886, arr=0x7f2ed2905c00, offset=224
Expanding buffer from 136 to 18064
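For reference, a minimal sketch of the kind of instrumentation behind the "Expanding buffer" lines above; this assumes a locally rebuilt ROOT with a print statement added to TBuffer::Expand() and is not the exact patch used here:

// Hedged sketch, not the actual patch: a single print added at the top of
// ROOT's TBuffer::Expand() (core/base/src/TBuffer.cxx) in a local rebuild.
// fBufSize is TBuffer's current buffer size; newsize is the requested size.
void TBuffer::Expand(Int_t newsize, Bool_t copy)
{
   fprintf(stderr, "Expanding buffer from %d to %d\n", fBufSize, newsize);
   // ... the original ROOT implementation continues unchanged ...
}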
Adding up all the expansions gives around 62 MB for the normal event and 569 MB for the abnormal event, which agrees with the reported memory jump. I looked at slimmedPatTrigger and packedPFCandidates in ROOT, and the events looked very similar in number of paths and number of particles, with the normal event having slightly more. File sizes for the individual events are similar, and dumping the branch list doesn't show any dramatic differences.
So there's definitely something weird about the buffer allocations for that event, but it doesn't seem to be isolated to any single branch, and everything else about the event looks pretty normal.
Can you share 2 of the single event files so that I can look at what makes them different?
I just noticed that the single-event file does not show the anomalous memory allocations: a copy job reading the single-event file takes a normal amount of memory. Copying the first 1000 events, the output file also does not show the anomalous behavior. So the single-event output files I have aren't useful for debugging, as the copy operation apparently fixes the problem.
I've been using CMSSW_10_2_10, as that was one of the releases cited in the original report; the file was written with CMSSW_9_4_7, and there is at least one data format change between those releases. I'll try using CMSSW_9_4_7 and see if its behavior is any different.
CMSSW_9_4_7 shows the anomalous memory usage too. If I fast-clone the input file (which isn't possible with CMSSW_10_2_10 due to data format changes), the cloned output file shows the same behavior. If I skim off selected events, disabling fast cloning, then the output file doesn't show the problem. So there doesn't seem to be a way to produce a small output file that replicates the problem.
Do I understand correctly that you have a file (a) that, when read, has a large/high memory peak, but once you slow-copy it to a new file (b), that second file does not exhibit the high memory peak?
Yes, that appears to be the case.
So I am guessing that the input file is 'poorly' clustered and that there is one large basket somewhere ...
I found yet another example (there appear to be many more, as I just started looking into why my CRAB jobs failed): in the file /store/data/Run2017F/SingleElectron/MINIAOD/31Mar2018-v1/30000/68217F4A-1139-E811-B2EE-0025905B85E8.root
the 901st event (again!) causes an abnormal increase in memory consumption:
Begin processing the 900th record. Run 305063, Event 25795031, LumiSection 20 on stream 0 at 03-Jun-2019 15:48:38.958 EEST
MemoryCheck: event : VSIZE 2043.45 0 RSS 1345.11 0 HEAP-ARENA [ SIZE-BYTES 0 N-UNUSED-CHUNKS 1 TOP-FREE-BYTES 0 ] HEAP-MAPPED [ SIZE-BYTES 0 N-CHUNKS 0 ] HEAP-USED-BYTES 0 HEAP-UNUSED-BYTES 0
Begin processing the 901st record. Run 305063, Event 25859031, LumiSection 20 on stream 0 at 03-Jun-2019 15:48:39.218 EEST
MemoryCheck: event : VSIZE 2299.45 0 RSS 1615.54 0 HEAP-ARENA [ SIZE-BYTES 0 N-UNUSED-CHUNKS 1 TOP-FREE-BYTES 0 ] HEAP-MAPPED [ SIZE-BYTES 0 N-CHUNKS 0 ] HEAP-USED-BYTES 0 HEAP-UNUSED-BYTES 0
For MINIAOD we set the PoolOutputModule autoFlushSize to -900:
eventAutoFlushCompressedSize=cms.untracked.int32(-900),
This calls
eventTree_.setAutoFlush(-1*om->eventAutoFlushSize());
That flips the ROOT sign convention, setting the event TTree to flush all branch buffers to disk every 900 events. All the other common data formats are flushing at a size boundary, not a number of events. So if @pcanal is correct then it would appear that this setting is leading to poor clustering?
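For context, a sketch of the sign convention at the plain ROOT level (not CMSSW code; the values are illustrative):

// Hedged sketch of ROOT's TTree::SetAutoFlush convention:
// a positive argument flushes (and clusters) every N entries,
// a negative argument flushes roughly every |N| bytes of compressed data.
#include "TTree.h"

void configureAutoFlush(TTree& events)
{
   // What the MiniAOD setting above amounts to: a cluster every 900 events.
   events.SetAutoFlush(900);

   // A size target instead (what the other data tiers effectively use),
   // e.g. flush at roughly 30 MB of compressed data:
   // events.SetAutoFlush(-30 * 1024 * 1024);
}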
Dear experts, assuming that the issue is indeed due to poor clustering of the 900th event in the input MiniAOD files, can you propose a fix for users who want to run custom NanoAOD production? Are there any special settings or functions the users could try when reading the input MiniAOD files?
@pcanal is there a simple way to determine if the input file is poorly clustered?
Hi Dan,
First compare the result of tree->Print("clusters") in the 'broken' and 'working' cases.
If nothing obvious turns up, then we have to look at the number of baskets per branch and see if any branches are anomalous. (We might also want to check the reported basket size.)
If all of those fail, then we would need to find which branch in particular (if any) is actually provoking the peak.
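A sketch of these checks as a standalone ROOT macro (the "Events" tree name and the restriction to top-level branches are assumptions based on the files discussed here):

// Hedged sketch: print the cluster (flush) layout of the Events tree and a
// per-branch summary of basket counts and reported basket sizes.
#include "TBranch.h"
#include "TFile.h"
#include "TObjArray.h"
#include "TTree.h"
#include <cstdio>

void inspectClusters(const char* fileName)
{
   TFile* file = TFile::Open(fileName);
   if (!file || file->IsZombie()) return;
   auto* events = static_cast<TTree*>(file->Get("Events"));
   if (!events) return;

   // Cluster boundaries (flush points) for the whole tree.
   events->Print("clusters");

   // Basket count and the reported (default or optimized) basket size per branch.
   const TObjArray* branches = events->GetListOfBranches();
   for (Int_t i = 0; i < branches->GetEntriesFast(); ++i) {
      auto* branch = static_cast<TBranch*>(branches->At(i));
      std::printf("%-70s baskets=%d basketsize=%d entries=%lld\n",
                  branch->GetName(), branch->GetWriteBasket(),
                  branch->GetBasketSize(), branch->GetEntries());
   }
   file->Close();
}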
I can't recall why we manually adjusted the number of events for the MiniAOD. I would assume it's chasing some other clustering issue when the compression ratio is inconsistent? For NanoAOD, we had to work around the ROOT logic completely...
I'm not having much success making sense of this.
Looking at the first example input file, according to edmFileUtil -b, the first 4 branches in the Events tree have a basket size of 131072 and everything else supposedly has basket size 16384, like this:
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 65452 : Total Size= 1017297720 bytes File Size = 202031338 *
*Baskets : 2839 : Basket Size= 16384 bytes Compression= 5.04 *
*............................................................................*
which obviously can't be right.
If I make a copy of the entire file with the standard MINIAOD output module config, I get:
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 65452 : Total Size= 1017044484 bytes File Size = 195768122 *
*Baskets : 1043 : Basket Size= 12359248 bytes Compression= 5.20 *
*............................................................................*
There the basket size is more sensible, but the number of baskets still looks way too large.
If I skim off the first 2700 events, again with the standard MINIAOD output config, I get:
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 2700 : Total Size= 42078149 bytes File Size = 11062992 *
*Baskets : 904 : Basket Size= 12359248 bytes Compression= 3.80 *
*............................................................................*
so there appear to be a huge number of baskets in the first buffer flush, and the rest of the file has a reasonable number of baskets. Cutting that down to the first 900 events gives a basket for every entry:
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 900 : Total Size= 14110505 bytes File Size = 5767452 *
*Baskets : 900 : Basket Size= 12359248 bytes Compression= 2.44 *
*............................................................................*
but that pattern isn't repeated for every branch, for example:
*............................................................................*
*Br 248 :patJets_slimmedJetsPuppi__PAT.obj.userFloatLabels_ : *
* | vector<string> userFloatLabels_[patJets_slimmedJetsPuppi__PAT.obj_]*
*Entries : 900 : Total Size= 2687712 bytes File Size = 70360 *
*Baskets : 181 : Basket Size= 2761216 bytes Compression= 38.14 *
*............................................................................*
looks to have 5 entries per basket.
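To quantify that pattern, a hedged sketch that counts how many baskets of one branch start within the first flush interval (the tree name, branch name, and 900-event boundary are taken from the examples above and are assumptions for illustration):

// Hedged sketch: count the baskets of a single branch whose first entry lies
// inside the first flush interval (entries [0, firstClusterEnd)).
#include "TBranch.h"
#include "TFile.h"
#include "TTree.h"
#include <cstdio>

void basketsInFirstCluster(const char* fileName,
                           const char* branchName = "patJets_slimmedJetsPuppi__PAT.obj.userFloatLabels_",
                           Long64_t firstClusterEnd = 900)
{
   TFile* file = TFile::Open(fileName);
   if (!file || file->IsZombie()) return;
   auto* events = static_cast<TTree*>(file->Get("Events"));
   TBranch* branch = events ? events->GetBranch(branchName) : nullptr;
   if (!branch) return;

   const Long64_t* starts = branch->GetBasketEntry(); // first entry of each basket
   const Int_t nBaskets = branch->GetWriteBasket();   // baskets written to the file
   Int_t inFirst = 0;
   for (Int_t i = 0; i < nBaskets; ++i)
      if (starts[i] < firstClusterEnd) ++inFirst;
   std::printf("%s: %d of %d baskets start before entry %lld\n",
               branchName, inFirst, nBaskets, firstClusterEnd);
   file->Close();
}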
If I change the flush interval from a fixed 900 events to a 30 MB target (what we use for AOD), then ROOT picks a flush interval around 400 events, and for that branch and a 2700-event skim I get:
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 2700 : Total Size= 42005816 bytes File Size = 9292026 *
*Baskets : 391 : Basket Size= 5156864 bytes Compression= 4.52 *
*............................................................................*
consistent with an entry per basket and flush around 400 events. The memory used by a job reading the output skim drops by ~500 MB.
This effect isn't limited to this particular branch: many others show similar anomalies, while others look more reasonable.
I'm speculating that the original input file has a large number of small baskets on some branches early in the file, which is causing a lot of memory reallocations while reading the first autoflush interval, and then a bunch of much larger allocations when the job reaches the second buffer with more reasonable basket sizes. This would at least account for the memory jump at the 901st event. This pattern of many small baskets for many branches reproduces qualitatively, but not quantitatively, on copies and skims, and seems to be exacerbated by the relatively large autoflush interval used for MINIAOD.
Does this make any sense?
Note that the reported basket size is the "default" basket size for the branch. Once OptimizeBaskets is run (which is triggered at the end of the first auto-fill interval), these are all changed and the "basket size" becomes meaningless. For example, if the first 900 events for a given branch each hold >16 KB uncompressed, you will see 900 baskets as a result.
I think the only open question here is whether we want to revisit the fixed number of events for this data tier; you might need to start git blame-ing the configuration (let's hope we wrote a good commit message!).
This looks a lot like the NanoAOD case, where we want to optimize the baskets before accumulating enough data to actually create a cluster because, otherwise, the basket layout for that first cluster is horrible. For NanoAOD, we had to include quite a bit of code in CMSSW to work around ROOT "features"; I'd prefer not to have to do this for EDM files as well.
So we expect poor layout in the first auto-fill interval, especially for large objects, and we expect a bunch of buffer expansions at the end of the first interval when the basket size gets adjusted. One obvious question is why we haven't seen this problem with memory consumption before? Is there something unusual about these datasets?
Current compression settings come from PR #19589. The PR and commit comments don't mention MINIAOD, but there's a comment in EventContent_cff.py:
# MiniAOD is a bit special: the files tend to be so small that letting
# ROOT automatically determine when to flush is a surprisingly big overhead.
The files in question are 4 GB, so that may not be so relevant now. Should we make the MINIAODSIMEventContent autoflush 20 MB to match RAWMINIAODEventContent? What about MINIAODEventContent?
these are all changed and the "basket size" becomes meaningless.
I am not sure what you mean. After OptimizeBaskets, the value shown in TTree::Print for each branch is indeed the basket size that OptimizeBaskets picked. [Also, in the current algorithm the initial basket size has some influence on the resulting basket size.]
So we expect poor layout in the first auto-fill interval, especially for large objects
Indeed, if the user-provided size is not 'adequate' it will still be used until the end of the first auto-fill interval (presumably there is not enough information until then to make a good guess at the 'proper' size).
Looking at the first example input file ... everything else supposedly has basket size 16384
It seems that something is preventing the call to OptimizeBaskets during this file's production.
*............................................................................*
*Br 7 :LHEEventProduct_externalLHEProducer__SIM.obj : LHEEventProduct *
*Entries : 65452 : Total Size= 1017297720 bytes File Size = 202031338 *
*Baskets : 2839 : Basket Size= 16384 bytes Compression= 5.04 *
*............................................................................*
This and the rest of Dan's investigation are consistent with each entry containing an LHEEventProduct of size 15538.783 bytes, so indeed only one fits in each 16 KB basket. So for this branch, we will get 1 basket per entry until the end of the first cluster, at which point the baskets are resized to be closer to one basket per cluster (it appears in this case we end up with 1.7ish baskets per cluster).
I'm speculating that the original input file has a large number of small baskets on some branches early in the file, which is causing a lot of memory reallocations while reading the first autoflush interval, and then a bunch of much larger allocations when the job reaches the second buffer with more reasonable basket sizes. This would at least account for the memory jump at the 901st event.
This seems about right.
This pattern of many small baskets for many branches reproduces qualitatively, but not quantitatively, on copies and skims, and seems to be exacerbated by the relatively large autoflush interval used for MINIAOD.
It is also exacerbated by the mismatch between the initial branch basket size and the real data size.
It looks like we have the ability to set the initial basket size for a class as an attribute in the data dictionary; we aren't using that at all. Should we consider doing it for classes that are usually large and practically never split (particularly LHEEventProduct)?
Yes, I would recommend at the very least to try it out :)
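Pending the dictionary-attribute route, the effect can at least be illustrated with plain ROOT by fixing the basket size of the output branch before the first fill; a hedged sketch, using the branch name from the example above (this is TBranch::SetBasketSize, not the CMSSW dictionary mechanism being proposed):

// Hedged sketch: pre-set a larger initial basket size on a known-large,
// unsplit branch so the first flush interval does not end up with one basket
// per entry. The branch name and size are assumptions for illustration.
#include "TBranch.h"
#include "TTree.h"

void presetBasketSize(TTree& events)
{
   if (TBranch* branch = events.GetBranch("LHEEventProduct_externalLHEProducer__SIM.obj")) {
      branch->SetBasketSize(64 * 1024); // room for ~4 entries of ~15.5 kB each
   }
}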
Issue originally described here: https://github.com/cms-nanoAOD/cmssw/issues/373. It doesn't seem to be specific to the NanoAOD framework, though. Literally any module or task causes the memory to jump between the events, and the jump is proportional to the memory consumed by the tasks and modules. It looks like some interruption triggered by the offending event causes the workflow to re-allocate memory for the modules.
Original post: