cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.28k forks source link

T0 prompt-RECO memory problem in run 379524 #44795

Closed Dr15Jones closed 4 months ago

Dr15Jones commented 5 months ago

The prompt reco for run 379524 was failing for luminosity block 75 due to excessive memory usage:

https://cms-talk.web.cern.ch/t/high-memory-usage-for-promptreco-specialzerobias-pd/39326/6

After some investigation, type problem was isolated to one Event causing the problem run 379524, lumi 75, event 178862574

.

Dr15Jones commented 5 months ago

assign reconstruction

cmsbuild commented 5 months ago

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 5 months ago

cms-bot internal usage

cmsbuild commented 5 months ago

A new Issue was created by @Dr15Jones.

@smuzaffar, @makortel, @rappoccio, @Dr15Jones, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 5 months ago

One can run the job by replacing the output modules with AsciiOutputModules (to save time and space). If one un-tars the job files and copies the input file to a local directory, one can use the following script to run the job with the replaced output modules

from PSet import process
import FWCore.ParameterSet.Config as cms

process.source.fileNames[0] = 'file:10365d29-a39a-4e8a-83aa-2d0fe1a638a2.root'
#using the following will allow the job to succeed
#process.source.eventsToSkip = cms.untracked.VEventRange( [cms.EventRange("379524:75:178862574")])

# this should jump right to the problem event
#process.source.skipEvents = cms.untracked.uint32(12661)

for name,mod in process.outputModules.items():
    n = cms.OutputModule("AsciiOutputModule", outputCommands = mod.outputCommands, verbosity = cms.untracked.uint32(0))
    setattr(process,name,n)
mandrenguyen commented 5 months ago

Thanks @Dr15Jones I confirm that I can reproduce the problem with the Pkl. The memory is exceeding 15 GB RSS.

The reco is getting stuck in the following module, which is new for this year IIRC: process.displacedRegionalStepSeedingVertices cms.EDProducer("DisplacedRegionSeedingVertexProducer",

@cms-sw/tracking-pog-l2 any ideas how to properly protect this module, e.g., from beam backgrounds?

mmusich commented 5 months ago

@aehart FYI

Dr15Jones commented 4 months ago

@mandrenguyen Looking at the log I generated (which includes a prototype per module memory monitor) I see that the job also got stuck for 20 minutes in CkfTrackCandidateMaker:jetCoreRegionalStepEndcapTrackCandidates.

Here is some of the info I found from the memory monitor for this event

DisplacedRegionSeedingVertexProducer:displacedRegionalStepSeedingVertices # allocs: 4,867,272,868 # dealocs 4,867,272,858 memory added to job after produce finished (bytes): 4,194,496 max memory allocated by the module during the event at one time (bytes): 5,697,859,040

The next largest memory users were V0Producer:displacedRegionalStepSeedingV0Candidates # allocs: 12,492,305 # deallocs: 12,100,008 memory added to job after produce finished (bytes):69,122,400 max memory allocated by the module during the event at one time (bytes):138,244,544

TrackProducer:jetCoreRegionalStepEndcapTracks # allocs: 4,720,722 # deallocs: 4,605,171 memory added to job after produce finished (bytes):15,400,392 max memory allocated by the module during the event at one time (bytes):127,930,264

mandrenguyen commented 4 months ago

By adding the following lines to PSet.py the reconstruction finishes successfully: process.displacedRegionalStepSeedingV0Candidates.tkPtCut = 9999. process.displacedRegionalStepSeedingV0Candidates.tkNHitsCut = 99

There are plenty of messages like the following, which might be a clue how to get the displacedRegionalStepSeedingVertices to quit early:

%MSG-w TooManyPFDVCandidates: PFDisplacedVertexProducer:particleFlowDisplacedVertex 22-Apr-2024 15:14:27 CEST Run: 379524 Event: 178862574 gave up vertex reco for 1254 tracks

mandrenguyen commented 4 months ago

@Dr15Jones Yes, I also notice jetCoreRegionalStepEndcapTrackCandidates

Although it takes a long time, displacedRegionalStepSeedingVertices looks even more problematic, so I figured we should start there.

dan131riley commented 4 months ago

If there's an event that takes a huge amount of resources, it very likely will be the last to complete, so you get a memory spike just before the endjob can complete.

We have other cases of RECO taking ridiculous resources for beam splash events, see #37362. Is this similar?

slava77 commented 4 months ago

By adding the following lines to PSet.py the reconstruction finishes successfully: process.displacedRegionalStepSeedingV0Candidates.tkPtCut = 9999. process.displacedRegionalStepSeedingV0Candidates.tkNHitsCut = 99

doesn't this just disable the tracking in this iteration?

As in a few other cases, there should be just some cutoff to combinatorics.

slava77 commented 4 months ago

type tracking

Dr15Jones commented 4 months ago

We have other cases of RECO taking ridiculous resources for beam splash events, see https://github.com/cms-sw/cmssw/issues/37362. Is this similar?

In this case, the event appeared early enough to basically kill the job 2/3rds the way through the LuminosityBlock. But if such an event appeared closer to the end, it would also spike just before the end job transition.

slava77 commented 4 months ago

We have other cases of RECO taking ridiculous resources for beam splash events, see #37362. Is this similar?

tracker is off during splashes

mandrenguyen commented 4 months ago

By adding the following lines to PSet.py the reconstruction finishes successfully: process.displacedRegionalStepSeedingV0Candidates.tkPtCut = 9999. process.displacedRegionalStepSeedingV0Candidates.tkNHitsCut = 99

doesn't this just disable the tracking in this iteration?

As in a few other cases, there should be just some cutoff to combinatorics.

Right, I just wanted to check whether there was another problematic module downstream, but it appears not to be the case. Of course you're right that some cutoff is needed for combinatorics. You're the expert :-)

davidlange6 commented 4 months ago

as coded, DisplacedRegionSeedingVertexProducer is going to do a lot of allocations given big events... what is its typical cpu fraction in a normal pu60 event?

Dr15Jones commented 4 months ago

as coded, DisplacedRegionSeedingVertexProducer is going to do a lot of allocations given big events... what is its typical cpu fraction in a normal pu60 event?

So grepping the log for that module's memory use, it looks like the vast majority of Events use at one time 260,000 bytes with one event using 5,686,768 and then the one pathological event using 5,697,859,040.

davidlange6 commented 4 months ago

https://github.com/cms-sw/cmssw/blob/master/RecoTracker/DisplacedRegionalTracking/plugins/DisplacedRegionSeedingVertexProducer.cc#L125

seems to be doing N^2 sorts on a potentially N^2 length list of something this O(100) bytes. Since whats being done is to find best pair of candidates in each iteration, one could probably could consider doing this list management with a few arrays (indices, is_valid)

Dr15Jones commented 4 months ago

So I did a log/log histogram of the max amount of memory used at one time by DisplacedRegionSeedingVertexProducer for each event (x axis) vs the number of events with that memory usage: image

Dr15Jones commented 4 months ago

seems to be doing N^2 sorts on a potentially N^2 length list of something this O(100) bytes. Since whats being done is to find best pair of candidates in each iteration, one could probably could consider doing this list management with a few arrays (indices, is_valid)

The use of std::list probably isn't helping much either as it is probably the cause of the huge number of allocations/deallocations.

VinInn commented 4 months ago

The algorithm looks similar (in intent) to FastJet one... Maybe somebody can investigate if using the fastJet Heap (that allows removal of any element not just pop) would not be more efficient

see https://fastjet.fr/repo/doxygen-3.4.2/classfastjet_1_1MinHeap.html how to use it? https://fastjet.fr/repo/doxygen-3.2.0/ClusterSequence__TiledN2_8cc_source.html line 820 (compare with the slower methods above)

VinInn commented 4 months ago

Of course CMSSW is plenty of other, better behaving, cluster algorithms: reuse maybe worth.

davidlange6 commented 4 months ago

I looked a bit more - in the event in question

so changing from this Distance struct to just a vector of floats to cache the distances is probably already sufficient memory wise..

anyway, even after solving any memory issue, it looks like the algorithm will take many hours/days to run this event (limited by https://github.com/cms-sw/cmssw/blob/master/RecoTracker/DisplacedRegionalTracking/plugins/DisplacedVertexCluster.cc#L12 iiuc)

slava77 commented 4 months ago

I looked a bit more - in the event in question

* it takes 20 minutes on a pretty fast machine to even get to the problematic module (so there are other issues for this event)

* there are 130k `pseudoROIs` (as each `TrackCluster` becomes a `pseudoROI`) (*88 bytes each is 10MB)

* then there are 300M combos of `pseudoROIs` considered in the distance calculation (*16 bytes each is 5GB)

so changing from this Distance struct to just a vector of floats to cache the distances is probably already sufficient memory wise..

anyway, even after solving any memory issue, it looks like the algorithm will take many hours/days to run this event (limited by https://github.com/cms-sw/cmssw/blob/master/RecoTracker/DisplacedRegionalTracking/plugins/DisplacedVertexCluster.cc#L12 iiuc)

@aehart is it practical to truncate pseudoROIs at 10K or will it affect some possible physics of (your) interest?

slava77 commented 4 months ago

is it practical to truncate pseudoROIs at 10K or will it affect some possible physics of (your) interest?

I got a response by email from Yuri: we are never supposed to have so many ROIs! A mild cut in it would be just fine.

Since @davidlange6 you already have this setup in place, perhaps you can check if a truncation at 10K solves the memory/timing problem.

davidlange6 commented 4 months ago

ha - i've rewritten a bunch of stuff... can have a look tomorrow I think.

The first 10k is sufficient? [that should certainly solve this..]

davidlange6 commented 4 months ago

it seems a limit of 10k means 1hr for the plugin to process this event.. Perhaps lower is more appropriate?

slava77 commented 4 months ago

it seems a limit of 10k means 1hr for the plugin to process this event.. Perhaps lower is more appropriate?

If the preceding parts take 20 mins, I'd aim for something smaller but perhaps not too greedy, so, aiming for 2 mins, perhaps. I'm not sure which power is the explosion. Would 3K be enough?

mandrenguyen commented 4 months ago

it seems a limit of 10k means 1hr for the plugin to process this event.. Perhaps lower is more appropriate?

If the preceding parts take 20 mins, I'd aim for something smaller but perhaps not too greedy, so, aiming for 2 mins, perhaps. I'm not sure which power is the explosion. Would 3K be enough?

I'm not sure I would use the running time of preceding modules to decide where to put the limit. It appears there is a problem with jetCoreRegionalStepEndcapTracks that presumably should also be fixed by a seeding limit of one type or another. Barring that module the event is processed much faster. Shouldn't we instead decide on the limit based on the distribution of pseudoROIs in normal collision events? I think it's reasonably safe to assume this is some sort of beam background event.

davidlange6 commented 4 months ago

ok, my 1 hr is now like 1 minute after merging some code improvements.

https://github.com/cms-sw/cmssw/pull/44829

mandrenguyen commented 4 months ago

+1 I think this is issue can be closed now after the fix from David: https://github.com/cms-sw/cmssw/pull/44829

cmsbuild commented 4 months ago

This issue is fully signed and ready to be closed.

makortel commented 4 months ago

@cmsbuild, please close