Infinite(?) recursion in Basic2DGenericTopoClusterizer::buildTopoCluster() in WF 138.3

dan131riley commented 2 years ago

We're getting segmentation faults in WF 138.3 that don't leave a stack trace, particularly in ASAN builds. With a partial debug build, I get a stack trace with over 17,000 stack frames:

#0  0x00007fb1d6b06de0 in edm::refitem::GetRefPtrImpl<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> >, reco::PFRecHit, edm::refhelper::FindUsingAdvance<std::vector<reco::PFRecHit, std::allocator<reco::PFRecHit> >, reco::PFRecHit>, unsigned int>::getRefPtr_(edm::RefCore const&, unsigned int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_4_ASAN_X_2022-03-25-1100/lib/slc7_amd64_gcc11/libDataFormatsParticleFlowReco.so
#1  0x00007fb1d6b0619d in reco::PFCluster::addRecHitFraction(reco::PFRecHitFraction const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_4_ASAN_X_2022-03-25-1100/lib/slc7_amd64_gcc11/libDataFormatsParticleFlowReco.so
#2  0x00007fb17c15b7ff in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2973, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:108
#3  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2972, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#4  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2971, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#5  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2970, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#6  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2969, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#7  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2968, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#8  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2967, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#9  0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=2966, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
(More stack frames follow...)
#17430 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53726, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17431 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53725, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17432 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53724, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17433 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53723, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17434 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53722, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17435 0x00007fb17c15ba63 in Basic2DGenericTopoClusterizer::buildTopoCluster (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, kcell=53721, used=std::vector<bool> of length 74283, capacity 74304 = {...}, topocluster=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:121
#17436 0x00007fb17c15aebb in Basic2DGenericTopoClusterizer::buildClusters (this=0x60f000ec0530, input=..., rechitMask=std::vector<bool> of length 74283, capacity 74304 = {...}, seedable=std::vector<bool> of length 74283, capacity 74304 = {...}, output=std::vector of length 3, capacity 4 = {...}) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc:67
#17437 0x00007fb17c265574 in PFClusterProducer::produce (this=0x61800096ec80, e=..., es=...) at /mnt/data1/dsr/tmp/CMSSW_12_4_ASAN_X_2022-03-25-1100/src/RecoParticleFlow/PFClusterProducer/plugins/PFClusterProducer.cc:146

cmsbuild commented 2 years ago

A new Issue was created by @dan131riley Dan Riley.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 2 years ago

assign reconstruction

cmsbuild commented 2 years ago

New categories assigned: reconstruction

@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

jpata commented 2 years ago

Thanks, I'm able to reproduce this with the latest ASAN build CMSSW_12_4_ASAN_X_2022-03-25-1100. I'm going to check CMSSW_12_4_ASAN_X_2022-03-21-1100 as well. EDIT: also reproducible in this release.

Are earlier ASAN builds available somewhere to check when the problem first appeared?

dan131riley commented 2 years ago

We don't have earlier ASAN builds, as far as I know. My recollection is that WF 138.3 has been failing in ASAN for at least a few months, but there's previously been no followup because the failures never leave a stack trace.

jpata commented 2 years ago

I looked into this a bit, and it seems buildTopoCluster recurses over all the rechits, which can easily be 50k or more. https://github.com/cms-sw/cmssw/blob/5f75b2d25170c527f84074e9e817231cedd8ba8b/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc#L121

This does not crash the normal release, but it does crash ASAN (possibly a canary in the coalmine). We should look for ways to limit the recursion stack depth here.

Adding some PF people here: @laurenhay @marksan87 @cms-sw/pf-l2

jpata commented 2 years ago

a kind ping on this

jpata commented 2 years ago

type pf

slava77 commented 2 years ago

WF 138.3 is beam splash events. Clustering the full calorimeter could be hard ;)

Are there any memory limits that these deep call stacks are hitting in ASAN?

dan131riley commented 2 years ago

It's probably hitting the stack size limit, and then timing out trying to generate the backtrace

smuzaffar commented 2 years ago

It is not infinite loop but https://github.com/cms-sw/cmssw/blob/5f75b2d25170c527f84074e9e817231cedd8ba8b/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc#L73-L123 is recursively called for over 74K times. For non-ASAN IBs this does not hit the stack size but for ASAN IBs (which uses 2-3 times more memory) it hits the limits and that is why it is crashing.

jpata commented 2 years ago

@laurenhay @cms-sw/pf-l2 is the recursion over all rechits (splash or not) really meaningful here, or can perhaps it be split up across the event?

iarspider commented 1 year ago

@cms-sw/pf-l2 gentle ping on this issue.

iarspider commented 1 year ago

@cms-sw/pf-l2 gentle ping on this issue.

swagata87 commented 1 year ago

From a very quick look, one thing that strikes me is the thresholds for preshower, which is set to 6e-05 GeV. Does this make sense @cms-sw/ecal-dpg-l2 @hatakeyamak ? https://github.com/cms-sw/cmssw/blob/CMSSW_13_2_X/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterPS_cfi.py#L28-L33 For other subdetectors the thresholds look reasonable, and recHits with very low energy gets rejected. But for preshower, it seems from a cout statement that I gave to check, that the algo even tries to cluster recHits of very very low energy like 0.00016608 GeV , 0.000168437 GeV etc. can this threshold for preshower be updated? This is mostly a question for ECAL DPG.

swagata87 commented 1 year ago

Thinking more about beam-splash, it is not guaranteed that PF clustering would work out-of-the box for beam-splashes too. PF clustering techniques are meant for real physics processes. If we want to run clustering also for beam-splashes, then maybe we can create a separate config (via an era perhaps?) and increase the seeding/gathering thresholds to very high values, so that buildTopoCluster is not called too many times.

As a test, the following changes help to avoid the reported crash (tested in CMSSW_13_3_ASAN_X_2023-09-27-2300):

--- a/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterECALUncorrected_cfi.py
+++ b/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterECALUncorrected_cfi.py

     cms.PSet( detector = cms.string("ECAL_ENDCAP"),
-              seedingThreshold = cms.double(0.60),
-              seedingThresholdPt = cms.double(0.15)
+              seedingThreshold = cms.double(100), #0.60
+              seedingThresholdPt = cms.double(20) #0.15

     cms.PSet( detector = cms.string("ECAL_BARREL"),
-              seedingThreshold = cms.double(0.23),
-              seedingThresholdPt = cms.double(0.0)
+              seedingThreshold = cms.double(100), #0.23
+              seedingThresholdPt = cms.double(20)

     cms.PSet( detector = cms.string("ECAL_BARREL"),
-              gatheringThreshold = cms.double(0.08),
+              gatheringThreshold = cms.double(90), #0.08

     cms.PSet( detector = cms.string("ECAL_ENDCAP"),
-              gatheringThreshold = cms.double(0.3),
+              gatheringThreshold = cms.double(90), #0.3

Of course this would be a special setting only for beam-splash, assuming nobody is interested to reconstruct high-level physics objects in beam-splash events.

iarspider commented 10 months ago

@cms-sw/ecal-dpg-l2 @hatakeyamak please comment on the solution proposed in https://github.com/cms-sw/cmssw/issues/37362#issuecomment-1741982958

hatakeyamak commented 10 months ago

I was traveling last week with not in good condition, but I am back. Probably for now some cutoff for topo-cluster for safe guard could make sense.

stahlleiton commented 5 months ago

Is there something pending to close this issue?

cms-sw / cmssw

Infinite(?) recursion in Basic2DGenericTopoClusterizer::buildTopoCluster() in WF 138.3 #37362