Open dan131riley opened 2 years ago
A new Issue was created by @dan131riley Dan Riley.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction
New categories assigned: reconstruction
@jpata,@slava77,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks
Thanks, I'm able to reproduce this with the latest ASAN build CMSSW_12_4_ASAN_X_2022-03-25-1100. I'm going to check CMSSW_12_4_ASAN_X_2022-03-21-1100 as well. EDIT: also reproducible in this release.
Are earlier ASAN builds available somewhere to check when the problem first appeared?
We don't have earlier ASAN builds, as far as I know. My recollection is that WF 138.3 has been failing in ASAN for at least a few months, but there's previously been no followup because the failures never leave a stack trace.
I looked into this a bit, and it seems buildTopoCluster recurses over all the rechits, which can easily be 50k or more.
https://github.com/cms-sw/cmssw/blob/5f75b2d25170c527f84074e9e817231cedd8ba8b/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc#L121
This does not crash the normal release, but it does crash ASAN (possibly a canary in the coal mine). We should look for ways to limit the recursion stack depth here.
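One way to bound the stack depth is to replace the recursion with an explicit work list, so the pending hits live on the heap instead of the call stack. The following is only a minimal sketch of that idea, written in Python for brevity rather than in the C++ of the actual plugin. The names `build_topo_cluster_iterative`, `neighbours`, `passes_threshold` and `used` are hypothetical and only loosely modeled on what the linked function appears to do.

```python
from collections.abc import Mapping, Sequence


def build_topo_cluster_iterative(
    seed: int,
    neighbours: Mapping[int, Sequence[int]],
    passes_threshold: Sequence[bool],
    used: list[bool],
) -> list[int]:
    """Collect all hits connected to `seed` without recursing."""
    cluster: list[int] = []
    pending = [seed]  # explicit work stack, allocated on the heap
    while pending:
        hit = pending.pop()
        # Skip hits already clustered or below the (assumed) gathering threshold.
        if used[hit] or not passes_threshold[hit]:
            continue
        used[hit] = True
        cluster.append(hit)
        for nb in neighbours.get(hit, ()):
            if not used[nb]:
                pending.append(nb)
    return cluster


# Usage: a chain of 100,000 connected hits, far deeper than
# Python's default recursion limit would allow for a recursive version.
n = 100_000
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
cluster = build_topo_cluster_iterative(0, chain, [True] * n, [False] * n)
print(len(cluster))
```

The call depth stays constant no matter how many hits end up in one cluster. The order in which hits are added may differ from the recursive version, which could matter if downstream code depends on it.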
Adding some PF people here: @laurenhay @marksan87 @cms-sw/pf-l2
a kind ping on this
type pf
WF 138.3 is beam splash events. Clustering the full calorimeter could be hard ;)
Are there any memory limits that these deep call stacks are hitting in ASAN?
It's probably hitting the stack size limit, and then timing out trying to generate the backtrace
It is not an infinite loop, but https://github.com/cms-sw/cmssw/blob/5f75b2d25170c527f84074e9e817231cedd8ba8b/RecoParticleFlow/PFClusterProducer/plugins/Basic2DGenericTopoClusterizer.cc#L73-L123 is called recursively over 74K times. For non-ASAN IBs this does not exceed the stack size limit, but ASAN IBs (which use 2-3 times more memory) hit the limit, and that is why it is crashing.
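To get a feeling for the numbers, one can compare the stack limit of a process with the recursion depth. This is a rough sketch only: `frame_budget_bytes` is a hypothetical helper, it assumes the worst case where all calls are nested, and it only reads the main-thread limit (threads created by a framework may use a different stack size).

```python
import resource


def frame_budget_bytes(stack_limit_bytes: int, depth: int) -> int:
    """Largest average frame size that still fits `depth` nested calls."""
    return stack_limit_bytes // depth


# Soft stack limit of the current process (Unix only).
soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
if soft == resource.RLIM_INFINITY:
    print("stack size is unlimited for this process")
else:
    budget = frame_budget_bytes(soft, 74_000)
    print(f"stack limit: {soft} bytes, budget per frame at 74,000 nested calls: {budget} bytes")
```

For example, with an assumed 8 MiB limit the budget at 74,000 nested calls is about 113 bytes per frame, so any instrumentation that enlarges the frames reduces the depth that fits.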
@laurenhay @cms-sw/pf-l2 is the recursion over all rechits (splash or not) really meaningful here, or could it perhaps be split up across the event?
@cms-sw/pf-l2 gentle ping on this issue.
From a very quick look, one thing that strikes me is the threshold for the preshower, which is set to 6e-05 GeV. Does this make sense @cms-sw/ecal-dpg-l2 @hatakeyamak ? https://github.com/cms-sw/cmssw/blob/CMSSW_13_2_X/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterPS_cfi.py#L28-L33 For other subdetectors the thresholds look reasonable, and recHits with very low energy get rejected. But for the preshower, a cout statement that I added as a check shows that the algo even tries to cluster recHits of very low energy, such as 0.00016608 GeV, 0.000168437 GeV, etc. Can this threshold for the preshower be updated? This is mostly a question for ECAL DPG.
Thinking more about beam-splash, it is not guaranteed that PF clustering would work out of the box for beam-splash events too. PF clustering techniques are meant for real physics processes. If we want to run clustering for beam-splash events as well, then maybe we can create a separate config (via an era perhaps?) and increase the seeding/gathering thresholds to very high values, so that buildTopoCluster is not called too many times.
As a test, the following changes help to avoid the reported crash (tested in CMSSW_13_3_ASAN_X_2023-09-27-2300):
--- a/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterECALUncorrected_cfi.py
+++ b/RecoParticleFlow/PFClusterProducer/python/particleFlowClusterECALUncorrected_cfi.py
cms.PSet( detector = cms.string("ECAL_ENDCAP"),
- seedingThreshold = cms.double(0.60),
- seedingThresholdPt = cms.double(0.15)
+ seedingThreshold = cms.double(100), #0.60
+ seedingThresholdPt = cms.double(20) #0.15
cms.PSet( detector = cms.string("ECAL_BARREL"),
- seedingThreshold = cms.double(0.23),
- seedingThresholdPt = cms.double(0.0)
+ seedingThreshold = cms.double(100), #0.23
+ seedingThresholdPt = cms.double(20)
cms.PSet( detector = cms.string("ECAL_BARREL"),
- gatheringThreshold = cms.double(0.08),
+ gatheringThreshold = cms.double(90), #0.08
cms.PSet( detector = cms.string("ECAL_ENDCAP"),
- gatheringThreshold = cms.double(0.3),
+ gatheringThreshold = cms.double(90), #0.3
Of course this would be a special setting only for beam-splash, assuming nobody is interested in reconstructing high-level physics objects in beam-splash events.
@cms-sw/ecal-dpg-l2 @hatakeyamak please comment on the solution proposed in https://github.com/cms-sw/cmssw/issues/37362#issuecomment-1741982958
I was traveling last week and was not in good shape, but I am back. For now, some cutoff on the topo-cluster as a safeguard could probably make sense.
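One possible reading of such a cutoff is a guard on the recursion depth: the cluster simply stops growing once the guard is reached, and the remaining hits stay unused so that later seeds can pick them up. This is a sketch under that assumption, in Python rather than the plugin's C++. The function name, the `max_depth` parameter and its default value are hypothetical and not taken from CMSSW.

```python
def build_topo_cluster_capped(seed, neighbours, used, cluster, depth=0, max_depth=500):
    """Recursive cluster growth that refuses to descend past `max_depth`.

    Returns True if the cutoff was reached anywhere below this call.
    """
    used[seed] = True
    cluster.append(seed)
    if depth >= max_depth:
        return True  # safeguard: do not recurse any deeper
    truncated = False
    for nb in neighbours.get(seed, ()):
        if not used[nb]:
            truncated |= build_topo_cluster_capped(
                nb, neighbours, used, cluster, depth + 1, max_depth
            )
    return truncated


# Usage: a chain of 2,000 connected hits is cut off after 501 hits.
n = 2_000
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
used = [False] * n
cluster = []
hit_cutoff = build_topo_cluster_capped(0, chain, used, cluster)
print(len(cluster), hit_cutoff)
```

The trade-off is that a large connected region can be split into several topo-clusters, which may be acceptable for beam-splash events but would need checking for normal collisions.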
Is there something pending to close this issue?
We're getting segmentation faults in WF 138.3 that don't leave a stack trace, particularly in ASAN builds. With a partial debug build, I get a stack trace with over 17,000 stack frames: