cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Relval 32034.0 step 3: SIGSEGV in ticl::ClusterFilterByAlgoAndSizeAndLayerRange::filter #46698

Closed iarspider closed 5 days ago

iarspider commented 1 week ago

Relval 32034.0 fails with SIGSEGV on step 3 in multiple IBs, e.g. link

Thread 5 (Thread 0x14df839ff700 (LWP 538630) "cmsRun"):
#0  0x000014dfdd1baac1 in poll () from /lib64/libc.so.6
#1  0x000014dfd6f81507 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02863/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_CUDART_X_2024-11-10-2300/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x000014dfd6f81704 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02863/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_CUDART_X_2024-11-10-2300/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x000014df26af67ef in ticl::ClusterFilterByAlgoAndSizeAndLayerRange::filter(std::vector<reco::CaloCluster, std::allocator<reco::CaloCluster> > const&, std::vector<float, std::allocator<float> >&, hgcal::RecHitTools&) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02863/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_14_2_CUDART_X_2024-11-13-2300/lib/el8_amd64_gcc12/pluginRecoHGCalTICLPlugins.so
#5  0x000014df26a4df39 in FilteredLayerClustersProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02863/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_14_2_CUDART_X_2024-11-13-2300/lib/el8_amd64_gcc12/pluginRecoHGCalTICLPlugins.so
#6  0x000014dfdfe56fc3 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02863/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_CUDART_X_2024-11-10-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#7  0x000014dfdfe3e53c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02863/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_CUDART_X_2024-11-10-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#8  0x000014dfdfdc2cf9 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02863/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_CUDART_X_2024-11-10-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#9  0x000014dfdfdc31f1 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02863/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_CUDART_X_2024-11-10-2300/lib/el8_amd64_gcc12/libFWCoreFramework.so
#10 0x000014dfe0010238 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02863/el8_amd64_gcc12/cms/cmssw/CMSSW_14_2_CUDART_X_2024-11-10-2300/lib/el8_amd64_gcc12/libFWCoreConcurrency.so
#11 0x000014dfdf18bb3b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x14dec7fcd300, waiter=..., this=0x14dfdbbc9500) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-1e6f6338afa444f41b680515f944103e/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#12 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x14dfdbbc9500) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-1e6f6338afa444f41b680515f944103e/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#13 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-1e6f6338afa444f41b680515f944103e/tbb-v2021.9.0/src/tbb/arena.cpp:137
#14 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-1e6f6338afa444f41b680515f944103e/tbb-v2021.9.0/src/tbb/market.cpp:599
#15 0x000014dfdf18dcee in tbb::detail::r1::rml::private_worker::run (this=0x14dfd77f4000) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-1e6f6338afa444f41b680515f944103e/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#16 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x14dfd77f4000) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-1e6f6338afa444f41b680515f944103e/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#17 0x000014dfdd4661ca in start_thread () from /lib64/libpthread.so.0
#18 0x000014dfdd0c18d3 in clone () from /lib64/libc.so.6

iarspider commented 1 week ago

assign RecoHGCal/TICL

cmsbuild commented 1 week ago

New categories assigned: reconstruction,upgrade

@jfernan2,@mandrenguyen,@Moanwar,@srimanob,@subirsarkar you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 1 week ago

A new Issue was created by @iarspider.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

smuzaffar commented 1 week ago

@felicepantaleo, could this be due to the change in https://github.com/cms-sw/cmssw/pull/46651?

felicepantaleo commented 1 week ago

I'll run these HFNose workflows with larger statistics and get back to you.

felicepantaleo commented 1 week ago

type HFNose

(not sure this type exists)

felicepantaleo commented 1 week ago

I suspect this is due to the mask being filled only inside the if statement:

https://github.com/cms-sw/cmssw/blob/master/RecoHGCal/TICL/plugins/TrackstersProducer.cc#L205-L210

To make sure, I'll reproduce it locally, apply the fix, and make a PR.
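
For readers unfamiliar with this failure mode, here is a minimal standalone sketch of the suspected pattern. It is not the CMSSW code: `Cluster`, `buildMaskBuggy`, `buildMaskFixed`, `filterBySize`, and the `useExistingMask` flag are hypothetical names chosen for illustration. The assumption is that a per-cluster mask is sized only inside a conditional branch, while a downstream filter indexes it once per cluster.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a layer cluster; only the hit count matters here.
struct Cluster {
  int nHits;
};

// Suspected buggy pattern: the mask is sized only inside the branch,
// so it stays empty whenever the condition is false.
std::vector<float> buildMaskBuggy(std::size_t nClusters, bool useExistingMask) {
  std::vector<float> mask;
  if (useExistingMask) {
    mask.assign(nClusters, 1.f);
  }
  return mask;
}

// Fixed pattern: the mask always has one entry per cluster,
// regardless of which branch is taken afterwards.
std::vector<float> buildMaskFixed(std::size_t nClusters, bool useExistingMask) {
  std::vector<float> mask(nClusters, 1.f);
  if (useExistingMask) {
    // Branch-specific adjustments to the mask would go here.
  }
  return mask;
}

// Consumer that expects mask.size() == clusters.size() and zeroes the mask
// entry of every cluster with fewer than minHits hits. It returns false
// instead of touching the mask when the sizes disagree; an unchecked
// mask[i] in that situation is undefined behaviour and can crash.
bool filterBySize(const std::vector<Cluster>& clusters, std::vector<float>& mask, int minHits) {
  if (mask.size() != clusters.size()) {
    return false;
  }
  for (std::size_t i = 0; i < clusters.size(); ++i) {
    if (clusters[i].nHits < minHits) {
      mask[i] = 0.f;
    }
  }
  return true;
}
```

With three clusters and the condition false, `buildMaskBuggy` returns an empty mask and `filterBySize` rejects it, whereas `buildMaskFixed` returns three entries that the filter can index safely. Without the size check, the buggy path would read and write out of bounds, which would be consistent with a SIGSEGV inside a filter function.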

felicepantaleo commented 1 week ago

I was able to reproduce the crash and test the fix in #46709.

felicepantaleo commented 1 week ago

@smuzaffar can we add 32034.0 to the wf tested in all PRs? It's a good wf to check whether we are breaking something inside the TICL core framework.

mmusich commented 1 week ago

> can we add 32034.0 to the wf tested in all PRs?

If it's useful in general (not only in PR tests, but also as something developers are recommended to check before submitting PRs),

https://github.com/cms-sw/cmssw/blob/15283d2354f36562a6adc97318849ece08035b5b/Configuration/PyReleaseValidation/scripts/runTheMatrix.py#L61

is a good place to start.

smuzaffar commented 1 week ago

@felicepantaleo, sure. If this helps catch errors earlier then, as @mmusich mentioned, we can either add it to the short/limited matrix set or add it to https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config#L11C23-L11C29 for bot PR tests only. Feel free to open a PR for either of these.