cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

[GPU] Segfault in HLTEcalRecHitInAllL1RegionsProducer::produce() #42524

Open makortel opened 1 year ago

makortel commented 1 year ago

Workflow 12434.593 step 2 segfaulted in CMSSW_13_3_GPU_X_2023-08-08-2300 on el8_amd64_gcc11 + NVIDIA A100-PCIE-40GB

Thread 7 (Thread 0x2ba7e247f700 (LWP 26162) "cmsRun"):
#3  0x00002ba75971134b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002ba802786be5 in HLTRecHitInAllL1RegionsProducer<EcalRecHit>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginRecoEgammaEgammaHLTProducersPlugins.so
#6  0x00002ba7552482bd in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#7  0x00002ba75522ea22 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 6 (Thread 0x2ba7e1a7e700 (LWP 26161) "cmsRun"):
#2  0x00002ba75970db10 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002ba75787ca5f in write () from /lib64/libc.so.6
#5  0x00002ba7577ee9ed in _IO_file_write@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6  0x00002ba7577edd5f in new_do_write () from /lib64/libc.so.6
#7  0x00002ba7577ef11e in __GI__IO_file_xsputn () from /lib64/libc.so.6
#8  0x00002ba7577e41ac in fwrite () from /lib64/libc.so.6
#9  0x00002ba7570c13c4 in std::basic_streambuf<char, std::char_traits<char> >::sputn (__n=1, __s=0x20d3538 "\n", this=<optimized out>) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/gcc-11.4.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/streambuf:455
#10 std::__ostream_write<char, std::char_traits<char> > (__n=1, __s=0x20d3538 "\n", __out=...) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/gcc-11.4.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/ostream_insert.h:51
#11 std::__ostream_insert<char, std::char_traits<char> > (__out=..., __s=0x20d3538 "\n", __n=1) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/gcc-11.4.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/ostream_insert.h:102
#12 0x00002ba758adf962 in edm::service::ELoutput::emitToken(std::basic_string_view<char, std::char_traits<char> >, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreMessageService.so
#13 0x00002ba758ae0bf9 in edm::service::ELoutput::log(edm::ErrorObj const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreMessageService.so
#14 0x00002ba758adfd05 in edm::service::ELadministrator::log(edm::ErrorObj&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreMessageService.so
#15 0x00002ba758aeaed5 in edm::service::ThreadSafeLogMessageLoggerScribe::log(edm::ErrorObj*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreMessageService.so
#16 0x00002ba758af20f3 in edm::service::ThreadSafeLogMessageLoggerScribe::runCommand(edm::MessageLoggerQ::OpCode, void*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreMessageService.so
#17 0x00002ba754fe29e8 in edm::MessageSender::ErrorObjDeleter::operator()(edm::ErrorObj*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreMessageLogger.so
#18 0x00002ba754fe2ea1 in std::_Sp_counted_deleter<edm::ErrorObj*, edm::MessageSender::ErrorObjDeleter, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreMessageLogger.so
#19 0x00002ba754fde60a in edm::MessageSender::~MessageSender() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreMessageLogger.so
#20 0x00002ba7e7b5fe31 in PFECALSuperClusterAlgo::buildSuperClusterMustacheOrBox(CalibratedPFCluster&, std::vector<CalibratedPFCluster, std::allocator<CalibratedPFCluster> >&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libRecoEcalEgammaClusterAlgos.so
#21 0x00002ba7e7b60fa7 in PFECALSuperClusterAlgo::buildAllSuperClustersMustacheOrBox(std::vector<CalibratedPFCluster, std::allocator<CalibratedPFCluster> >&, double) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libRecoEcalEgammaClusterAlgos.so
#22 0x00002ba7e7b610a8 in PFECALSuperClusterAlgo::run() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libRecoEcalEgammaClusterAlgos.so
#23 0x00002ba8071156e6 in PFECALSuperClusterProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginRecoEcalEgammaClusterProducers.so
#24 0x00002ba7552482bd in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#25 0x00002ba75522ea22 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 5 (Thread 0x2ba7e107d700 (LWP 26160) "cmsRun"):
#2  0x00002ba75970db10 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002ba7ae8535fb in DetId::subdetId() const [clone .isra.0] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libCondFormatsHcalObjects.so
#5  0x00002ba7ae7ac2cf in HcalCondObjectContainerBase::indexFor(DetId) const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libCondFormatsHcalObjects.so
#6  0x00002ba8052768d4 in HcalCondObjectContainer<HcalChannelStatus>::getValues(DetId, bool) const [clone .constprop.0] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginRecoLocalCaloCaloTowersCreator.so
#7  0x00002ba80525b55d in CaloTowersCreationAlgo::makeHcalDropChMap() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginRecoLocalCaloCaloTowersCreator.so
#8  0x00002ba805266a2c in CaloTowersCreator::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginRecoLocalCaloCaloTowersCreator.so
#9  0x00002ba7552482bd in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00002ba75522ea22 in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 1 (Thread 0x2ba7585c7580 (LWP 25985) "cmsRun"):
#2  0x00002ba75970db10 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007ffe7c34b6c2 in clock_gettime ()
#5  0x00002ba7578514da in clock_gettime@GLIBC_2.2.5 () from /lib64/libc.so.6
#6  0x00002ba75705f0c5 in std::chrono::_V2::steady_clock::now () at ../../../../../libstdc++-v3/src/c++11/chrono.cc:88
#7  0x00002ba7597304f1 in edm::service::Timing::postCommon() const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#8  0x00002ba7597306ab in edm::service::Timing::postModuleEvent(edm::StreamContext const&, edm::ModuleCallingContext const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#9  0x00002ba75523420b in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#10 0x00002ba75522df82 in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#11 0x00002ba7551b949a in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#12 0x00002ba7551b958d in std::__exception_ptr::exception_ptr edm::Worker::runModuleDirectly<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#13 0x00002ba7551b96c2 in edm::Path::finished(std::__exception_ptr::exception_ptr, edm::StreamContext const*, edm::EventTransitionInfo const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#14 0x00002ba7551bd346 in edm::Path::workerFinished(std::__exception_ptr::exception_ptr const*, unsigned int, edm::EventTransitionInfo const&, edm::ServiceToken const&, edm::StreamID const&, edm::StreamContext const*, tbb::detail::d1::task_group&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#15 0x00002ba7551bd556 in edm::FunctorWaitingTask<edm::Path::runNextWorkerAsync(unsigned int, edm::EventTransitionInfo const&, edm::ServiceToken const&, edm::StreamID const&, edm::StreamContext const*, tbb::detail::d1::task_group&)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Current Modules:
Module: HLTEcalRecHitInAllL1RegionsProducer:hltRechitInRegionsECAL (crashed)
Module: PFECALSuperClusterProducer:hltParticleFlowSuperClusterECALL1Seeded
Module: PathStatusInserter:HLT_PFHT330PT30_QuadPFJet_75_60_45_40_v13
Module: CaloTowersCreator:hltTowerMakerForAllCPUOnly

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_3_GPU_X_2023-08-08-2300/pyRelValMatrixLogs/run/12434.593_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation/step2_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation.log#/

cmsbuild commented 1 year ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

assign reconstruction, hlt, heterogeneous

FYI @cms-sw/ecal-dpg-l2

cmsbuild commented 1 year ago

New categories assigned: heterogeneous,hlt,reconstruction

@missirol,@fwyzard,@clacaputo,@makortel,@mandrenguyen,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

missirol commented 1 year ago

Upon first look, I'm pretty confused. I ran wf 12434.593 with CMSSW_13_3_GPU_X_2023-08-08-2300 on gpu-c2a02-39-04.cms [0]: no crash, and no log-errors in step2.

This (no crash, no log-errors) seems to match the results of a previous GPU IB [1]. On the other hand, the crash in [2] comes after a long list of log-error messages such as [3].

[0]

CUDA runtime version 11.8, driver version 12.2, NVIDIA driver version 535.86.10
CUDA device 0: Tesla T4 (sm_75)

[1] https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_3_GPU_X_2023-08-07-2300/pyRelValMatrixLogs/run/12434.593_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation/step2_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation.log

[2] https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_3_GPU_X_2023-08-08-2300/pyRelValMatrixLogs/run/12434.593_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation/step2_TTbar_14TeV+2023_Patatrack_FullRecoGPU_Validation.log

[3]

%MSG-e EcalRecHitError:  EcalRecHitProducer:hltEcalRecHit  09-Aug-2023 09:49:08 CEST Run: 1 Event: 105
No intercalib const found for xtal 0! something wrong with EcalIntercalibConstants in your DB? 
%MSG
%MSG-e EcalLaserDbService:  EcalRecHitProducer:hltEcalRecHit  09-Aug-2023 09:49:08 CEST Run: 1 Event: 105
 DetId is NOT in ECAL
%MSG

makortel commented 1 year ago

I'd guess this is one of those random crashes. On a quick look I didn't see any relevant changes between CMSSW_13_3_GPU_X_2023-08-08-2300 and CMSSW_13_3_X_2023-08-07-2300 (in the latter, all workflows succeeded). In CMSSW_13_3_X_2023-08-07-2300 the 12434.593 step 2 log did not contain the "No intercalib const found" messages.
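
The log comparison described above can be sketched as a small script. This is only an illustration: it assumes the record header format seen in excerpt [3] (`%MSG-e <category>:  <module>  <timestamp> ...`), and the names `count_messages`, `new_categories`, and `sample` are hypothetical helpers, not part of CMSSW or cms-bot.

```python
import re
from collections import Counter

# Header line of a MessageLogger record as it appears in excerpt [3], e.g.
# "%MSG-e EcalRecHitError:  EcalRecHitProducer:hltEcalRecHit  09-Aug-2023 ..."
# Assumption: the character after "%MSG-" encodes the severity ("e" = error).
MSG_HEADER = re.compile(
    r"^%MSG-(?P<severity>\w) (?P<category>[^:\s]+):\s+(?P<module>\S+)"
)


def count_messages(log_text, severity="e"):
    """Count records of the given severity per (category, module) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = MSG_HEADER.match(line)
        if match and match.group("severity") == severity:
            counts[(match.group("category"), match.group("module"))] += 1
    return counts


def new_categories(good_log, bad_log, severity="e"):
    """Return the (category, module) pairs that appear only in bad_log."""
    good = set(count_messages(good_log, severity))
    bad = set(count_messages(bad_log, severity))
    return sorted(bad - good)


# The two records quoted in [3], used here as a stand-in for the failing log.
sample = """\
%MSG-e EcalRecHitError:  EcalRecHitProducer:hltEcalRecHit  09-Aug-2023 09:49:08 CEST Run: 1 Event: 105
No intercalib const found for xtal 0! something wrong with EcalIntercalibConstants in your DB?
%MSG
%MSG-e EcalLaserDbService:  EcalRecHitProducer:hltEcalRecHit  09-Aug-2023 09:49:08 CEST Run: 1 Event: 105
 DetId is NOT in ECAL
%MSG
"""
```

Calling `new_categories(good_log_text, bad_log_text)` with the contents of the two step 2 logs would list the error categories that show up only in the failing IB, which is a quicker check than reading both logs by eye.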

thomreis commented 1 year ago

A DetId of 0 should not exist, but where it came from is hard to say if the crash is not reproducible.
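
To make the "DetId 0" point concrete, here is a minimal standalone sketch of why a raw id of 0 cannot belong to ECAL. It assumes the usual `DetId` packing (detector in bits 28-31, subdetector in bits 25-27) and the enum values `Ecal = 3`, `EcalBarrel = 1`, `EcalEndcap = 2`; these constants are written from memory and should be checked against the DataFormats/DetId headers. The function names are hypothetical and this is not the check performed by `EcalRecHitProducer`.

```python
# Assumed DetId packing: detector in bits 28-31, subdetector in bits 25-27.
DET_OFFSET = 28
DET_MASK = 0xF
SUBDET_OFFSET = 25
SUBDET_MASK = 0x7

ECAL = 3  # assumed value of DetId::Ecal
ECAL_RECHIT_SUBDETS = {1: "EcalBarrel", 2: "EcalEndcap"}  # assumed enum values


def decode_det_id(raw_id):
    """Split a raw 32-bit id into (detector, subdetector) fields."""
    det = (raw_id >> DET_OFFSET) & DET_MASK
    subdet = (raw_id >> SUBDET_OFFSET) & SUBDET_MASK
    return det, subdet


def is_ecal_rechit_id(raw_id):
    """True if raw_id is non-null and decodes to ECAL barrel or endcap."""
    det, subdet = decode_det_id(raw_id)
    return raw_id != 0 and det == ECAL and subdet in ECAL_RECHIT_SUBDETS


# A raw id of 0 decodes to detector 0 and subdetector 0, so it fails the
# check, consistent with the "DetId is NOT in ECAL" message in the log.
null_id = 0
barrel_like_id = (ECAL << DET_OFFSET) | (1 << SUBDET_OFFSET) | 1234
```

Under these assumptions `is_ecal_rechit_id(null_id)` is false and `is_ecal_rechit_id(barrel_like_id)` is true, so a guard of this kind would flag a zeroed rechit id before any conditions lookup is attempted.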