cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

[GPU] Illegal memory access in workflow 12434.523 (Patatrack+HCALOnly) #42523

Open makortel opened 1 year ago

makortel commented 1 year ago

Workflow 12434.523 step 2 crashed in CMSSW_13_3_GPU_X_2023-08-08-2300 on el8_amd64_gcc11 + NVIDIA A100-PCIE-40GB with

terminate called after throwing an instance of 'std::runtime_error'
  what():  
/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/3fb10f2f057411d8a2d4a1a66a99843d/opt/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/src/CalibTracker/SiPixelESProducers/src/SiPixelGainCalibrationForHLTGPU.cc, line 80:
cudaCheck(cudaFree(gainForHLTonGPU));
cudaErrorIllegalAddress: an illegal memory access was encountered

Thread 7 (Thread 0x2b8085a03700 (LWP 28598) "cmsRun"):
#2  0x00002b7ffce1cb10 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  get_cie (f=<optimized out>) at ../../../libgcc/unwind-dw2-fde.h:157
#5  get_fde_encoding (f=0x2b7ff88a3598) at ../../../libgcc/unwind-dw2-fde.c:350
#6  _Unwind_IteratePhdrCallback (info=<optimized out>, size=<optimized out>, ptr=0x2b80859fd100) at ../../../libgcc/unwind-dw2-fde-dip.c:419
#7  0x00002b7ffae95b87 in dl_iterate_phdr () from /lib64/libc.so.6
#8  0x00002b7ffab06ed1 in _Unwind_Find_FDE (pc=0x2b7ff878d481 <edm::Path::workerFinished(std::__exception_ptr::exception_ptr const*, unsigned int, edm::EventTransitionInfo const&, edm::ServiceToken const&, edm::StreamID const&, edm::StreamContext const*, tbb::detail::d1::task_group&)+817>, bases=bases@entry=0x2b80859fd358) at ../../../libgcc/unwind-dw2-fde-dip.c:470
#9  0x00002b7ffab02fa8 in uw_frame_state_for (context=0x2b80859fd2b0, fs=0x2b80859fd3a0) at ../../../libgcc/unwind-dw2.c:1263
#10 0x00002b7ffab048e2 in _Unwind_RaiseException (exc=0x9e03ba20) at ../../../libgcc/unwind.inc:104
#11 0x00002b7ffa60d152 in std::rethrow_exception (ep=...) at ../../../../libstdc++-v3/libsupc++/eh_ptr.cc:212
#12 0x00002b7ff878d482 in edm::Path::workerFinished(std::__exception_ptr::exception_ptr const*, unsigned int, edm::EventTransitionInfo const&, edm::ServiceToken const&, edm::StreamID const&, edm::StreamContext const*, tbb::detail::d1::task_group&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#13 0x00002b7ff878d556 in edm::FunctorWaitingTask<edm::Path::runNextWorkerAsync(unsigned int, edm::EventTransitionInfo const&, edm::ServiceToken const&, edm::StreamID const&, edm::StreamContext const*, tbb::detail::d1::task_group&)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#14 0x00002b7ff85c2f79 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreConcurrency.so

Thread 6 (Thread 0x2b8085002700 (LWP 28597) "cmsRun"):
#2  0x00002b7ffce1cb10 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  std::ostream::_M_insert<unsigned long long> (this=0x2b8084ffc4a0, __v=101) at /data/cmsbld/jenkins/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/gcc-11.4.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/ostream.tcc:63
#5  0x00002b7ff893e36a in edm::operator<<(std::ostream&, edm::EventID const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libDataFormatsProvenance.so
#6  0x00002b7ff847b136 in edm::exceptionContext(std::ostream&, edm::StreamContext const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreServiceRegistry.so
#7  0x00002b7ff8478644 in edm::exceptionContext(cms::Exception&, edm::ModuleCallingContext const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreServiceRegistry.so
#8  0x00002b7ff86e4c16 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) [clone .cold] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#9  0x00002b7ff8789948 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 5 (Thread 0x2b8084601700 (LWP 28596) "cmsRun"):
#2  0x00002b7ffce1cb10 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00002b7ffad7bd91 in sigprocmask () from /lib64/libc.so.6
#5  0x00002b7ffad4ee39 in abort () from /lib64/libc.so.6
#6  0x00002b7ffa60f0ea in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:50
#7  0x00002b7ffa60d16a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#8  0x00002b7ffa60c1c9 in __cxa_call_terminate (ue_header=0xe2cba330) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#9  0x00002b7ffa60c8f7 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=0xe2cba330, context=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:685
#10 0x00002b7ffab04384 in _Unwind_RaiseException_Phase2 (exc=0xe2cba330, context=0x2b80845fae10, frames_p=0x2b80845fad18) at ../../../libgcc/unwind.inc:64
#11 0x00002b7ffab04dbe in _Unwind_Resume (exc=0xe2cba330) at ../../../libgcc/unwind.inc:241
#12 0x00002b805063c768 in cms::cuda::abortOnCudaError(char const*, int, char const*, char const*, char const*, std::basic_string_view<char, std::char_traits<char> >) [clone .constprop.0] [clone .cold] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libHeterogeneousCoreCUDACore.so
#13 0x00002b805063d8e7 in cms::cuda::impl::ScopedContextHolderHelper::enqueueCallback(int, CUstream_st*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libHeterogeneousCoreCUDACore.so
#14 0x00002b805063d911 in cms::cuda::ScopedContextAcquire::~ScopedContextAcquire() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libHeterogeneousCoreCUDACore.so
#15 0x00002b80810f23c0 in HcalDigisProducerGPU::acquire(edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder) [clone .cold] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginEventFilterHcalRawToDigiGPUPlugins.so
#16 0x00002b7ff8814338 in edm::stream::doAcquireIfNeeded(edm::stream::impl::ExternalWork*, edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#17 0x00002b7ff88184e2 in edm::stream::EDProducerAdaptorBase::doAcquire(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#18 0x00002b7ff87f5111 in edm::Worker::runAcquire(edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#19 0x00002b7ff87f5286 in edm::Worker::runAcquireAfterAsyncPrefetch(std::__exception_ptr::exception_ptr, edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so
#20 0x00002b7ff878a42f in edm::Worker::AcquireTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>, void>::execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreFramework.so

Thread 1 (Thread 0x2b7ffbb97580 (LWP 28430) "cmsRun"):
#3  0x00002b7ffce2034b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b7ffad7bacf in raise () from /lib64/libc.so.6
#6  0x00002b7ffad4eea5 in abort () from /lib64/libc.so.6
#7  0x00002b7ffa60194a in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#8  0x00002b7ffa60d16a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00002b7ffa60c1c9 in __cxa_call_terminate (ue_header=0xdc722180) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#10 0x00002b7ffa60c8f7 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=0xdc722180, context=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:685
#11 0x00002b7ffab04384 in _Unwind_RaiseException_Phase2 (exc=0xdc722180, context=0x7ffde5491930, frames_p=0x7ffde5491838) at ../../../libgcc/unwind.inc:64
#12 0x00002b7ffab04dbe in _Unwind_Resume (exc=0xdc722180) at ../../../libgcc/unwind.inc:241
#13 0x00002b8081a7a239 in cms::cuda::abortOnCudaError(char const*, int, char const*, char const*, char const*, std::basic_string_view<char, std::char_traits<char> >) [clone .cold] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libCalibTrackerSiPixelESProducers.so
#14 0x00002b8081a81825 in SiPixelGainCalibrationForHLTGPU::GPUData::~GPUData() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libCalibTrackerSiPixelESProducers.so
#15 0x00002b8081a7a25b in std::vector<cms::cuda::ESProduct<SiPixelGainCalibrationForHLTGPU::GPUData>::Item, std::allocator<cms::cuda::ESProduct<SiPixelGainCalibrationForHLTGPU::GPUData>::Item> >::~vector() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libCalibTrackerSiPixelESProducers.so
#16 0x00002b8081a7a2f5 in SiPixelGainCalibrationForHLTGPU::SiPixelGainCalibrationForHLTGPU(SiPixelGainCalibrationForHLT const&, TrackerGeometry const&) [clone .cold] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libCalibTrackerSiPixelESProducers.so
#17 0x00002b80819ecfa5 in SiPixelGainCalibrationForHLTGPUESProducer::produce(SiPixelGainCalibrationForHLTGPURcd const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginCalibTrackerSiPixelESProducersPlugins.so
#18 0x00002b80819e952f in edm::eventsetup::CallbackBase<edm::ESProducer, edm::ESProducer::setWhatProduced<SiPixelGainCalibrationForHLTGPUESProducer, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> >, SiPixelGainCalibrationForHLTGPURcd, edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> >(SiPixelGainCalibrationForHLTGPUESProducer*, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> > (SiPixelGainCalibrationForHLTGPUESProducer::*)(SiPixelGainCalibrationForHLTGPURcd const&), edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> const&, edm::es::Label const&)::{lambda(SiPixelGainCalibrationForHLTGPURcd const&)#1}, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> >, SiPixelGainCalibrationForHLTGPURcd, edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> >::makeProduceTask<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<SiPixelGainCalibrationForHLTGPUESProducer, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> >, SiPixelGainCalibrationForHLTGPURcd, edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> >(SiPixelGainCalibrationForHLTGPUESProducer*, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> > (SiPixelGainCalibrationForHLTGPUESProducer::*)(SiPixelGainCalibrationForHLTGPURcd const&), edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> const&, edm::es::Label const&)::{lambda(SiPixelGainCalibrationForHLTGPURcd const&)#1}, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> >, SiPixelGainCalibrationForHLTGPURcd, edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> 
>::prefetchAsync(edm::WaitingTaskHolder, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&, edm::ESParentContext const&)::{lambda(auto:1&&, auto:2&&, auto:3&&, auto:4&&)#1}::operator()<tbb::detail::d1::task_group*&, edm::ServiceWeakToken&, edm::eventsetup::EventSetupRecordImpl const*&, edm::EventSetupImpl const*&>(tbb::detail::d1::task_group*&, edm::ServiceWeakToken&, edm::eventsetup::EventSetupRecordImpl const*&, edm::EventSetupImpl const*&) const::{lambda(SiPixelGainCalibrationForHLTGPURcd const&)#1}>(tbb::detail::d1::task_group*, edm::ServiceWeakToken const&, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, bool, tbb::detail::d1::task_group*&)::{lambda(std::__exception_ptr::exception_ptr const*)#1}::operator()(std::__exception_ptr::exception_ptr const*) const::{lambda()#2}::operator()() const () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginCalibTrackerSiPixelESProducersPlugins.so
#19 0x00002b80819e9738 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::eventsetup::CallbackBase<edm::ESProducer, edm::ESProducer::setWhatProduced<SiPixelGainCalibrationForHLTGPUESProducer, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> >, SiPixelGainCalibrationForHLTGPURcd, edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> >(SiPixelGainCalibrationForHLTGPUESProducer*, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> > (SiPixelGainCalibrationForHLTGPUESProducer::*)(SiPixelGainCalibrationForHLTGPURcd const&), edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> const&, edm::es::Label const&)::{lambda(SiPixelGainCalibrationForHLTGPURcd const&)#1}, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> >, SiPixelGainCalibrationForHLTGPURcd, edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> >::makeProduceTask<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<SiPixelGainCalibrationForHLTGPUESProducer, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> >, SiPixelGainCalibrationForHLTGPURcd, edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> >(SiPixelGainCalibrationForHLTGPUESProducer*, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> > (SiPixelGainCalibrationForHLTGPUESProducer::*)(SiPixelGainCalibrationForHLTGPURcd const&), edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> const&, edm::es::Label const&)::{lambda(SiPixelGainCalibrationForHLTGPURcd const&)#1}, std::unique_ptr<SiPixelGainCalibrationForHLTGPU, std::default_delete<SiPixelGainCalibrationForHLTGPU> >, SiPixelGainCalibrationForHLTGPURcd, 
edm::eventsetup::CallbackSimpleDecorator<SiPixelGainCalibrationForHLTGPURcd> >::prefetchAsync(edm::WaitingTaskHolder, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&, edm::ESParentContext const&)::{lambda(auto:1&&, auto:2&&, auto:3&&, auto:4&&)#1}::operator()<tbb::detail::d1::task_group*&, edm::ServiceWeakToken&, edm::eventsetup::EventSetupRecordImpl const*&, edm::EventSetupImpl const*&>(tbb::detail::d1::task_group*&, edm::ServiceWeakToken&, edm::eventsetup::EventSetupRecordImpl const*&, edm::EventSetupImpl const*&) const::{lambda(SiPixelGainCalibrationForHLTGPURcd const&)#1}>(tbb::detail::d1::task_group*, edm::ServiceWeakToken const&, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, bool, tbb::detail::d1::task_group*&)::{lambda(std::__exception_ptr::exception_ptr const*)#1}::operator()(std::__exception_ptr::exception_ptr const*) const::{lambda()#2}>(tbb::detail::d1::task_group&, tbb::detail::d1::task_group*&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/pluginCalibTrackerSiPixelESProducersPlugins.so
#20 0x00002b7ff85c4099 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02797/el8_amd64_gcc11/cms/cmssw/CMSSW_13_3_GPU_X_2023-08-07-2300/lib/el8_amd64_gcc11/libFWCoreConcurrency.so

Current Modules:
Module: none (crashed)
timeout: the monitored command dumped core

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_3_GPU_X_2023-08-08-2300/pyRelValMatrixLogs/run/12434.523_TTbar_14TeV+2023_Patatrack_HCALOnlyGPU_Validation/step2_TTbar_14TeV+2023_Patatrack_HCALOnlyGPU_Validation.log#/

makortel commented 1 year ago

assign reconstruction, heterogeneous

FYI @cms-sw/hcal-dpg-l2

cmsbuild commented 1 year ago

New categories assigned: heterogeneous,reconstruction

@fwyzard,@clacaputo,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 1 year ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 year ago

FYI @cms-sw/trk-dpg-l2

dan131riley commented 1 year ago

Threads 1 and 5 were both in abortOnCudaError() (and thread 7 was re-throwing something or another).

makortel commented 1 year ago

> Threads 1 and 5 were both in abortOnCudaError() (and thread 7 was re-throwing something or another).

I'd bet that thread 7 was re-throwing the same exception, and that the crash itself was caused by a second exception thrown during the stack unwinding triggered by the original one. So the stack traces give no actual hint of where exactly the illegal memory access occurred.

jfernan2 commented 11 months ago

@abdoulline @igv4321 as HCAL reco contacts, any hint about this issue? Thank you in advance

fwyzard commented 11 months ago

Is the problem reproducible?