Open Dr15Jones opened 1 year ago
assign core
@wddgit FYI
A new Issue was created by @Dr15Jones Chris Jones.
The sequence should be:
If the Service summaries are printing in endJob, this looks like intentional Framework behavior. If the summaries are printing in the Service destructors, then something is wrong in the Framework. The last IOVs get ended by a sentry that goes out of scope just after endJob in cmsRun.cpp. The failure does seem to be occurring while waiting for the last IOVs to end. I don't know why. Maybe a tensorflow problem???
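The ordering described above can be illustrated with a small RAII sketch. This is not the code in cmsRun.cpp: the class EndIOVSentry, the function runJob, and the logged strings are all hypothetical names chosen for illustration. It only shows that work done in endJob happens before a scope-exit sentry fires, while anything done in destructors that run later (assumed here to include the Services) happens after.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a sentry that ends the last IOVs on scope exit.
class EndIOVSentry {
public:
  explicit EndIOVSentry(std::vector<std::string>& log) : log_(log) {}
  ~EndIOVSentry() { log_.push_back("end last IOVs"); }
  EndIOVSentry(const EndIOVSentry&) = delete;
  EndIOVSentry& operator=(const EndIOVSentry&) = delete;

private:
  std::vector<std::string>& log_;
};

// Hypothetical job driver: endJob runs inside the sentry's scope.
std::vector<std::string> runJob() {
  std::vector<std::string> log;
  {
    EndIOVSentry sentry(log);
    log.push_back("process events");
    log.push_back("endJob: print Service summaries");
  }  // the sentry destructor runs here, just after endJob
  // Assumption for illustration: Services are destroyed after the sentry.
  log.push_back("destroy Services");
  return log;
}
```

In this sketch, summaries printed from endJob necessarily appear before the IOVs end, which matches the behavior described as intentional.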
I see tensorflow::CancellationManager::StartCancel() makes use of TensorFlow's mutex (https://github.com/tensorflow/tensorflow/blob/v2.6.4/tensorflow/core/framework/cancellation.cc#L38-L82) that we've seen before to be unreliable on ARM.
I'm not sure of the relevance for this problem though, as the stack trace shows only one thread in TensorFlow. But maybe there were multiple threads earlier that led to corruption of the CancellationManager's state?
What level of concurrency do we have in the IOV cleanup?
I'll look more carefully tomorrow. Based on a quick look and my memory of how this works, the IOVs are all ending concurrently.
I don't recall anything that waits until an IOV actually ends, except that future IOVs will not start until there are open slots. So in theory there could be multiple IOVs ending per record, although the vast majority of the time only the last IOV would be open for each record.
After looking more carefully, I do not see any Framework problems in this job. The summary messages are all printing in endJob and then the IOVs are closing. The Framework is waiting for the IOVs to close when the seg fault occurs. The Framework seems to be behaving correctly.
The IOVs close concurrently. I can see that from the code and also in the stack traces we can see the IOV cache objects being deleted at the same time on multiple threads (~CaloTPGTranscoderULUT, ~TfGraphDefWrapper and std::default_delete
I have not looked in tensorflow now (or ever). Matti, I won't look in there unless you ask me to. I've got no expertise in tensorflow. It seems curious startCancel appears twice at the end of the stack trace but it is possible that is normal. Maybe it is the mutex you mentioned... I don't know.
Another comment, unrelated to this issue. In this job there is only 1 luminosity block, so there is only 1 IOV per record, and what I am about to say is not relevant for this seg fault. But I just noticed that when we migrated to using group there were changes in how we handle waiting for the IOVs to end. The current code correctly waits for the last IOV for each record to end, but it no longer waits for the other IOVs to end. In practice it is probably exceedingly rare, but I see nothing to prevent the next-to-last IOV from still being in the process of ending when the wait ends. Before PR #32804 we were waiting for all of them with waitForIOVsInFlight_, but we don't anymore. Hitting a problem with this would require multiple IOVs in flight, with the next-to-last IOV for a record still in the process of ending when the wait ended, and then bad things might happen. Probably this should be fixed, although the probability might be so low that the problem does not occur practically. Or maybe I am missing something...
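To make the "wait for all IOVs in flight" idea concrete, here is a minimal sketch of one way such a wait could look. It is not the Framework's implementation and not what waitForIOVsInFlight_ did: the class InFlightIOVs and the function endAllConcurrently are hypothetical, and plain std::thread stands in for the task system. The point is only that the wait returns when the count of IOVs still ending reaches zero, rather than when the last-started IOV ends.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical counter of IOVs that have started but not yet finished ending.
class InFlightIOVs {
public:
  void start() {
    std::lock_guard<std::mutex> lock(m_);
    ++count_;
  }
  void end() {
    std::lock_guard<std::mutex> lock(m_);
    if (--count_ == 0) {
      cv_.notify_all();
    }
  }
  // Blocks until every started IOV has ended, not just the most recent one.
  void waitForAll() {
    std::unique_lock<std::mutex> lock(m_);
    cv_.wait(lock, [this] { return count_ == 0; });
  }

private:
  std::mutex m_;
  std::condition_variable cv_;
  int count_ = 0;
};

// Starts nIOVs, ends each on its own thread, and returns how many had
// finished their cleanup by the time the wait returned.
int endAllConcurrently(int nIOVs) {
  InFlightIOVs inFlight;
  std::atomic<int> ended{0};
  for (int i = 0; i < nIOVs; ++i) {
    inFlight.start();
  }
  std::vector<std::thread> threads;
  for (int i = 0; i < nIOVs; ++i) {
    threads.emplace_back([&] {
      ++ended;         // stands in for deleting the IOV's cached objects
      inFlight.end();  // signal only after the cleanup is done
    });
  }
  inFlight.waitForAll();
  const int seen = ended.load();
  for (auto& t : threads) {
    t.join();
  }
  return seen;
}
```

Because each thread signals only after its cleanup, the value observed after the wait equals the number of IOVs started. A wait keyed only to the last IOV would give no such guarantee for the earlier ones.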
Thanks David.
> I have not looked in tensorflow now (or ever). Matti, I won't look in there unless you ask me to. I've got no expertise in tensorflow. It seems curious startCancel appears twice at the end of the stack trace but it is possible that is normal. Maybe it is the mutex you mentioned... I don't know.
No need. From what I saw in the TF code, a CancellationManager can contain other CancellationManager objects as well, and the StartCancel() call propagates to the contained objects.
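As a rough illustration of why two StartCancel() frames can appear back to back in a stack trace, here is a toy model of a manager that contains child managers. It is explicitly not TensorFlow's implementation: ToyCancellationManager, addChild, startCancel, and cancelPropagates are hypothetical names, std::mutex stands in for TensorFlow's mutex, and the choice to recurse outside the lock is a design decision of this sketch only.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Toy model only; NOT TensorFlow's CancellationManager.
class ToyCancellationManager {
public:
  void addChild(ToyCancellationManager* child) {
    std::lock_guard<std::mutex> lock(mu_);
    children_.push_back(child);
  }

  void startCancel() {
    std::vector<ToyCancellationManager*> children;
    {
      std::lock_guard<std::mutex> lock(mu_);
      if (cancelled_) {
        return;  // already cancelled, nothing to propagate
      }
      cancelled_ = true;
      children.swap(children_);
    }
    // Propagate outside the lock; each child takes its own mutex, so a
    // stack trace taken here shows startCancel() calling startCancel().
    for (auto* child : children) {
      child->startCancel();
    }
  }

  bool isCancelled() {
    std::lock_guard<std::mutex> lock(mu_);
    return cancelled_;
  }

private:
  std::mutex mu_;
  bool cancelled_ = false;
  std::vector<ToyCancellationManager*> children_;
};

// Returns true if cancelling a parent also cancels its child.
bool cancelPropagates() {
  ToyCancellationManager parent;
  ToyCancellationManager child;
  parent.addChild(&child);
  parent.startCancel();
  return parent.isCancelled() && child.isCancelled();
}
```

In this model the inner call locks the child's mutex, which would be consistent with a trace where the second StartCancel() frame is the one entering the mutex code.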
> Probably this should be fixed, although the probability might be so low that the problem does not occur practically. Or maybe I am missing something...
@Dr15Jones Could you take a look? We tend to eventually hit even rare problems, so if there is a chance for misbehavior, I'd like to get it fixed.
Actually, there could be different IOVs for beginRun, beginLumi and endRun. So it is within the realm of possibility this is the problem. I don't know what establishes the IOV start and end for this record.
Ignore the last comment, the IOV for that record is run 1 to the end of time. There should only be 1 IOV.
Crash in CMSSW_13_1_X_2023-03-12-2300 11605.0 step 3 has an interesting stack trace
Thread 5 (Thread 0x400074a09250 (LWP 912275) "cmsRun"):
#3 0x000040002814a1ec in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x0000400059a7eda4 in nsync::nsync_dll_splice_after_(nsync::nsync_dll_element_s_*, nsync::nsync_dll_element_s_*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#6 0x0000400059a7edcc in nsync::nsync_dll_make_first_in_list_(nsync::nsync_dll_element_s_*, nsync::nsync_dll_element_s_*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#7 0x0000400059a7ee0c in nsync::nsync_dll_make_last_in_list_(nsync::nsync_dll_element_s_*, nsync::nsync_dll_element_s_*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#8 0x0000400059a7effc in nsync::nsync_mu_lock_slow_(nsync::nsync_mu_s_*, nsync::waiter*, unsigned int, nsync::lock_type_s*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#9 0x0000400059a7f0ec in nsync::nsync_mu_lock(nsync::nsync_mu_s_*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#10 0x000040006a5361c8 in tensorflow::CancellationManager::StartCancel() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/external/el8_aarch64_gcc11/lib/libtensorflow_framework.so.2
#11 0x000040006a5364d8 in tensorflow::CancellationManager::StartCancel() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/external/el8_aarch64_gcc11/lib/libtensorflow_framework.so.2
#12 0x000040005945ed24 in tensorflow::DirectSession::Close() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#13 0x000040004fa8a850 in tensorflow::closeSession(tensorflow::Session*&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/libPhysicsToolsTensorFlow.so
#14 0x000040004fa8d808 in TfGraphDefWrapper::~TfGraphDefWrapper() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/libPhysicsToolsTensorFlow.so
#15 0x000040006ea0b5f8 in edm::eventsetup::CallbackProxy<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<TfGraphDefProducer, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> >, TfGraphRecord, edm::eventsetup::CallbackSimpleDecorator<TfGraphRecord> >(TfGraphDefProducer*, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> > (TfGraphDefProducer::*)(TfGraphRecord const&), edm::eventsetup::CallbackSimpleDecorator<TfGraphRecord> const&, edm::es::Label const&)::{lambda(TfGraphRecord const&)#1}, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> >, TfGraphRecord, edm::eventsetup::CallbackSimpleDecorator<TfGraphRecord> >, TfGraphRecord, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> > >::invalidateCache() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/pluginPhysicsToolsTensorFlowPlugins.so
#16 0x0000400020bfd1ec in edm::eventsetup::EventSetupRecordImpl::invalidateProxies() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#17 0x0000400020bfd24c in edm::FunctorWaitingTask<edm::eventsetup::EventSetupRecordIOVQueue::startNewIOVAsync(edm::WaitingTaskHolder const&, edm::WaitingTaskList&)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#18 0x0000400020b89dc0 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#19 0x00004000225f2a64 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x4000d5ac9b00, waiter=<synthetic pointer>..., this=0x400023423f00) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#20 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (waiter=<synthetic pointer>..., t=0x0, this=0x400023423f00) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#21 tbb::detail::r1::arena::process (tls=..., this=0x400023423780) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/arena.cpp:137
#22 tbb::detail::r1::market::process (this=0x40002344b080, j=...) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/market.cpp:599
#23 0x00004000225fac5c in tbb::detail::r1::rml::private_worker::run (this=0x400023b8c000) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#24 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x400023b8c000) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#25 0x0000400022b078b8 in start_thread () from /lib64/libpthread.so.0
#26 0x0000400022b63afc in thread_start () from /lib64/libc.so.6
Thread 3 (Thread 0x400070e19250 (LWP 912268) "cmsRun"):
#2 0x00004000281469cc in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x000040002213ac88 in util_prefetch_write (ptr=<optimized out>) at include/jemalloc/internal/util.h:101
#5 util_prefetch_write_range (sz=1112, ptr=<optimized out>) at include/jemalloc/internal/util.h:117
#6 tcache_bin_flush_metadata_visitor (alloc_ctx=<synthetic pointer>, szind_sum_ctx=<synthetic pointer>) at src/tcache.c:257
#7 emap_edata_lookup_batch (result=result@entry=0x4000723c02c0, metadata_visitor_ctx=<synthetic pointer>, metadata_visitor=<optimized out>, ptr_getter_ctx=ptr_getter_ctx@entry=0x4000723c0280, ptr_getter=<optimized out>, nptrs=nptrs@entry=256, emap=<optimized out>, tsd=tsd@entry=0x400070e18528) at include/jemalloc/internal/emap.h:353
#8 tcache_bin_flush_edatas_lookup (tsd=tsd@entry=0x400070e1ee40, arr=arr@entry=0x0, nflush=nflush@entry=32, edatas=edatas@entry=0x400070e18460, binind=1178501120) at src/tcache.c:288
#9 0x000040002213b444 in tcache_bin_flush_impl (small=true, nflush=32, ptrs=0x0, binind=1178501120, cache_bin=0x400070e1f1c8, tcache=0x0, tsd=0x20) at src/tcache.c:331
#10 tcache_bin_flush_bottom (small=<optimized out>, rem=<optimized out>, binind=<optimized out>, cache_bin=<optimized out>, tcache=<optimized out>, tsd=tsd@entry=0x20) at src/tcache.c:519
#11 je_tcache_bin_flush_small (tsd=tsd@entry=0x400070e1ee40, tcache=0x0, cache_bin=0x400070e1f1c8, binind=4294538056, rem=<optimized out>) at src/tcache.c:529
#12 0x00004000220f0bdc in tcache_dalloc_small (slow_path=false, binind=<optimized out>, ptr=0x4001ccb1e000, tcache=<optimized out>, tsd=0x400070e1ee40) at include/jemalloc/internal/tcache_inlines.h:157
#13 arena_sdalloc (slow_path=<optimized out>, caller_alloc_ctx=<optimized out>, tcache=<optimized out>, size=<optimized out>, ptr=<optimized out>, tsdn=<optimized out>) at include/jemalloc/internal/arena_inlines_b.h:418
#14 isdalloct (slow_path=<optimized out>, alloc_ctx=<optimized out>, tcache=<optimized out>, size=<optimized out>, ptr=<optimized out>, tsdn=<optimized out>) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:133
#15 isfree (slow_path=false, tcache=<optimized out>, usize=<optimized out>, ptr=0x4001ccb1e000, tsd=0x400070e1ee40) at src/jemalloc.c:2982
#16 je_sdallocx_default (ptr=0x4001ccb1e000, size=<optimized out>, flags=<optimized out>) at src/jemalloc.c:3924
#17 0x0000400022141b90 in sizedDeleteImpl (size=<optimized out>, ptr=<optimized out>) at src/jemalloc_cpp.cpp:195
#18 operator delete (ptr=<optimized out>, size=<optimized out>) at src/jemalloc_cpp.cpp:200
#19 0x000040004a343ab0 in CaloTPGTranscoderULUT::~CaloTPGTranscoderULUT() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/libCalibCalorimetryCaloTPG.so
#20 0x000040004a343b84 in CaloTPGTranscoderULUT::~CaloTPGTranscoderULUT() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/libCalibCalorimetryCaloTPG.so
#21 0x000040004a312f58 in edm::eventsetup::CallbackProxy<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<CaloTPGTranscoderULUTs, std::unique_ptr<CaloTPGTranscoder, std::default_delete<CaloTPGTranscoder> >, CaloTPGRecord, edm::eventsetup::CallbackSimpleDecorator<CaloTPGRecord> >(CaloTPGTranscoderULUTs*, std::unique_ptr<CaloTPGTranscoder, std::default_delete<CaloTPGTranscoder> > (CaloTPGTranscoderULUTs::*)(CaloTPGRecord const&), edm::eventsetup::CallbackSimpleDecorator<CaloTPGRecord> const&, edm::es::Label const&)::{lambda(CaloTPGRecord const&)#1}, std::unique_ptr<CaloTPGTranscoder, std::default_delete<CaloTPGTranscoder> >, CaloTPGRecord, edm::eventsetup::CallbackSimpleDecorator<CaloTPGRecord> >, CaloTPGRecord, std::unique_ptr<CaloTPGTranscoder, std::default_delete<CaloTPGTranscoder> > >::invalidateCache() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/pluginCalibCalorimetryCaloTPGPlugins.so
#22 0x0000400020bfd1ec in edm::eventsetup::EventSetupRecordImpl::invalidateProxies() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#23 0x0000400020bfd24c in edm::FunctorWaitingTask<edm::eventsetup::EventSetupRecordIOVQueue::startNewIOVAsync(edm::WaitingTaskHolder const&, edm::WaitingTaskList&)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#24 0x0000400020b89dc0 in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#25 0x00004000225f2a64 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x4000d5aca100, waiter=<synthetic pointer>..., this=0x400023423e80) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/task_dispatcher.h:322
#26 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (waiter=<synthetic pointer>..., t=0x0, this=0x400023423e80) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/task_dispatcher.h:458
#27 tbb::detail::r1::arena::process (tls=..., this=0x400023423780) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/arena.cpp:137
#28 tbb::detail::r1::market::process (this=0x40002344b080, j=...) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/market.cpp:599
#29 0x00004000225fac5c in tbb::detail::r1::rml::private_worker::run (this=0x400023b8c080) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/private_server.cpp:271
#30 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x400023b8c080) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-d1db7ed5e1d50722d8d27a149fd6e0b9/tbb-v2021.8.0/src/tbb/private_server.cpp:221
#31 0x0000400022b078b8 in start_thread () from /lib64/libpthread.so.0
#32 0x0000400022b63afc in thread_start () from /lib64/libc.so.6
Thread 1 (Thread 0x40002277ca30 (LWP 909123) "cmsRun"):
#0 0x0000400022be5934 in nanosleep () from /lib64/libc.so.6
#1 0x0000400022be57d8 in sleep () from /lib64/libc.so.6
#2 0x00004000281469cc in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x0000400020a18dbc in do_lookup_x (undef_name=undef_name@entry=0x40006ef4293f "_ZN6HepPDT18ResonanceStructureD1Ev", new_hash=new_hash@entry=1954549647, old_hash=old_hash@entry=0xffffe2beb918, ref=0x40006ef407e0, result=result@entry=0xffffe2beb928, scope=<optimized out>, i=463, version=version@entry=0x0, flags=flags@entry=5, skip=<optimized out>, skip@entry=0x0, type_class=type_class@entry=1, undef_map=undef_map@entry=0x400036a80000) at dl-lookup.c:384
#5 0x0000400020a19748 in _dl_lookup_symbol_x (undef_name=0x40006ef4293f "_ZN6HepPDT18ResonanceStructureD1Ev", undef_map=undef_map@entry=0x400036a80000, ref=ref@entry=0xffffe2beb9b0, symbol_scope=0x400036a80398, version=0x0, type_class=type_class@entry=1, flags=5, skip_map=skip_map@entry=0x0) at dl-lookup.c:855
#6 0x0000400020a1f130 in _dl_fixup (l=0x400036a80000, reloc_arg=816) at dl-runtime.c:94
#7 0x0000400020a110c4 in _dl_runtime_resolve () at ../sysdeps/aarch64/dl-trampoline.S:99
#8 0x000040006ef4dfec in std::default_delete<HepPDT::ParticleDataTable>::operator()(HepPDT::ParticleDataTable*) const [clone .part.0] () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/pluginSimGeneralHepPDTESSource.so
#9 0x000040006ef4dfec in std::default_delete<HepPDT::ParticleDataTable>::operator()(HepPDT::ParticleDataTable*) const [clone .part.0] () from /cvmfs/cms-ib.cern.ch/sw/aarch64/week0/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-12-2300/lib/el8_aarch64_gcc11/pluginSimGeneralHepPDTESSource.so
#10 0x00004000d5c07d60 in ?? ()
This hints more towards a problem in TensorFlow's nsync mutex (although I find this crash a somewhat odd way for such a problem to manifest itself).
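One generic way to probe whether a mutex actually provides mutual exclusion on a given platform is a counter stress test, sketched below. This is an assumption about how one might investigate, not something anyone in this thread ran. The function stressMutex is hypothetical and std::mutex is used as a stand-in; testing another mutex would require substituting a type that offers lock() and unlock().

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Increments a plain (non-atomic) counter from several threads, guarding
// every increment with a mutex of type Mutex. If the mutex fails to provide
// mutual exclusion, increments can be lost and the result comes up short.
template <typename Mutex>
long stressMutex(int nThreads, int nIterations) {
  Mutex mu;
  long counter = 0;
  std::vector<std::thread> threads;
  for (int t = 0; t < nThreads; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < nIterations; ++i) {
        std::lock_guard<Mutex> lock(mu);
        ++counter;
      }
    });
  }
  for (auto& th : threads) {
    th.join();
  }
  return counter;
}
```

A correct mutex yields exactly nThreads * nIterations. A test like this can only demonstrate a failure; passing does not prove the mutex is reliable, and it would not detect corruption of the mutex's internal waiter list of the kind the stack trace suggests.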
Crash in CMSSW_13_1_X_2023-04-18-2300 workflow 136.859 step 3
19-Apr-2023 03:53:14 CEST Closed file file:step2.root
Thread 5 (Thread 0x40007dfc9250 (LWP 2368260) "cmsRun"):
#4 0x00004000571ab544 in _ZN5boost11multi_index6detail18ordered_index_implINS0_6memberIN22EcalElectronicsMapping7MapItemE5DetIdXadL_ZNS5_4cellEEEEESt4lessIS6_ENS1_9nth_layerILi1ES5_NS0_10indexed_byINS0_14ordered_uniqueIS7_N4mpl_2naESE_EENSC_INS3_IS5_17EcalElectronicsIdXadL_ZNS5_4elidEEEEESE_SE_EENSC_INS3_IS5_24EcalTriggerElectronicsIdXadL_ZNS5_6trelidEEEEESE_SE_EENS0_18ordered_non_uniqueINS0_13const_mem_funIS5_iXadL_ZNKS5_5dccIdEvEEEESE_SE_EENSM_INS0_13composite_keyIS5_SO_NSN_IS5_iXadL_ZNKS5_7towerIdEvEEEENS_6tuples9null_typeEST_ST_ST_ST_ST_ST_ST_EESE_SE_EENSM_INSQ_IS5_SO_SR_NSN_IS5_iXadL_ZNKS5_7stripIdEvEEEEST_ST_ST_ST_ST_ST_ST_EESE_SE_EENSM_INSN_IS5_iXadL_ZNKS5_5tccIdEvEEEESE_SE_EENSM_INSQ_IS5_SZ_NSN_IS5_iXadL_ZNKS5_4ttIdEvEEEEST_ST_ST_ST_ST_ST_ST_ST_EESE_SE_EENSM_INSQ_IS5_SZ_S11_NSN_IS5_iXadL_ZNKS5_13pseudoStripIdEvEEEEST_ST_ST_ST_ST_ST_ST_EESE_SE_EESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaIS5_EEENS_3mpl7vector0ISE_EENS1_18ordered_unique_tagENS1_19null_augment_policyEE16delete_all_nodesEPNS1_18ordered_index_nodeIS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1_15index_node_baseIS5_S18_EEEEEEEEEEEEEEEEEEEE.isra.0 () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/pluginGeometryEcalMappingPlugins.so
#5 0x00004000571ab668 in _ZN5boost11multi_index6detail18ordered_index_implINS0_6memberIN22EcalElectronicsMapping7MapItemE5DetIdXadL_ZNS5_4cellEEEEESt4lessIS6_ENS1_9nth_layerILi1ES5_NS0_10indexed_byINS0_14ordered_uniqueIS7_N4mpl_2naESE_EENSC_INS3_IS5_17EcalElectronicsIdXadL_ZNS5_4elidEEEEESE_SE_EENSC_INS3_IS5_24EcalTriggerElectronicsIdXadL_ZNS5_6trelidEEEEESE_SE_EENS0_18ordered_non_uniqueINS0_13const_mem_funIS5_iXadL_ZNKS5_5dccIdEvEEEESE_SE_EENSM_INS0_13composite_keyIS5_SO_NSN_IS5_iXadL_ZNKS5_7towerIdEvEEEENS_6tuples9null_typeEST_ST_ST_ST_ST_ST_ST_EESE_SE_EENSM_INSQ_IS5_SO_SR_NSN_IS5_iXadL_ZNKS5_7stripIdEvEEEEST_ST_ST_ST_ST_ST_ST_EESE_SE_EENSM_INSN_IS5_iXadL_ZNKS5_5tccIdEvEEEESE_SE_EENSM_INSQ_IS5_SZ_NSN_IS5_iXadL_ZNKS5_4ttIdEvEEEEST_ST_ST_ST_ST_ST_ST_ST_EESE_SE_EENSM_INSQ_IS5_SZ_S11_NSN_IS5_iXadL_ZNKS5_13pseudoStripIdEvEEEEST_ST_ST_ST_ST_ST_ST_EESE_SE_EESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaIS5_EEENS_3mpl7vector0ISE_EENS1_18ordered_unique_tagENS1_19null_augment_policyEE16delete_all_nodesEPNS1_18ordered_index_nodeIS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1G_IS1E_NS1_15index_node_baseIS5_S18_EEEEEEEEEEEEEEEEEEEE.isra.0 () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/pluginGeometryEcalMappingPlugins.so
#6 0x00004000571adb60 in edm::eventsetup::CallbackProxy<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<EcalElectronicsMappingBuilder, std::unique_ptr<EcalElectronicsMapping, std::default_delete<EcalElectronicsMapping> >, EcalMappingRcd, edm::eventsetup::CallbackSimpleDecorator<EcalMappingRcd> >(EcalElectronicsMappingBuilder*, std::unique_ptr<EcalElectronicsMapping, std::default_delete<EcalElectronicsMapping> > (EcalElectronicsMappingBuilder::*)(EcalMappingRcd const&), edm::eventsetup::CallbackSimpleDecorator<EcalMappingRcd> const&, edm::es::Label const&)::{lambda(EcalMappingRcd const&)#1}, std::unique_ptr<EcalElectronicsMapping, std::default_delete<EcalElectronicsMapping> >, EcalMappingRcd, edm::eventsetup::CallbackSimpleDecorator<EcalMappingRcd> >, EcalMappingRcd, std::unique_ptr<EcalElectronicsMapping, std::default_delete<EcalElectronicsMapping> > >::invalidateCache() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/pluginGeometryEcalMappingPlugins.so
#7 0x000040002b74d68c in edm::eventsetup::EventSetupRecordImpl::invalidateProxies() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#8 0x000040002b74d6ec in edm::FunctorWaitingTask<edm::eventsetup::EventSetupRecordIOVQueue::startNewIOVAsync(edm::WaitingTaskHolder const&, edm::WaitingTaskList&)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
Thread 4 (Thread 0x40007d5b9250 (LWP 2368259) "cmsRun"):
#2 0x0000400034482f88 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x000040002d6b37e4 in syscall () from /lib64/libc.so.6
#5 0x000040002d1397b8 in tbb::detail::r1::futex_wait (comparand=2, futex=0x40003331c124) at /data/cmsbuild/jenkins_c/workspace/auto-builds/CMSSW_13_1_0_pre3-el8_aarch64_gcc11/build/CMSSW_13_1_0_pre3-build/BUILD/el8_aarch64_gcc11/external/tbb/v2021.8.0-8f30f4fc8c5b3860b0ce8f2b70736d15/tbb-v2021.8.0/src/tbb/semaphore.h:103
Thread 3 (Thread 0x40007cba9250 (LWP 2368258) "cmsRun"):
#3 0x0000400034487888 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x0000400061ffec98 in nsync::nsync_dll_splice_after_(nsync::nsync_dll_element_s_*, nsync::nsync_dll_element_s_*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-04-18-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#6 0x0000400061ffeccc in nsync::nsync_dll_make_first_in_list_(nsync::nsync_dll_element_s_*, nsync::nsync_dll_element_s_*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-04-18-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#7 0x0000400061ffed0c in nsync::nsync_dll_make_last_in_list_(nsync::nsync_dll_element_s_*, nsync::nsync_dll_element_s_*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-04-18-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#8 0x0000400061ffeefc in nsync::nsync_mu_lock_slow_(nsync::nsync_mu_s_*, nsync::waiter*, unsigned int, nsync::lock_type_s*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-04-18-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#9 0x0000400061ffefec in nsync::nsync_mu_lock(nsync::nsync_mu_s_*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-04-18-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#10 0x0000400072ab61c8 in tensorflow::CancellationManager::StartCancel() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-04-18-2300/external/el8_aarch64_gcc11/lib/libtensorflow_framework.so.2
#11 0x0000400072ab64d8 in tensorflow::CancellationManager::StartCancel() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-04-18-2300/external/el8_aarch64_gcc11/lib/libtensorflow_framework.so.2
#12 0x00004000619dec24 in tensorflow::DirectSession::Close() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-04-18-2300/external/el8_aarch64_gcc11/lib/libtensorflow_cc.so.2
#13 0x000040005800a4e0 in tensorflow::closeSession(tensorflow::Session*&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/libPhysicsToolsTensorFlow.so
#14 0x000040005800d328 in TfGraphDefWrapper::~TfGraphDefWrapper() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/libPhysicsToolsTensorFlow.so
#15 0x00004000770eb2f8 in edm::eventsetup::CallbackProxy<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<TfGraphDefProducer, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> >, TfGraphRecord, edm::eventsetup::CallbackSimpleDecorator<TfGraphRecord> >(TfGraphDefProducer*, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> > (TfGraphDefProducer::*)(TfGraphRecord const&), edm::eventsetup::CallbackSimpleDecorator<TfGraphRecord> const&, edm::es::Label const&)::{lambda(TfGraphRecord const&)#1}, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> >, TfGraphRecord, edm::eventsetup::CallbackSimpleDecorator<TfGraphRecord> >, TfGraphRecord, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> > >::invalidateCache() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/pluginPhysicsToolsTensorFlowPlugins.so
#16 0x000040002b74d68c in edm::eventsetup::EventSetupRecordImpl::invalidateProxies() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#17 0x000040002b74d6ec in edm::FunctorWaitingTask<edm::eventsetup::EventSetupRecordIOVQueue::startNewIOVAsync(edm::WaitingTaskHolder const&, edm::WaitingTaskList&)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
Thread 1 (Thread 0x40002d29d090 (LWP 2365421) "cmsRun"):
#2 0x0000400034482f88 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 free_fastpath (size_hint=true, size=24, ptr=0x40017204fde0) at src/jemalloc.c:3097
#5 je_je_sdallocx_noflags (ptr=0x40017204fde0, size=24) at src/jemalloc.c:3950
#6 0x000040002cc91b90 in sizedDeleteImpl (size=<optimized out>, ptr=<optimized out>) at src/jemalloc_cpp.cpp:195
#7 operator delete (ptr=<optimized out>, size=<optimized out>) at src/jemalloc_cpp.cpp:200
#8 0x0000400076b99cf0 in edm::eventsetup::CallbackProxy<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<SiStripClusterizerConditionsESProducer, std::unique_ptr<SiStripClusterizerConditions, std::default_delete<SiStripClusterizerConditions> >, SiStripClusterizerConditionsRcd, edm::eventsetup::CallbackSimpleDecorator<SiStripClusterizerConditionsRcd> >(SiStripClusterizerConditionsESProducer*, std::unique_ptr<SiStripClusterizerConditions, std::default_delete<SiStripClusterizerConditions> > (SiStripClusterizerConditionsESProducer::*)(SiStripClusterizerConditionsRcd const&), edm::eventsetup::CallbackSimpleDecorator<SiStripClusterizerConditionsRcd> const&, edm::es::Label const&)::{lambda(SiStripClusterizerConditionsRcd const&)#1}, std::unique_ptr<SiStripClusterizerConditions, std::default_delete<SiStripClusterizerConditions> >, SiStripClusterizerConditionsRcd, edm::eventsetup::CallbackSimpleDecorator<SiStripClusterizerConditionsRcd> >, SiStripClusterizerConditionsRcd, std::unique_ptr<SiStripClusterizerConditions, std::default_delete<SiStripClusterizerConditions> > >::invalidateCache() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/pluginRecoLocalTrackerSiStripClusterizerPlugins.so
#9 0x000040002b74d68c in edm::eventsetup::EventSetupRecordImpl::invalidateProxies() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#10 0x000040002b74d6ec in edm::FunctorWaitingTask<edm::eventsetup::EventSetupRecordIOVQueue::startNewIOVAsync(edm::WaitingTaskHolder const&, edm::WaitingTaskList&)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02781/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-04-17-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
Current Modules:
Module: none (crashed)
Module: none
Module: none
Module: none
The job was shutting down because of an exception
----- Begin Fatal Exception 19-Apr-2023 03:52:43 CEST-----------------------
An exception of category 'PixelCPEClusterRepair::localError' occurred while
[0] Processing Event run: 315489 lumi: 1 event: 494169 stream: 1
[1] Running path 'dqmoffline_7_step'
[2] Prefetching for module SMPDQM/'SMPDQM'
[3] Prefetching for module MuonProducer/'muons'
[4] Prefetching for module MuonIdProducer/'muons1stStep'
[5] Prefetching for module HBHEIsolatedNoiseReflagger/'hbhereco@cpu'
[6] Prefetching for module TrackExtrapolator/'trackExtrapolator'
[7] Prefetching for module DuplicateListMerger/'generalTracks'
[8] Prefetching for module TrackProducer/'mergedDuplicateTracks'
[9] Prefetching for module DuplicateTrackMerger/'duplicateTrackCandidates'
[10] Prefetching for module TrackCollectionMerger/'preDuplicateMergingGeneralTracks'
[11] Prefetching for module TrackCollectionMerger/'earlyGeneralTracks'
[12] Calling method for module TrackProducer/'lowPtTripletStepTracks'
Exception Message:
ERROR: Negative pixel error yerr = -53690.9
----- End Fatal Exception -------------------------------------------------
Crash in CMSSW_13_3_X_2023-08-17-2300, RelVal 14234.0 step 2:
Thread 4 (Thread 0x4000858d9bc0 (LWP 536647) "cmsRun"):
#0 0x000040003b01f768 in poll () from /lib64/libc.so.6
#1 0x000040003cb7da6c in full_read.constprop () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/lib/el9_aarch64_gcc11/pluginFWCoreServicesPlugins.so
#2 0x000040003cb4c708 in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/lib/el9_aarch64_gcc11/pluginFWCoreServicesPlugins.so
#3 0x000040003cb47ec8 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/lib/el9_aarch64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x000040005b66fcc4 in tsl::CancellationManager::StartCancelWithStatus(tsl::Status const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/external/el9_aarch64_gcc11/lib/libtensorflow_cc.so.2
#6 0x000040005b670228 in tsl::CancellationManager::StartCancel() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/external/el9_aarch64_gcc11/lib/libtensorflow_cc.so.2
#7 0x00004000631efa64 in tensorflow::DirectSession::Close() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/external/el9_aarch64_gcc11/lib/libtensorflow_cc.so.2
#8 0x00004000579ea0bc in tensorflow::closeSession(tensorflow::Session*&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/lib/el9_aarch64_gcc11/libPhysicsToolsTensorFlow.so
#9 0x00004000579ec058 in TfGraphDefWrapper::~TfGraphDefWrapper() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/lib/el9_aarch64_gcc11/libPhysicsToolsTensorFlow.so
#10 0x000040008262b7c4 in edm::eventsetup::CallbackProductResolver<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<TfGraphDefProducer, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> >, TfGraphRecord, edm::eventsetup::CallbackSimpleDecorator<TfGraphRecord> >(TfGraphDefProducer*, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> > (TfGraphDefProducer::*)(TfGraphRecord const&), edm::eventsetup::CallbackSimpleDecorator<TfGraphRecord> const&, edm::es::Label const&)::{lambda(TfGraphRecord const&)#1}, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> >, TfGraphRecord, edm::eventsetup::CallbackSimpleDecorator<TfGraphRecord> >, TfGraphRecord, std::unique_ptr<TfGraphDefWrapper, std::default_delete<TfGraphDefWrapper> > >::invalidateCache() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/lib/el9_aarch64_gcc11/pluginPhysicsToolsTensorFlowPlugins.so
#11 0x0000400038e7ab4c in edm::eventsetup::EventSetupRecordImpl::invalidateProxies() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/lib/el9_aarch64_gcc11/libFWCoreFramework.so
#12 0x0000400038e7abac in edm::FunctorWaitingTask<edm::eventsetup::EventSetupRecordIOVQueue::startNewIOVAsync(edm::WaitingTaskHolder const&, edm::WaitingTaskList&)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/lib/el9_aarch64_gcc11/libFWCoreFramework.so
#13 0x0000400038e0967c in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02798/el9_aarch64_gcc11/cms/cmssw-patch/CMSSW_13_3_X_2023-08-17-2300/lib/el9_aarch64_gcc11/libFWCoreFramework.so
#14 0x000040003a832b88 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x4000aab7aa00, waiter=<synthetic pointer>..., this=0x40003b983e00) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el9_aarch64_gcc11/external/tbb/v2021.9.0-73c9534380ca142d041902611d608a2c/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#15 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (waiter=<synthetic pointer>..., t=0x0, this=0x40003b983e00) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el9_aarch64_gcc11/external/tbb/v2021.9.0-73c9534380ca142d041902611d608a2c/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#16 tbb::detail::r1::arena::process (tls=..., this=0x40003b983780) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el9_aarch64_gcc11/external/tbb/v2021.9.0-73c9534380ca142d041902611d608a2c/tbb-v2021.9.0/src/tbb/arena.cpp:137
#17 tbb::detail::r1::market::process (this=0x40003b9ab080, j=...) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el9_aarch64_gcc11/external/tbb/v2021.9.0-73c9534380ca142d041902611d608a2c/tbb-v2021.9.0/src/tbb/market.cpp:599
#18 0x000040003a83aec8 in tbb::detail::r1::rml::private_worker::run (this=0x40003c0ec100) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el9_aarch64_gcc11/external/tbb/v2021.9.0-73c9534380ca142d041902611d608a2c/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#19 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x40003c0ec100) at /data/cmsbld/jenkins_a/workspace/build-any-ib/w/BUILD/el9_aarch64_gcc11/external/tbb/v2021.9.0-73c9534380ca142d041902611d608a2c/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#20 0x000040003afc2a28 in start_thread () from /lib64/libc.so.6
#21 0x000040003af6bb9c in thread_start () from /lib64/libc.so.6
The following are the pruned thread stack traces for the failed job:
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/el8_aarch64_gcc11/CMSSW_13_1_X_2023-03-07-2300/pyRelValMatrixLogs/run/11634.15_TTbar_14TeV+2021_JMENano/step3_TTbar_14TeV+2021_JMENano.log
This appears to be happening during the 'endJob' phase, as many Services have already reported their job summaries.
Notice that threads 1, 3, and 5 are all doing 'IOV cleanup' tasks. Shouldn't those have been completed in the 'end processing loop' stage?