Closed Dr15Jones closed 3 years ago
wrt the crashes in multiple threads: on arm64 it is possible to get asynchronous processor exceptions on stores, so it's not hard to get a race condition where multiple threads execute the same bad store within the async store window. We might have a race condition in TCling that's generating bad code, which could account for all the threads failing, but so far I haven't identified a candidate race condition.
Even with imprecise exceptions, I still have a hard time coming up with scenarios for segfaults that appear to be in sched_yield() or other syscalls (there was one today with a segfault apparently in clock_gettime()). I'd expect a normal segfault to be imprecise by only a small number of exceptions. There are more exotic scenarios where a fault isn't detected until a dirty write cache flush, but I don't see how we would hit those. Maybe if there's a context switch?
@makortel The point of the test is to prove that there are no unnecessary duplications of histograms, by reducing memory so much that it would run out if there is another set of duplicates.
I think I limited virtual memory because it worked more reliably with ulimit in my tests (making sure the test does actually fail when more histograms are allocated). It is quite possible that this causes problems on the "non-standard" systems; the limit is also a bit unsharp, so maybe just bumping it up a few percent is enough.
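For readers unfamiliar with this kind of test, here is a minimal sketch of the idea: run a child process under a virtual-memory cap and check that it fails only when it allocates more than expected. It assumes Linux, where RLIMIT_AS is the limit behind `ulimit -v`. The helper name, the limit, and the allocation sizes are hypothetical and are not taken from the actual CMSSW test.

```python
import resource
import subprocess
import sys

GIB = 1 << 30


def run_with_vm_limit(code: str, limit_bytes: int) -> int:
    """Run `code` in a child Python process with a capped address space.

    Hypothetical helper: it lowers only the soft RLIMIT_AS limit, which is
    roughly what `ulimit -S -v` does in a shell (ulimit takes kilobytes).
    """

    def cap_address_space():
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = limit_bytes if hard == resource.RLIM_INFINITY else min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    result = subprocess.run([sys.executable, "-c", code], preexec_fn=cap_address_space)
    return result.returncode


# The child exits with 1 if the allocation does not fit under the cap.
ALLOCATE = (
    "import sys\n"
    "try:\n"
    "    buf = bytearray({n})\n"
    "except MemoryError:\n"
    "    sys.exit(1)\n"
)

# A 128 MiB allocation fits under a 2 GiB cap; a 4 GiB allocation does not.
within_limit = run_with_vm_limit(ALLOCATE.format(n=GIB // 8), 2 * GIB)
over_limit = run_with_vm_limit(ALLOCATE.format(n=4 * GIB), 2 * GIB)
print("within limit exit code:", within_limit)
print("over limit exit code:", over_limit)
```

Because the cap counts all mapped address space (shared libraries, thread stacks, allocator arenas), the point at which a real job starts failing can differ between platforms, which may be part of why the limit feels unsharp.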
Here is a crash from the hlt_mc_GRun step2 addOn test. Interestingly, all streams are running EventSetupRecordDataGetter:hltGetConditions.
http://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-10-07-2300/addOnTests/logs/cmsDriver-hlt_mc_GRun_cmsRun__cvmfs_cms-ib.cern.ch_nweek-02649_slc7_aarch64_gcc820_cms_cmssw-patch_CMSSW_11_2_X_2020-10-07-2300_src_HLTrigger_Configuration_test_OnLine_HLT_.log
Thread 5 (Thread 0xffff52738460 (LWP 75809)):
#2 0x0000ffffadec4380 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x0000ffffafb86908 in sched_yield () from /lib64/libc.so.6
#5 0x0000ffffb012e8d4 in __TBB_Pause () at ../../include/tbb/tbb_machine.h:332
#6 tbb::internal::prolonged_pause () at ../../src/tbb/scheduler_common.h:322
#7 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0xffffaf1dbe00, completion_ref_count=@0xffff4ba2cd28: 2, isolation=0) at ../../src/tbb/custom_scheduler.h:305
#8 0x0000ffffb012fa20 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffaf1dbe00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#9 0x0000ffffb012ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffb1d34f20 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#10 0x0000ffffb1b0e648 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#11 0x0000ffffb1b7cff8 in edm::eventsetup::EventSetupRecordImpl::doGet(edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#12 0x0000ffff55d50c70 in edm::EventSetupRecordDataGetter::doGet(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreModules.so
#13 0x0000ffffb1c67300 in edm::stream::EDAnalyzerAdaptorBase::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#14 0x0000ffffb1c41278 in edm::WorkerT<edm::stream::EDAnalyzerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#15 0x0000ffffb1b52304 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#16 0x0000ffffb1b52524 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#17 0x0000ffffb1b5269c in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#18 0x0000ffffb1b52770 in edm::FunctorTask<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
Thread 4 (Thread 0xffff53168460 (LWP 75808)):
#2 0x0000ffffadec4380 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x0000ffffafb86908 in sched_yield () from /lib64/libc.so.6
#5 0x0000ffffb012e9fc in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0xffffaf20be00, completion_ref_count=@0xffff4ba34428: 2, isolation=0) at ../../src/tbb/mailbox.h:225
#6 0x0000ffffb012fa20 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffaf20be00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#7 0x0000ffffb012ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffb1d34f20 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#8 0x0000ffffb1b0e648 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#9 0x0000ffffb1b7cff8 in edm::eventsetup::EventSetupRecordImpl::doGet(edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#10 0x0000ffff55d50c70 in edm::EventSetupRecordDataGetter::doGet(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreModules.so
#11 0x0000ffffb1c67300 in edm::stream::EDAnalyzerAdaptorBase::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#12 0x0000ffffb1c41278 in edm::WorkerT<edm::stream::EDAnalyzerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#13 0x0000ffffb1b52304 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#14 0x0000ffffb1b52524 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#15 0x0000ffffb1b5269c in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#16 0x0000ffffb1b52770 in edm::FunctorTask<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
Thread 3 (Thread 0xffff53b78460 (LWP 75807)):
#2 0x0000ffffadec4380 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x0000ffffafe9a4ec in std::basic_streambuf<char, std::char_traits<char> >::xsgetn (this=0xffff53b76d50, __s=0xffff53b7688c "\222\376\377\377\340h\267S\377\377", __n=1) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:49
#5 0x0000ffffac8841a4 in boost::archive::basic_binary_iprimitive<eos::portable_iarchive, char, std::char_traits<char> >::load_binary(void*, unsigned long) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libCondFormatsAlignment.so
#6 0x0000ffffa745ffb4 in boost::enable_if<boost::is_integral<int>, void>::type eos::portable_iarchive::load<int>(int&, eos::portable_iarchive::dummy<2>) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libCondFormatsRunInfo.so
#7 0x0000ffffa74669e4 in boost::archive::detail::iserializer<eos::portable_iarchive, std::vector<int, std::allocator<int> > >::load_object_data(boost::archive::detail::basic_iarchive&, void*, unsigned int) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libCondFormatsRunInfo.so
#8 0x0000ffffa836769c in boost::archive::detail::basic_iarchive::load_object(void*, boost::archive::detail::basic_iserializer const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-10-07-2300/external/slc7_aarch64_gcc820/lib/libboost_serialization.so.1.72.0
#9 0x0000ffffa263363c in void GBRTreeD::serialize<eos::portable_iarchive>(eos::portable_iarchive&, unsigned int) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libCondFormatsEgammaObjects.so
#10 0x0000ffffa836769c in boost::archive::detail::basic_iarchive::load_object(void*, boost::archive::detail::basic_iserializer const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-10-07-2300/external/slc7_aarch64_gcc820/lib/libboost_serialization.so.1.72.0
#11 0x0000ffffa2637da4 in boost::archive::detail::iserializer<eos::portable_iarchive, std::vector<GBRTreeD, std::allocator<GBRTreeD> > >::load_object_data(boost::archive::detail::basic_iarchive&, void*, unsigned int) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libCondFormatsEgammaObjects.so
#12 0x0000ffffa836769c in boost::archive::detail::basic_iarchive::load_object(void*, boost::archive::detail::basic_iserializer const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-10-07-2300/external/slc7_aarch64_gcc820/lib/libboost_serialization.so.1.72.0
#13 0x0000ffffa836769c in boost::archive::detail::basic_iarchive::load_object(void*, boost::archive::detail::basic_iserializer const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-10-07-2300/external/slc7_aarch64_gcc820/lib/libboost_serialization.so.1.72.0
#14 0x0000ffff54c1241c in std::unique_ptr<GBRForestD, std::default_delete<GBRForestD> > cond::default_deserialize<GBRForestD>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cond::Binary const&, cond::Binary const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginCondCoreEgammaPlugins.so
#15 0x0000ffff54c128dc in std::unique_ptr<GBRForestD, std::default_delete<GBRForestD> > cond::persistency::Session::fetchPayload<GBRForestD>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginCondCoreEgammaPlugins.so
#16 0x0000ffff54c12c34 in cond::persistency::PayloadProxy<GBRForestD>::loadPayload() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginCondCoreEgammaPlugins.so
#17 0x0000ffff54c0afb4 in DataProxy<GBRDWrapperRcd, GBRForestD, cond::DefaultInitializer<GBRForestD> >::prefetch(edm::eventsetup::DataKey const&, edm::EventSetupRecordDetails) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginCondCoreEgammaPlugins.so
#18 0x0000ffffb1b22404 in edm::SerialTaskQueue::QueuedTask<edm::eventsetup::ESSourceDataProxyBase::prefetchAsyncImpl(edm::WaitingTask*, edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#19 0x0000ffffb012f648 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0xffffaf1e3e00, context_guard=..., t=t@entry=0xffff4b9ffb40, isolation=isolation@entry=0) at ../../include/tbb/machine/gcc_generic.h:101
#20 0x0000ffffb012f89c in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffaf1e3e00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#21 0x0000ffffb012ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffb1d34f20 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#22 0x0000ffffb1b0e648 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#23 0x0000ffffb1b7cff8 in edm::eventsetup::EventSetupRecordImpl::doGet(edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#24 0x0000ffff55d50c70 in edm::EventSetupRecordDataGetter::doGet(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreModules.so
#25 0x0000ffffb1c67300 in edm::stream::EDAnalyzerAdaptorBase::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#26 0x0000ffffb1c41278 in edm::WorkerT<edm::stream::EDAnalyzerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#27 0x0000ffffb1b52304 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#28 0x0000ffffb1b52524 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#29 0x0000ffffb1b5269c in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#30 0x0000ffffb1b52770 in edm::FunctorTask<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
Thread 1 (Thread 0xffffaf3e0000 (LWP 160662)):
#3 0x0000ffffadec6194 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x0000ffffafb86908 in sched_yield () from /lib64/libc.so.6
#6 0x0000ffffb012e8d4 in __TBB_Pause () at ../../include/tbb/tbb_machine.h:332
#7 tbb::internal::prolonged_pause () at ../../src/tbb/scheduler_common.h:322
#8 tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0xffffaf2ca600, completion_ref_count=@0xffff4bb90828: 2, isolation=0) at ../../src/tbb/custom_scheduler.h:305
#9 0x0000ffffb012fa20 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffaf2ca600, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#10 0x0000ffffb012ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffb1d34f20 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#11 0x0000ffffb1b0e648 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#12 0x0000ffffb1b7cff8 in edm::eventsetup::EventSetupRecordImpl::doGet(edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#13 0x0000ffff55d50c70 in edm::EventSetupRecordDataGetter::doGet(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreModules.so
#14 0x0000ffffb1c67300 in edm::stream::EDAnalyzerAdaptorBase::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#15 0x0000ffffb1c41278 in edm::WorkerT<edm::stream::EDAnalyzerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#16 0x0000ffffb1b52304 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#17 0x0000ffffb1b52524 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#18 0x0000ffffb1b5269c in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#19 0x0000ffffb1b52770 in edm::FunctorTask<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
Interestingly all streams are running EventSetupRecordDataGetter:hltGetConditions.
That isn't a big surprise to me if they are all waiting on a 'long' running ES data product to be produced. What I find surprising is that the segmentation fault is reported to come from sched_yield! That's a first for me.
One item from the crash says we should update EventSetupRecordDataGetter so that it uses ES consumes.
I'm not having any success replicating the TFormula crashes. Last year I ran a wf in a loop in gdb overnight with no failures, and figured it was probably one of those heisenbugs that go away in gdb. Wrote a simple test program to stress multithreaded TFormula, copying PFRecoTauDiscriminationByIsolationContainer, which has a formula that is simply a constant; no crashes. So now I'm running wf 4.43 on techlab-arm64-thunderx-02 in a loop (no gdb); ran overnight with no crashes.
So I'm puzzled. Are all our aarch64 systems the same model of CPU? Can we tell which host a relval ran on (maybe I should add hostname to the stack trace output)?
(And...shortly after I posted this, it crashed!)
Is this something that should be turned into a bug report for ROOT? With the effort Dan put into writing the standalone program, it should be easy to hand over...
The standalone program never reproduced the problem, and the full WF run manually crashes at a rate that seems inconsistent with the observed rate in the IBs. I suspect it is very timing/load dependent.
With PR #32810 I was able to poke around the stack and the segment map of one of these crashes, where the last two stack frames were
#5 0x0000ffff400000ac in ?? ()
#6 0x0000fffffdd9f0f0 in ?? ()
Looking at the segment map of the process, frame 5 was in 64 kB of dirty (written to) memory not associated with any shared object, consistent with JITted code, and the area disassembled into reasonable-looking ARM64 code. Frame 6 was on the stack, and the stack did look to have been scrambled. Setting the stack pointer to points in the stack that looked like stack frames, I did find somewhat valid stack frames for a call from PFEnergyCalibration::aEndcap(double), consistent with a crash in code generated via PerformancePayloadFromTFormula. This is all looking very suggestive of a data race in the JIT code generation.
Continuing to poke at WF 136.776, I have a crash
#5 0x0000ffff40000094 in ?? ()
#6 0x0000ffff537dba74 in PerformancePayloadFromTFormula::getResult(PerformanceResult::ResultType, BinningPointByMap const&) const () from /cvmfs/cms-ib.cern.ch/week0/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libCondFormatsPhysicsToolsObjects.so
#7 0x0000ffff537dba74 in PerformancePayloadFromTFormula::getResult(PerformanceResult::ResultType, BinningPointByMap const&) const () from /cvmfs/cms-ib.cern.ch/week0/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libCondFormatsPhysicsToolsObjects.so
#8 0x0000ffff490fe790 in ?? ()
where the TFormula is apparently very simple. The routine at the crash address disassembles to
0x0000ffff40000090: adrp x8, 0x1000036d00000
0x0000ffff40000094: ldr d0, [x8, #72] ; crash
0x0000ffff40000098: fmov d1, xzr
0x0000ffff4000009c: ldr d2, [x0]
0x0000ffff400000a0: fmul d1, d2, d1
0x0000ffff400000a4: fadd d0, d1, d0
0x0000ffff400000a8: ret
The adrp instruction computes a page-aligned, PC-relative address, and the following ldr accesses an offset from that page-aligned address; that's where it crashes. The offset looks suspicious: it's larger than the address range being used, but it isn't negative. It might be 2**32 off of a reasonable address.
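To make the "2**32 off" guess concrete, here is a small sketch of the arithmetic. It assumes the architectural definition of ADRP: the result is the 4 KiB page of the PC plus a signed 21-bit immediate scaled by 4 KiB, so the reach is +/- 4 GiB. The PC and target are copied from the disassembly above; the function names are made up for illustration, and nothing here says where the bad value actually came from.

```python
PAGE = 1 << 12                # ADRP works in units of 4 KiB pages
MASK64 = (1 << 64) - 1


def adrp_target(pc: int, imm21: int) -> int:
    """Address an ADRP at `pc` produces for a signed 21-bit page immediate."""
    if not -(1 << 20) <= imm21 < (1 << 20):
        raise ValueError("immediate does not fit in 21 signed bits")
    return ((pc & ~(PAGE - 1)) + imm21 * PAGE) & MASK64


def adrp_page_delta(pc: int, target: int) -> int:
    """Byte distance between the page of `pc` and the page of `target`."""
    return (target & ~(PAGE - 1)) - (pc & ~(PAGE - 1))


pc = 0x0000FFFF40000090       # address of the adrp in the crashing routine
target = 0x0001000036D00000   # page address shown by the disassembler

delta = adrp_page_delta(pc, target)
imm21 = delta // PAGE
print(hex(delta))                    # 0xf6d00000: positive, just under 4 GiB
print(hex(adrp_target(pc, imm21)))   # 0x1000036d00000: the encoding is self-consistent

# Hypothesis from the comment above: the intended target is 2**32 lower.
shifted = target - (1 << 32)
print(hex(shifted))                         # 0xffff36d00000
print(hex(adrp_page_delta(pc, shifted)))    # -0x9300000: a small negative offset
```

If the hypothesis is right, the intended data lives about 147 MiB below the code, in the same 0x0000ffff... region as everything else the process has mapped.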
I have a similar crash on techlab-arm64-thunderx-02, WF 10809.0. The TFormula is evidently more complicated, as the disassembly is quite a bit larger, but again the crash is in the preamble finding the data:
Thread 5 (Thread 0x3ff487c83c0 (LWP 11028)):
#4 <signal handler called>
#5 0x000003ff400000b8 in ?? ()
#6 0x000003ff487c7330 in ?? ()
#7 0x000c00120003305b in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 4 (Thread 0x3ff491d83c0 (LWP 11027)):
#6 <signal handler called>
#7 0x000003ff400000b8 in ?? ()
#8 0x000003ff491d7330 in ?? ()
#9 0x000c00120003305b in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 3 (Thread 0x3ff49be83c0 (LWP 11026)):
#6 <signal handler called>
#7 0x000003ff400000b8 in ?? ()
#8 0x000003ff49be7330 in ?? ()
#9 0x000c00120003305b in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (Thread 0x3ffaf0a0000 (LWP 10771)):
#11 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
where disassembled we get:
0x000003ff400000a8: sub sp, sp, #0x60
0x000003ff400000ac: stp x29, x30, [sp, #80]
0x000003ff400000b0: add x29, sp, #0x50
0x000003ff400000b4: adrp x8, 0x40036640000
0x000003ff400000b8: ldr d0, [x8, #72] ;crash
and again the adrp
offset looks suspiciously larger than the address range in use. This could be a data race, or it could be an issue with whether or not thread 1 calls TFormula
first and how the address space gets laid out. I suspect the latter. TFormula
makes sure a fn only gets compiled once, so once it is mis-compiled all the threads that reach the fn segfault at the same place.
Neither of these reproduces in gdb, I can only get at them with the interactive-debug PR. gdb on 10809 on the techlab machine does reproduce crashes in
onnxruntime::SessionState::UpdateMemoryPatternGroupCache()
so it seems there's a real data race there.
I still see the crashes after updating to cms-root origin/cms/master/a001679, which has https://github.com/root-project/root/pull/6218 (I had assumed that PR wouldn't fix these particular crashes, but it seemed like time to verify it--this did involve rebuilding ROOT, DD4Hep, and all of CMSSW). I'm now testing cmsRunGlibC, and in several hundred tries have not seen any TFormula
crashes, so it seems the problem is related to both multithreading and memory allocation.
I've got the simplest example yet, where the TFormula
is simply loading a constant:
0x0000ffff40000078: adrp x8, 0x1000035166000
0x0000ffff4000007c: ldr d0, [x8, #72]
0x0000ffff40000080: ret
The common theme in all of these is the appearance that somehow a calculation involving a memory address at 0x0000ffff...
has overflowed to 0x00010000...
, yielding an offset into unmapped space. I think it's time to get the cling experts involved.
I think it's time to get the cling experts involved.
Thanks Dan, I agree.
@dan131riley could you make a new issue about race condition in
onnxruntime::SessionState::UpdateMemoryPatternGroupCache()
?
Oh hey, bingo! Debug build gets an assertion failure in RuntimeDyldELF::resolveAArch64Relocation()
!
@vgvassilev can you take a look at this stack trace and suggest how to proceed? It's definitely wrapping around the uint64_t
:
(gdb) p/x Value
$2 = 0xfffe5e94726c
(gdb) p/x Addend
$3 = 0x0
(gdb) p/x FinalAddress
$4 = 0xffff20001470
(gdb) p/x Result
$5 = 0xffffffff3e945dfc
(gdb)
Thread 1 (Thread 0xffff86a33010 (LWP 3947823)):
#5 0x0000ffff86f15c1c in raise () from /lib64/libc.so.6
#6 0x0000ffff86f037a8 in abort () from /lib64/libc.so.6
#7 0x0000ffff86f0f2e8 in __assert_fail_base () from /lib64/libc.so.6
#8 0x0000ffff86f0f350 in __assert_fail () from /lib64/libc.so.6
#9 0x0000ffff3b57d8f0 in llvm::RuntimeDyldELF::resolveAArch64Relocation (this=0xffff4ee45400, Section=..., Offset=400, Value=281467973562988, Type=261, Addend=0) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:363
#10 0x0000ffff3b57fe90 in llvm::RuntimeDyldELF::resolveRelocation (this=0xffff4ee45400, Section=..., Offset=400, Value=281467973562988, Type=261, Addend=0, SymOffset=0, SectionID=3) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:895
#11 0x0000ffff3b57fd54 in llvm::RuntimeDyldELF::resolveRelocation (this=0xffff4ee45400, RE=..., Value=281467973562988) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:877
#12 0x0000ffff3b55d5c0 in llvm::RuntimeDyldImpl::resolveRelocationList (this=0xffff4ee45400, Relocs=..., Value=281467973562988) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:957
#13 0x0000ffff3b5596d0 in llvm::RuntimeDyldImpl::resolveRelocations (this=0xffff4ee45400) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:145
#14 0x0000ffff3b55e1f8 in llvm::RuntimeDyld::resolveRelocations (this=0xffffc631ce58) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1140
#15 0x0000ffff3b55e2e4 in llvm::RuntimeDyld::finalizeWithMemoryManagerLocking (this=0xffffc631ce58) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1158
#16 0x0000ffff3a084fc0 in llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}::operator()(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>) const (__closure=0xffff50367460, H=Python Exception <type 'exceptions.ValueError'> Cannot find type llvm::orc::RTDyldObjectLinkingLayerBase::ObjHandleT::_Node:
, RTDyld=..., ObjToLoad=std::shared_ptr<class llvm::object::OwningBinary<llvm::object::ObjectFile>> (use count 1, weak count 0) = {...}, LOSHandleLoad=...) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:274
#17 0x0000ffff3a0937e8 in llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager>, std::shared_ptr<llvm::JITSymbolResolver>, llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}>::finalize() (this=0xffff4e64ed00) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:143
#18 0x0000ffff3a093870 in llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager>, std::shared_ptr<llvm::JITSymbolResolver>, llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}>::getSymbolMaterializer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda()#1}::operator()() const (this=0xffff4e64ed00) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:158
#19 0x0000ffff3a0940cc in std::_Function_handler<llvm::Expected<unsigned long> (), llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager>, std::shared_ptr<llvm::JITSymbolResolver>, llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}>::getSymbolMaterializer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /cvmfs/cms-ib.cern.ch/nweek-02666/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:286
#20 0x0000ffff3a0781a8 in std::function<llvm::Expected<unsigned long> ()>::operator()() const (this=0xffffc631d000) at /cvmfs/cms-ib.cern.ch/nweek-02666/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:688
#21 0x0000ffff3a0773d4 in llvm::JITSymbol::getAddress (this=0xffffc631d000) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/JITSymbol.h:201
#22 0x0000ffff3a08a200 in llvm::orc::LazyEmittingLayer<llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler> >::EmissionDeferredModule::find(llvm::StringRef, bool, llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler>&)::{lambda()#1}::operator()() const (this=0xffff50367240) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/LazyEmittingLayer.h:75
#23 0x0000ffff3a08e850 in std::_Function_handler<llvm::Expected<unsigned long> (), llvm::orc::LazyEmittingLayer<llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler> >::EmissionDeferredModule::find(llvm::StringRef, bool, llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler>&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /cvmfs/cms-ib.cern.ch/nweek-02666/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:286
#24 0x0000ffff3a0781a8 in std::function<llvm::Expected<unsigned long> ()>::operator()() const (this=0xffffc631d158) at /cvmfs/cms-ib.cern.ch/nweek-02666/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:688
#25 0x0000ffff3a0773d4 in llvm::JITSymbol::getAddress (this=0xffffc631d158) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/JITSymbol.h:201
#26 0x0000ffff3a077780 in cling::IncrementalJIT::getSymbolAddress (this=0xffff40418f00, Name="_GLOBAL__sub_I_cling_module_324", AlsoInProcess=false) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalJIT.h:194
#27 0x0000ffff3a0784ec in cling::IncrementalExecutor::jitInitOrWrapper<void (*)()> (this=0xffff405b1340, funcname=..., fun=@0xffffc631d2a0: 0x0) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalExecutor.h:275
#28 0x0000ffff3a0779d0 in cling::IncrementalExecutor::executeInit (this=0xffff405b1340, function=...) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalExecutor.h:265
#29 0x0000ffff3a076970 in cling::IncrementalExecutor::runStaticInitializersOnce (this=0xffff405b1340, T=...) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:262
#30 0x0000ffff39f6adc8 in cling::Interpreter::executeTransaction (this=0xffff402b1400, T=...) at /home/dsr/root/interpreter/cling/lib/Interpreter/Interpreter.cpp:1691
#31 0x0000ffff3a09657c in cling::IncrementalParser::commitTransaction (this=0xffff40251c00, PRT=..., ClearDiagClient=true) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalParser.cpp:613
#32 0x0000ffff3a096c70 in cling::IncrementalParser::Compile (this=0xffff40251c00, input=..., Opts=...) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalParser.cpp:769
#33 0x0000ffff39f697c4 in cling::Interpreter::DeclareInternal (this=0xffff402b1400, input="\n#define __ROOTCLING__ 1\n#undef ClassDef\n#define ClassDef(name,id) \\\n_ClassDefOutline_(name,id,virtual,) \\\nstatic int DeclFileLine() { return __LINE__; }\n#undef ClassDefNV\n#define ClassDefNV(name, id)"..., CO=..., T=0x0) at /home/dsr/root/interpreter/cling/lib/Interpreter/Interpreter.cpp:1338
#34 0x0000ffff39f68384 in cling::Interpreter::parseForModule (this=0xffff402b1400, input="\n#define __ROOTCLING__ 1\n#undef ClassDef\n#define ClassDef(name,id) \\\n_ClassDefOutline_(name,id,virtual,) \\\nstatic int DeclFileLine() { return __LINE__; }\n#undef ClassDefNV\n#define ClassDefNV(name, id)"...) at /home/dsr/root/interpreter/cling/lib/Interpreter/Interpreter.cpp:922
#35 0x0000ffff39d86254 in ExecAutoParse (what=0xffff80ac0e38 "\n#line 1 \"DataFormatsTrackReco_xr dictionary payload\"\n\n#ifndef CMS_DICT_IMPL\n #define CMS_DICT_IMPL 1\n#endif\n#ifndef _REENTRANT\n #define _REENTRANT 1\n#endif\n#ifndef GNUSOURCE\n #define GNUSOURCE 1\n#"..., header=false, interpreter=0xffff402b1400) at /home/dsr/root/core/metacling/src/TCling.cxx:6232
#36 0x0000ffff39d86944 in TCling::AutoParseImplRecurse (this=0xffff40418b80, cls=0xffff4b3251a0 "vector<reco::Track>", topLevel=false) at /home/dsr/root/core/metacling/src/TCling.cxx:6337
#37 0x0000ffff39d86c30 in TCling::AutoParseImplRecurse (this=0xffff40418b80, cls=0xffff4b3d8400 "edm::refhelper::FindUsingAdvance<vector<reco::Track>,reco::Track>", topLevel=true) at /home/dsr/root/core/metacling/src/TCling.cxx:6373
#38 0x0000ffff39d86f34 in TCling::AutoParse (this=0xffff40418b80, cls=0xffff4b3d8400 "edm::refhelper::FindUsingAdvance<vector<reco::Track>,reco::Track>") at /home/dsr/root/core/metacling/src/TCling.cxx:6422
#39 0x0000ffff39d72034 in TClingLookupHelper__AutoParse (cname=0xffff4b3d8400 "edm::refhelper::FindUsingAdvance<vector<reco::Track>,reco::Track>") at /home/dsr/root/core/metacling/src/TCling.cxx:900
#40 0x0000ffff39c1500c in ROOT::TMetaUtils::TClingLookupHelper::GetPartiallyDesugaredNameWithScopeHandling (this=0xffff44e45740, tname="edm::refhelper::FindUsingAdvance<vector<reco::Track>,reco::Track>", result="", dropstd=true) at /home/dsr/root/core/clingutils/src/TClingUtils.cxx:626
#41 0x0000ffff87cb81d4 in TClassEdit::TSplitType::ShortType (this=0xffffc631e588, answ="", mode=3618) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:437
#42 0x0000ffff87cbb5b0 in TClassEdit::ShortType[abi:cxx11](char const*, int) (typeDesc=0xffff4b39a580 "edm::RefVector<std::vector<reco::Track>,reco::Track,edm::refhelper::FindUsingAdvance<std::vector<reco::Track>,reco::Track> >", mode=3618) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:1292
#43 0x0000ffff87cb80dc in TClassEdit::TSplitType::ShortType (this=0xffffc631e6d8, answ="", mode=3618) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:429
#44 0x0000ffff87cbb5b0 in TClassEdit::ShortType[abi:cxx11](char const*, int) (typeDesc=0xffff4b3d84a0 "std::vector<edm::RefVector<std::vector<reco::Track>,reco::Track,edm::refhelper::FindUsingAdvance<std::vector<reco::Track>,reco::Track> > >", mode=3618) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:1292
#45 0x0000ffff87cb80dc in TClassEdit::TSplitType::ShortType (this=0xffffc631e870, answ="", mode=3618) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:429
#46 0x0000ffff87cb94f0 in TClassEdit::GetNormalizedName (norm_name="", name="edm::AssociationVector<edm::RefToBaseProd<reco::Jet>,std::vector<edm::RefVector<std::vector<reco::Track>,reco::Track,edm::refhelper::FindUsingAdvance<std::vector<reco::Track>,reco::Track> > >,edm::Ref"...) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:851
#47 0x0000ffff87cdae18 in TClass::GetClass (name=0xffff4a5dcc40 "edm::AssociationVector<edm::RefToBaseProd<reco::Jet>,std::vector<edm::RefVector<std::vector<reco::Track>,reco::Track,edm::refhelper::FindUsingAdvance<std::vector<reco::Track>,reco::Track> > >,edm::Ref"..., load=true, silent=false, hint_pair_offset=0, hint_pair_size=0) at /home/dsr/root/core/meta/src/TClass.cxx:3032
#48 0x0000ffff87cdab14 in TClass::GetClass (name=0xffff4a5dcc40 "edm::AssociationVector<edm::RefToBaseProd<reco::Jet>,std::vector<edm::RefVector<std::vector<reco::Track>,reco::Track,edm::refhelper::FindUsingAdvance<std::vector<reco::Track>,reco::Track> > >,edm::Ref"..., load=true, silent=false) at /home/dsr/root/core/meta/src/TClass.cxx:2948
#49 0x0000ffff88e411d4 in edm::TypeWithDict::byName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreReflection.so
#50 0x0000ffff88e3d05c in edm::TypeWithDict::byName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreReflection.so
#51 0x0000ffff88f06e80 in edm::BranchDescription::initFromDictionary() () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libDataFormatsProvenance.so
#52 0x0000ffff88f08168 in edm::BranchDescription::BranchDescription(edm::BranchType const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, edm::Hash<1> const&, edm::TypeWithDict const&, bool, bool, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libDataFormatsProvenance.so
#53 0x0000ffff89440b18 in edm::ProductRegistryHelper::addToRegistry(__gnu_cxx::__normal_iterator<edm::ProductRegistryHelper::TypeLabelItem const*, std::vector<edm::ProductRegistryHelper::TypeLabelItem, std::allocator<edm::ProductRegistryHelper::TypeLabelItem> > > const&, __gnu_cxx::__normal_iterator<edm::ProductRegistryHelper::TypeLabelItem const*, std::vector<edm::ProductRegistryHelper::TypeLabelItem, std::allocator<edm::ProductRegistryHelper::TypeLabelItem> > > const&, edm::ModuleDescription const&, edm::ProductRegistry&, edm::ProductRegistryHelper*, bool) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#54 0x0000ffff8943f074 in edm::ProducerBase::registerProducts(edm::ProducerBase*, edm::ProductRegistry*, edm::ModuleDescription const&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#55 0x0000ffff894e7cf0 in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::registerProductsAndCallbacks(edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase> const*, edm::ProductRegistry*) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#56 0x0000ffff2ae3ed70 in edm::maker::ModuleHolderT<edm::stream::EDProducerAdaptorBase>::registerProductsAndCallbacks(edm::ProductRegistry*) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/pluginRecoBTagCombinedPlugins.so
#57 0x0000ffff894b4ebc in edm::Maker::makeModule(edm::MakeModuleParams const&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&) const () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#58 0x0000ffff893fece0 in edm::Factory::makeModule(edm::MakeModuleParams const&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&) const () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#59 0x0000ffff89410e9c in edm::ModuleRegistry::getModule(edm::MakeModuleParams const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#60 0x0000ffff894b7a4c in edm::WorkerRegistry::getWorker(edm::WorkerParams const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#61 0x0000ffff894b57cc in edm::WorkerManager::getWorker(edm::ParameterSet&, edm::ProductRegistry&, edm::PreallocationConfiguration const*, std::shared_ptr<edm::ProcessConfiguration const>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#62 0x0000ffff894b667c in edm::WorkerManager::addToUnscheduledWorkers(edm::ParameterSet&, edm::ProductRegistry&, edm::PreallocationConfiguration const*, std::shared_ptr<edm::ProcessConfiguration>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#63 0x0000ffff89495034 in edm::StreamSchedule::StreamSchedule(std::shared_ptr<edm::TriggerResultInserter>, std::vector<edm::propagate_const<std::shared_ptr<edm::PathStatusInserter> >, std::allocator<edm::propagate_const<std::shared_ptr<edm::PathStatusInserter> > > >&, std::vector<edm::propagate_const<std::shared_ptr<edm::EndPathStatusInserter> >, std::allocator<edm::propagate_const<std::shared_ptr<edm::EndPathStatusInserter> > > >&, std::shared_ptr<edm::ModuleRegistry>, edm::ParameterSet&, edm::service::TriggerNamesService const&, edm::PreallocationConfiguration const&, edm::ProductRegistry&, edm::BranchIDListHelper&, edm::ExceptionToActionTable const&, std::shared_ptr<edm::ActivityRegistry>, std::shared_ptr<edm::ProcessConfiguration>, bool, edm::StreamID, edm::ProcessContext const*) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#64 0x0000ffff89476e34 in edm::Schedule::Schedule(edm::ParameterSet&, edm::service::TriggerNamesService const&, edm::ProductRegistry&, edm::BranchIDListHelper&, edm::ThinnedAssociationsHelper&, edm::SubProcessParentageHelper const*, edm::ExceptionToActionTable const&, std::shared_ptr<edm::ActivityRegistry>, std::shared_ptr<edm::ProcessConfiguration>, bool, edm::PreallocationConfiguration const&, edm::ProcessContext const*) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#65 0x0000ffff89486a24 in edm::ScheduleItems::initSchedule(edm::ParameterSet&, bool, edm::PreallocationConfiguration const&, edm::ProcessContext const*) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#66 0x0000ffff8939e494 in edm::EventProcessor::init(std::shared_ptr<edm::ProcessDesc>&, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#67 0x0000ffff893a00d4 in edm::EventProcessor::EventProcessor(std::shared_ptr<edm::ProcessDesc>, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#68 0x000000000040f4b8 in tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const ()
#69 0x0000ffff874bbb10 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffc6321320, d=warning: RTTI symbol not found for class 'tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>'
...) at ../../src/tbb/arena.cpp:1105
#70 0x00000000004103ec in main::{lambda()#1}::operator()() const ()
#71 0x000000000040ee3c in main ()
(gdb)
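The gdb values printed above can be replayed by hand. The sketch below assumes that relocation Type=261 is the 32-bit PC-relative relocation (R_AARCH64_PREL32 in the ELF numbering) and that the failing assertion is a range check of roughly this shape; the helper names are hypothetical and nothing here is copied from LLVM.

```cpp
#include <cstdint>

// PC-relative relocation value: symbol address plus addend, minus the
// address of the location being patched (unsigned, so it can wrap).
constexpr std::uint64_t prelResult(std::uint64_t value, std::int64_t addend,
                                   std::uint64_t finalAddress) {
  return value + static_cast<std::uint64_t>(addend) - finalAddress;
}

// A 32-bit field can hold the result only if, read as a signed 64-bit
// number, it lies in [INT32_MIN, UINT32_MAX]. (Assumed form of the check.)
constexpr bool fitsIn32(std::uint64_t result) {
  return static_cast<std::int64_t>(result) >= INT32_MIN &&
         static_cast<std::int64_t>(result) <= static_cast<std::int64_t>(UINT32_MAX);
}

// Values printed by gdb above.
constexpr std::uint64_t kValue = 0xfffe5e94726cULL;
constexpr std::uint64_t kFinalAddress = 0xffff20001470ULL;
constexpr std::uint64_t kResult = prelResult(kValue, 0, kFinalAddress);

static_assert(kResult == 0xffffffff3e945dfcULL, "matches gdb's Result");
static_assert(static_cast<std::int64_t>(kResult) == -0xc16ba204LL,
              "target is about 3 GiB below the fixup");
static_assert(!fitsIn32(kResult), "does not fit a 32-bit PC-relative field");
```

Read as a signed number, Result is roughly -3 GiB, so the "wrap-around" is just a negative distance that is too large for a 32-bit field.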
@dan131riley, thanks for the ping.
Unfortunately we are not in a very favorable position. We know ROOT has some JIT issues on arm (cc: @axel-naumann). I found this issue submitted here dotnet/runtime#46881 which hints two things. First it seems that it is not due to our particular JIT setup and second, this will still persist in ROOT after the llvm-9 upgrade.
We should try fixing it ourselves or try to work around it. Do we need dictionary support for FindUsingAdvance
? If not, we can try removing it and see if we live another day.
On Mon I will contact the llvm JIT people to seek more guidance.
@vgvassilev :
CodeModel::Large
- and it seems we don't, for aarch64. Could you propose a patch for CMS to try that sets CodeModel::Large
also for aarch64?
@Axel-Naumann, indeed worth trying.
@dan131riley, can you test:
diff --git a/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp b/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp
index 43b37154b5..93cabf7073 100644
--- a/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp
+++ b/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp
@@ -57,7 +57,7 @@ CreateHostTargetMachine(const clang::CompilerInstance& CI) {
// We have to use large code model for PowerPC64 because TOC and text sections
// can be more than 2GB apart.
-#if defined(__powerpc64__) || defined(__PPC64__)
+#if defined(__powerpc64__) || defined(__PPC64__) || defined(__aarch64__)
CodeModel::Model CMModel = CodeModel::Large;
#else
CodeModel::Model CMModel = CodeModel::JITDefault;
thanks @vgvassilev , I am testing the suggested change here https://github.com/cms-sw/root/pull/150
Unfortunately, that doesn't fix it. I still get assertion failures in a debug build (where I verified that the model was set to CodeModel::Large
), and crashes in a release build:
#5 0x0000ffff2000009c in ?? ()
#6 0x0000ffff30dd4650 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Dump of assembler code from 0xffff20000090 to 0xffff20000108:
0x0000ffff20000090: stp x29, x30, [sp, #-16]!
0x0000ffff20000094: mov x29, sp
0x0000ffff20000098: adrp x8, 0x100001f924000
0x0000ffff2000009c: ldr d0, [x8, #88]
0x0000ffff200000a0: ldr d1, [x0]
0x0000ffff200000a4: fneg d1, d1
0x0000ffff200000a8: fdiv d0, d1, d0
0x0000ffff200000ac: bl 0xffff200000d0
0x0000ffff200000b0: adrp x8, 0x100001f924000
0x0000ffff200000b4: ldr d1, [x8, #72]
0x0000ffff200000b8: adrp x8, 0x100001f924000
0x0000ffff200000bc: ldr d2, [x8, #80]
0x0000ffff200000c0: fmul d0, d0, d2
0x0000ffff200000c4: fadd d0, d0, d1
0x0000ffff200000c8: ldp x29, x30, [sp], #16
0x0000ffff200000cc: ret
@Axel-Naumann, indeed worth trying.
I think that patch doesn't actually change anything...for aarch64, the target description sets CodeModel::Large
if it was set to CodeModel::JITDefault
here:
Thank you for trying this out, Dan - this serves as input to Vassil's discussion with the JIT expert (Lang). (FYI Vassil, my current hypothesis is that this is a relocation that is meant to happen within the same code segment, explaining the smaller reloc size, but where our JIT has split relocation and target into different segments.)
All the crashes I've seen follow the same pattern of an adrp/ldr
as the first or nearly the first thing in the function, including the simplest example
0x0000ffff40000078: adrp x8, 0x1000035166000
0x0000ffff4000007c: ldr d0, [x8, #72]
0x0000ffff40000080: ret
where the source for that routine is just a floating point constant.
I'm not very familiar with the LLVM structure, but from what I've looked at I'm guessing these come from AArch64FastISel::materializeFP()
when it materializes a floating point constant from the constant pool. If that's correct, then the problem is the placement of the constant pool, which seems to end up a little too far away, or a little too far away and in the wrong direction, while the codegen looks to assume that the constant pool is close by.
@dan131riley, since you seem to be having fun with llvm ;) -- can you also dump the relevant llvm::Module
. I wonder if we can convert it into something standalone and run it through lli
and reproduce the issue in isolation. That'd make it easier to understand and fix.
@dan131riley
I'm not very familiar with the LLVM structure, but from what I've looked at I'm guessing these come from AArch64FastISel::materializeFP() when it materializes a floating point constant from the constant pool. If that's correct, then the problem is the placement of the constant pool, which seems to end up a little too far away, or a little too far away and in the wrong direction, while the codegen looks to assume that the constant pool is close by.
The ADRP/LDR sequence should be able to reach +/-4GB from the fixup page. If the memory manager were allocating sections independently I would expect occasional crashes when sections connected by fixups happen to be allocated out-of-range. As a stop-gap solution to this the RuntimeDyld::MemoryManager interface provides the needsToReserveAllocationSpace
and reserveAllocationSpace
methods. @vgvassilev pointed me to https://github.com/root-project/root/commit/a7b0b3e647409c7510b38198b08ff94fd079f857 -- It looks like that was attempting to implement those methods to address a similar problem, but I'm not sure it went far enough:
void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
                            uintptr_t RODataSize, uint32_t RODataAlign,
                            uintptr_t RWDataSize, uint32_t RWDataAlign) override {
  m_Code.allocate(getExeMM(),CodeSize, CodeAlign, true, false);
  m_ROData.allocate(getExeMM(),RODataSize, RODataAlign, false, true);
  m_RWData.allocate(getExeMM(),RWDataSize, RWDataAlign, false, false);
  m_jit.m_SectionsAllocatedSinceLastLoad.insert(m_Code.m_Start);
  m_jit.m_SectionsAllocatedSinceLastLoad.insert(m_ROData.m_Start);
  m_jit.m_SectionsAllocatedSinceLastLoad.insert(m_RWData.m_Start);
}
There are independent calls to some 'allocate' function here: Either a slab large enough to accommodate all JIT'd memory was allocated up front (in which case implementing reserveAllocationSpace
was redundant), or these are still separate allocation calls under the hood, in which case you probably still risk having them allocated out-of-range.
A canonical reserveAllocationSpace call looks more like:
void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
                            uintptr_t RODataSize, uint32_t RODataAlign,
                            uintptr_t RWDataSize, uint32_t RWDataAlign) override {
  size_t TotalSize =
      computeRequiredSize(CodeSize, CodeAlign,
                          RODataSize, RODataAlign,
                          RWDataSize, RWDataAlign);
  CurrentSlab = reserve(TotalSize);
}
Then in allocateCodeSection / allocateDataSection you would return pointers into CurrentSlab.
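As a rough illustration of the size computation, the sketch below lays out the three sections back to back inside one slab. Everything in it (alignTo, SlabLayout, layoutSlab) is hypothetical and stands in for the computeRequiredSize step above; it is not LLVM or cling API. It also assumes the slab base is aligned at least as strictly as every section and ignores that code and data pages normally need different protections.

```cpp
#include <cstdint>

// Round `offset` up to the next multiple of `align` (assumed a power of two).
constexpr std::uintptr_t alignTo(std::uintptr_t offset, std::uintptr_t align) {
  return (offset + align - 1) & ~(align - 1);
}

// Offsets of the three sections inside one contiguous slab (hypothetical type).
struct SlabLayout {
  std::uintptr_t codeOffset;
  std::uintptr_t roDataOffset;
  std::uintptr_t rwDataOffset;
  std::uintptr_t totalSize;  // bytes to reserve in a single allocation
};

constexpr SlabLayout layoutSlab(std::uintptr_t codeSize, std::uint32_t codeAlign,
                                std::uintptr_t roDataSize, std::uint32_t roDataAlign,
                                std::uintptr_t rwDataSize, std::uint32_t rwDataAlign) {
  SlabLayout l{};
  l.codeOffset = alignTo(0, codeAlign);  // code starts at the slab base
  l.roDataOffset = alignTo(l.codeOffset + codeSize, roDataAlign);
  l.rwDataOffset = alignTo(l.roDataOffset + roDataSize, rwDataAlign);
  l.totalSize = l.rwDataOffset + rwDataSize;
  return l;
}

// Example: 100 bytes of code, 40 of read-only data, 24 of writable data.
static_assert(layoutSlab(100, 16, 40, 8, 24, 16).totalSize == 168,
              "all three sections fit in one 168-byte reservation");
```

With a layout like this, allocateCodeSection and allocateDataSection would hand out slab base plus the matching offset, so sections connected by fixups can never be farther apart than totalSize.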
FWIW the memory management APIs were redesigned to address this issue in JITLink (LLVM's new JIT linker). The JITLinkMemoryManager interface requires all sections and sizes to be passed in one allocation call, making slab allocation for each object the natural default. JITLink also range checks all allocations and issues runtime errors with clean termination: You would have seen a "relocation target out of range" error with details on the target and fixup location, even in release builds.
There is no JITLink implementation for ELF / aarch64 yet, but we're not far off having one. JITLink for ELF / x86-64 is maturing quickly and aarch64 is the next natural target. This may make life easier in the future.
Thanks @lhames for the detailed explanation!
I am adding @pcanal who implemented this as part of a fix for ROOT-8523.
@lhames's description makes sense to me too and the code seems indeed an improvement. I am not sure though whether this would be enough to address the current issue (which admittedly I do not understand well). In the issue I addressed (if I recall correctly) the problem was mostly about the contiguousness of the code section. ... reading further... I see "the problem is the placement of the constant pool, which seems to end up a little too far away, or a little too far away and in the wrong direction" ... so indeed @lhames's further improvement would solve this.
The routine RPCSimSetUp::setRPCSetUp does a tremendous amount of output formatting which is then never seen because the resulting string is passed to LogDebug. See
I made PR #33071 to #ifdef EDM_ML_DEBUG all the stringstream operations.
Would that fix the original problem -- https://github.com/root-project/root/pull/7419
I don't see why it should: The problem started before the LLVM upgrade, and the linked PR disables GlobalISel, which became enabled as a by-product of the upgrade. I think the initial issue described here is real and happens due to "circumstances" that place a code section more than 4 GB of virtual address space away from the data. That is not something we can really avoid unless we statically pre-allocate memory for all sections that are ever going to be emitted by the JIT. The nicer solution would be if AArch64 supported a relocation that can reference memory across the entire address space...
I agree that it's unlikely to help. What needs to be done is implement the suggestion by @lhames to allocate the data and code in one allocation so they are guaranteed to be close by. I don't know if this would solve every use case, but I'm confident it would fix all the CMS ones that I'm aware of. Is this on anyone's todo list?
Thanks for the explanation @hahnjo!
@dan131riley, I am not aware of it being on anybody’s workplan.
Which is not something we can really avoid unless we statically pre-allocate memory for all sections that are ever going to be emitted by JIT.
In the newer (ORCv2) JIT design there are a couple of ways to approach this problem (and I'm posting here even though you're not on ORCv2 yet, since it touches on relevant topics):
In this case you can safely allocate on a per-object basis even with the small code model. References to variables outside the current object will go via a GOT entry (automatically optimized to direct reference if the external variable ends up being in-range of the JIT'd code), and calls to externals will go via a jump stub (automatically bypassed if the call target ends up being in-range).
In this case we're allowed to assume that the variable will be in the same JITDylib as the reference, which has different implications for address ranges depending on the code model:
Small code model: The code generator is allowed to assume that all references between code and data within the JITDylib can be expressed with direct PC-relative addressing. To satisfy this assumption the client must reserve sufficient address space (and you only need to reserve the address space, you can attach actual memory to it later as needed) for all JIT'd code and data on a per-JITDylib basis up-front. On a 64-bit system this is probably practical. On a 32-bit one it's harder and may become a serious constraint, especially if more than one JITDylib is required.
Large code model: The code generator cannot assume anything about the address-range of references between code and data within a JITDylib. All loads go via a GOT (or by splatting an immediate into a register and loading from that), and all calls are typically indirect via a register. This saves the client from reserving address space up-front, at the cost of some runtime performance (due to the indirection and its potential impacts on prediction and cache performance), and some link-time performance (more relocations may be required).
Large code model is a requirement for MCJIT / ORCv1, since RuntimeDyld never fully dealt with this problem, or implemented the full set of relocations required to support the small code model on key platforms (e.g. arm64, x86-64).
The nicer solution would be if AArch64 supports a relocation that can reference memory across the entire address space...
In ORCv2 this is satisfied by the large code model above. Unfortunately MCJIT / ORCv1 adds an extra twist: Even in the large code model you can usually assume that local code and data within an object file are within range of one another, but the separation of allocateCodeSection / allocateDataSection in RuntimeDyld's memory manager makes it possible (if you don't pre-reserve space for the whole object) to allocate code and data for a single object out-of-range of one another. Using the reserveAllocationSpace trick fixes this for MCJIT / ORCv1.
But it does not for incremental JITting, right? I'm thinking about declaring a large array in the first module that is referenced by code in the second object. Then we need to support arbitrary relocations into the entire address space (unless I'm missing something here).
Thanks a lot @lhames for this explanation!
FYI @dan131riley: @hahnjo will be looking into the object pre-reservation described by @lhames - hoping we can come up with a way to make it work (see his comment above). And IIUC @vgvassilev will tackle the upgrade to ORCv2 this year.
You do need arbitrary relocations into the entire address space to solve this, but it turns out that we already generate them for data references (even in small code model) because regular dynamic linking introduces the same class of problem that you're describing: When you see a declaration like extern int X; in C code there's nothing to tell the compiler/codegen whether X will eventually be part of the same library, or will come from some other dynamic library / shared object. For that reason, even in the small code model, codegen will generate a sequence like this:
movq X@GOT(%rip), %rax ; Materialize address of X into %rax by loading from a GOT entry
movl (%rax), %eax ; Indirectly load actual value of X (from address in %rax)
At link time if X turns out to be defined in your library then the static linker can rewrite this sequence to:
leaq X(%rip), %rax ; PC relative address calculation for X (fast)
movl (%rax), %eax ; Indirectly load actual value of X (from address in %rax)
On the other hand if X turns out not to be defined in your library then the linker synthesizes a GOT (Global Offset Table) entry pointing to X (and a dynamic fixup to patch that entry up at load time), and then you just load the address from the table entry.
The new JIT linker knows all these tricks (both GOT synthesis and how to optimize for in-range targets). The caveat that I drew attention to above is hidden externs. For a hidden extern under the small code model codegen is allowed to generate:
movl X(%rip), %eax ; Directly load X (PC-relative)
There's simply no way to rewrite that to make it safe if X is out-of-range of a PC-relative reference from the movl instruction. That's why hidden externs require you to preallocate address space slabs for whole JITDylibs at a time.
In any case the advice in ORCv2 is to pre-reserve address ranges if possible: It makes hidden externals work, but also guarantees that the range-based optimizations will always fire. If pre-reserving ranges is not possible that's ok too, but in that case you can't use hidden externals (If you do and they're allocated out of range you'll at least get a clean "out-of-range" error from the JIT linker, but the danger is that you'll get lucky a lot of the time and things will silently work right up until your luck runs out, probably in the middle of some critical job).
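The "clean out-of-range error" amounts to checking, before patching an instruction, that the displacement fits the field that will hold it. A minimal sketch for a 32-bit PC-relative field such as the one in the x86-64 examples above; fitsPCRel32 is a hypothetical helper written for illustration, not a JITLink function:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Computes the displacement from NextPC (the address of the byte after the
// instruction) to Target. Returns false if it does not fit a signed 32-bit
// field; Disp is only written on success.
inline bool fitsPCRel32(std::uint64_t NextPC, std::uint64_t Target,
                        std::int32_t &Disp) {
  // Unsigned subtraction wraps; reinterpreting the result as signed gives
  // the (possibly negative) distance between the two addresses.
  std::int64_t Delta = static_cast<std::int64_t>(Target - NextPC);
  if (Delta < std::numeric_limits<std::int32_t>::min() ||
      Delta > std::numeric_limits<std::int32_t>::max())
    return false;
  Disp = static_cast<std::int32_t>(Delta);
  return true;
}
```

A target that ends up 4 GiB away from the referencing instruction fails this check, whereas without the check the truncated displacement would silently point at the wrong address.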
Okay, before continuing the technical discussion, I'd like to take a moment to make sure that everybody is talking about the same problem (because it looks like at least I didn't). I was able to produce a crash on AArch64 Linux with the following two lines in the interactive root interpreter:
root [0] void *ptr = malloc(4L << 30);
root [1] ROOT::RDataFrame(1).Define("x0", "42").Define("x1", "42").Count().GetValue()
I think this backtrace matches https://github.com/cms-sw/cmssw/issues/31123#issuecomment-778318184 and my understanding is that it happens when we're loading libROOTDataFrame.so and its dependencies. Maybe @Axel-Naumann can confirm? (or knows how to) @dan131riley do you think this matches what is happening in CMSSW? (loading your own libraries, of course)
That looks like the same issue, but it isn't during library loading, it's in the process of JITting
ROOT::RDataFrame(1).Define("x0", "42").Define("x1", "42").Count().GetValue()
which is all under
#26 0x0000ffffb56095fc in cling::Interpreter::process (this=0x4b65e0, input="#line 1 \"ROOT_prompt_1\"\nROOT::RDataFrame(1).Define(\"x0\", \"42\").Define(\"x1\", \"42\").Count().GetValue()", V=0xffffffffc860, T=0x0,
disableValuePrinting=false) at /home/sftnight/build/JONAS/root.src/interpreter/cling/lib/Interpreter/Interpreter.cpp:817
My understanding is that, when that line is JITted, the memory for the constants and the code are allocated separately, while the compiler and runtime loader assume that the constants will be near the code -- often, I believe, the constants will be allocated just before the code that references them. Allocating the code and constants independently can violate the assumption of addressability.
That sounds right to me, and fits neatly with the backtrace @hahnjo linked. The discussion above is relevant -- the code examples were x86-64, but equivalent sequences and constraints exist for aarch64.
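As one illustration of an aarch64 constraint of this kind: the ADRP instruction forms a page address from a signed 21-bit count of 4 KiB pages, which gives a reach of roughly +/- 4 GiB around the instruction. The sketch below checks that reach. It is an assumption that this is a useful model here; the thread does not establish that ADRP is the instruction form involved in the crashes, and fitsAdrp is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// True if Target's 4 KiB page is reachable from PC's page with a signed
// 21-bit page delta, i.e. within [-2^20, 2^20 - 1] pages.
inline bool fitsAdrp(std::uint64_t PC, std::uint64_t Target) {
  std::int64_t PageDelta =
      static_cast<std::int64_t>((Target >> 12) - (PC >> 12));
  return PageDelta >= -(1LL << 20) && PageDelta < (1LL << 20);
}
```

With this model, a target exactly 4 GiB above the referencing instruction is already one page out of range, while one 4 GiB below is still reachable.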
Right, I got confused by the auto-loading / -parsing frames in the original backtrace. However, I'm still not sure that the issue is really about constants and global data accesses - all these seem to respect the large code model that makes no assumptions about addressability.
I took a closer look at the crash from my previous comment, and I'm 99% sure that this is coming from a relocation in .eh_frame (allocated as data section) to a code section that is more than 4 GB away. I think there are two roads here:
- Find out whether .eh_frame can cope with relocations across the entire address space and if so, why LLVM doesn't use that in the large code model.
- ... SectionMemoryManager)
- Find out whether .eh_frame can cope with relocations across the entire address space and if so, why LLVM doesn't use that in the large code model.
Well, LLVM does for ppc64 and x86_64 but the cases for aarch64 were missing. This was fixed by https://github.com/llvm/llvm-project/commit/18805ea951be02fcab6e7b11c3c7d929bcf1441a upstream and I've prepared a backport in https://github.com/root-project/root/pull/7563. This at least fixes the case I posted in https://github.com/cms-sw/cmssw/issues/31123#issuecomment-800450831. @dan131riley I would be super grateful if you could apply these two lines and test on your side (edit: or generally after the upgrade to LLVM 9, just to make sure it's not already fixed deeper down in the stack).
@hahnjo I can't reproduce your example https://github.com/cms-sw/cmssw/issues/31123#issuecomment-800450831. Can you describe a bit further which machine you used and which ROOT version? I tried 6.22 and master on Arm.
Sure: I was building ROOT master on techlab-arm64-thunderx2-01 in full Debug mode (-DCMAKE_BUILD_TYPE=Debug -DLLVM_BUILD_TYPE=Debug) in order to get all asserts. It doesn't really matter if you can reproduce my example (I mean, I made sure that this one is fixed), but the important data point would be if that also fixes the crash in CMSSW that @dan131riley was able to produce. If the crash is still there, it must be a different problem and I have to dig deeper (with an improved example for testing).
Right, thank you for the clarification. I'm using a similar machine (probably almost the same) so I'm going to change the flags and retry, however ...
In CMSSW we are using ROOT 6.22, not master (currently this one, to be specific), and I tried to reproduce your PR from master on 6.22 here: https://github.com/cms-sw/root/commit/fae0f05c92383de4b8d98856444436d1234a8b78 If you confirm this is how the backport should look, I'll merge the change and build an integration build (IB) release so this change is available in a release for convenient use.
@mrodozov the backport seems fine, but do you need a full release to test if the crash is gone? I had hoped that you have local development builds of CMS-SW, plus you really need a Debug build in order to see the assert. (I plan to backport this to v6-22-patches once it lands in ROOT master, but I wanted confirmation that it really fixes the issue)
@hahnjo I do have a local dev area with a Debug ROOT that I can hook into a recent CMSSW IB, hopefully will get to trying your patch sometime today.
@hahnjo, the backport you suggest is in any case good to have. Why don’t we just go ahead and merge it and then cmssw can pick up the new master and will get an answer probably by tomorrow?
we are building this release anyway, it's just a minor edit for the Arm build and a few hours earlier. it's not only to check if the example crash is gone, I want to see if it fixes pieces in cmssw. and a release with debug flags for ROOT will also be helpful
It looks like the patch by @hahnjo does fix the problem. I ran around a hundred test jobs using the backport by @mrodozov in a Debug build, with no assertion failures or crashes observed.
Thanks for testing! I backported the fix for AArch64 and a similar commit for PowerPC to 6.22 in https://github.com/root-project/root/pull/7607 and the fix will also be included in 6.24. In case you see the issue come back in later testing, please ping me :smiley:
@dan131riley would you please share with us which workflows you ran as examples that didn't crash after the last ROOT change, or maybe a workflow that you see in the IBs that doesn't fail anymore?
After switching to run the IB RelVals using multiple threads, we are seeing 'random' crashes in the aarch64 builds.