cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Probably thread related crashes in aarch64 IBs #31123

Closed Dr15Jones closed 3 years ago

Dr15Jones commented 4 years ago

After switching to run the IB RelVals using multiple threads, we are seeing 'random' crashes in the aarch64 builds.

dan131riley commented 4 years ago

With regard to the crashes in multiple threads: on arm64 it is possible to get asynchronous processor exceptions on stores, so it's not hard to get a race condition where multiple threads execute the same bad store within the async store window. We might have a race condition in TCling that's generating bad code, which could account for all the threads failing, but so far I haven't identified a candidate race condition.

Even with imprecise exceptions, I still have a hard time coming up with scenarios for segfaults that appear to be in sched_yield() or other syscalls (there was one today with a segfault apparently in clock_gettime()). I'd expect a normal segfault to be imprecise by only a small number of instructions. There are more exotic scenarios where a fault isn't detected until a dirty write cache flush, but I don't see how we would hit those. Maybe if there's a context switch?

schneiml commented 4 years ago

@makortel The point of the test is to prove that there are no unnecessary duplications of histograms, by reducing memory so much that it would run out if there were another set of duplicates.

I think I limited virtual memory because it worked more reliably with ulimit in my tests (making sure the test does actually fail when more histograms are allocated). It is quite possible that this causes problems on the "non-standard" systems; also, the limit is not very sharp, so maybe just bumping it up a few percent is enough.

makortel commented 3 years ago

Here is a crash from hlt_mc_GRun step2 addOn test. Interestingly all streams are running EventSetupRecordDataGetter:hltGetConditions. http://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-10-07-2300/addOnTests/logs/cmsDriver-hlt_mc_GRun_cmsRun__cvmfs_cms-ib.cern.ch_nweek-02649_slc7_aarch64_gcc820_cms_cmssw-patch_CMSSW_11_2_X_2020-10-07-2300_src_HLTrigger_Configuration_test_OnLine_HLT_.log

Thread 5 (Thread 0xffff52738460 (LWP 75809)):
#2  0x0000ffffadec4380 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffffafb86908 in sched_yield () from /lib64/libc.so.6
#5  0x0000ffffb012e8d4 in __TBB_Pause () at ../../include/tbb/tbb_machine.h:332
#6  tbb::internal::prolonged_pause () at ../../src/tbb/scheduler_common.h:322
#7  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0xffffaf1dbe00, completion_ref_count=@0xffff4ba2cd28: 2, isolation=0) at ../../src/tbb/custom_scheduler.h:305
#8  0x0000ffffb012fa20 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffaf1dbe00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#9  0x0000ffffb012ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffb1d34f20 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#10 0x0000ffffb1b0e648 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#11 0x0000ffffb1b7cff8 in edm::eventsetup::EventSetupRecordImpl::doGet(edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#12 0x0000ffff55d50c70 in edm::EventSetupRecordDataGetter::doGet(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreModules.so
#13 0x0000ffffb1c67300 in edm::stream::EDAnalyzerAdaptorBase::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#14 0x0000ffffb1c41278 in edm::WorkerT<edm::stream::EDAnalyzerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#15 0x0000ffffb1b52304 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#16 0x0000ffffb1b52524 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#17 0x0000ffffb1b5269c in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#18 0x0000ffffb1b52770 in edm::FunctorTask<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so

Thread 4 (Thread 0xffff53168460 (LWP 75808)):
#2  0x0000ffffadec4380 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffffafb86908 in sched_yield () from /lib64/libc.so.6
#5  0x0000ffffb012e9fc in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0xffffaf20be00, completion_ref_count=@0xffff4ba34428: 2, isolation=0) at ../../src/tbb/mailbox.h:225
#6  0x0000ffffb012fa20 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffaf20be00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#7  0x0000ffffb012ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffb1d34f20 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#8  0x0000ffffb1b0e648 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#9  0x0000ffffb1b7cff8 in edm::eventsetup::EventSetupRecordImpl::doGet(edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#10 0x0000ffff55d50c70 in edm::EventSetupRecordDataGetter::doGet(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreModules.so
#11 0x0000ffffb1c67300 in edm::stream::EDAnalyzerAdaptorBase::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#12 0x0000ffffb1c41278 in edm::WorkerT<edm::stream::EDAnalyzerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#13 0x0000ffffb1b52304 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#14 0x0000ffffb1b52524 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#15 0x0000ffffb1b5269c in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#16 0x0000ffffb1b52770 in edm::FunctorTask<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so

Thread 3 (Thread 0xffff53b78460 (LWP 75807)):
#2  0x0000ffffadec4380 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffffafe9a4ec in std::basic_streambuf<char, std::char_traits<char> >::xsgetn (this=0xffff53b76d50, __s=0xffff53b7688c "\222\376\377\377\340h\267S\377\377", __n=1) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:49
#5  0x0000ffffac8841a4 in boost::archive::basic_binary_iprimitive<eos::portable_iarchive, char, std::char_traits<char> >::load_binary(void*, unsigned long) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libCondFormatsAlignment.so
#6  0x0000ffffa745ffb4 in boost::enable_if<boost::is_integral<int>, void>::type eos::portable_iarchive::load<int>(int&, eos::portable_iarchive::dummy<2>) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libCondFormatsRunInfo.so
#7  0x0000ffffa74669e4 in boost::archive::detail::iserializer<eos::portable_iarchive, std::vector<int, std::allocator<int> > >::load_object_data(boost::archive::detail::basic_iarchive&, void*, unsigned int) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libCondFormatsRunInfo.so
#8  0x0000ffffa836769c in boost::archive::detail::basic_iarchive::load_object(void*, boost::archive::detail::basic_iserializer const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-10-07-2300/external/slc7_aarch64_gcc820/lib/libboost_serialization.so.1.72.0
#9  0x0000ffffa263363c in void GBRTreeD::serialize<eos::portable_iarchive>(eos::portable_iarchive&, unsigned int) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libCondFormatsEgammaObjects.so
#10 0x0000ffffa836769c in boost::archive::detail::basic_iarchive::load_object(void*, boost::archive::detail::basic_iserializer const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-10-07-2300/external/slc7_aarch64_gcc820/lib/libboost_serialization.so.1.72.0
#11 0x0000ffffa2637da4 in boost::archive::detail::iserializer<eos::portable_iarchive, std::vector<GBRTreeD, std::allocator<GBRTreeD> > >::load_object_data(boost::archive::detail::basic_iarchive&, void*, unsigned int) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libCondFormatsEgammaObjects.so
#12 0x0000ffffa836769c in boost::archive::detail::basic_iarchive::load_object(void*, boost::archive::detail::basic_iserializer const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-10-07-2300/external/slc7_aarch64_gcc820/lib/libboost_serialization.so.1.72.0
#13 0x0000ffffa836769c in boost::archive::detail::basic_iarchive::load_object(void*, boost::archive::detail::basic_iserializer const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-10-07-2300/external/slc7_aarch64_gcc820/lib/libboost_serialization.so.1.72.0
#14 0x0000ffff54c1241c in std::unique_ptr<GBRForestD, std::default_delete<GBRForestD> > cond::default_deserialize<GBRForestD>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cond::Binary const&, cond::Binary const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginCondCoreEgammaPlugins.so
#15 0x0000ffff54c128dc in std::unique_ptr<GBRForestD, std::default_delete<GBRForestD> > cond::persistency::Session::fetchPayload<GBRForestD>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginCondCoreEgammaPlugins.so
#16 0x0000ffff54c12c34 in cond::persistency::PayloadProxy<GBRForestD>::loadPayload() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginCondCoreEgammaPlugins.so
#17 0x0000ffff54c0afb4 in DataProxy<GBRDWrapperRcd, GBRForestD, cond::DefaultInitializer<GBRForestD> >::prefetch(edm::eventsetup::DataKey const&, edm::EventSetupRecordDetails) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginCondCoreEgammaPlugins.so
#18 0x0000ffffb1b22404 in edm::SerialTaskQueue::QueuedTask<edm::eventsetup::ESSourceDataProxyBase::prefetchAsyncImpl(edm::WaitingTask*, edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#19 0x0000ffffb012f648 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0xffffaf1e3e00, context_guard=..., t=t@entry=0xffff4b9ffb40, isolation=isolation@entry=0) at ../../include/tbb/machine/gcc_generic.h:101
#20 0x0000ffffb012f89c in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffaf1e3e00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#21 0x0000ffffb012ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffb1d34f20 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#22 0x0000ffffb1b0e648 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#23 0x0000ffffb1b7cff8 in edm::eventsetup::EventSetupRecordImpl::doGet(edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#24 0x0000ffff55d50c70 in edm::EventSetupRecordDataGetter::doGet(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreModules.so
#25 0x0000ffffb1c67300 in edm::stream::EDAnalyzerAdaptorBase::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#26 0x0000ffffb1c41278 in edm::WorkerT<edm::stream::EDAnalyzerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#27 0x0000ffffb1b52304 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#28 0x0000ffffb1b52524 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#29 0x0000ffffb1b5269c in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#30 0x0000ffffb1b52770 in edm::FunctorTask<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so

Thread 1 (Thread 0xffffaf3e0000 (LWP 160662)):
#3  0x0000ffffadec6194 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000ffffafb86908 in sched_yield () from /lib64/libc.so.6
#6  0x0000ffffb012e8d4 in __TBB_Pause () at ../../include/tbb/tbb_machine.h:332
#7  tbb::internal::prolonged_pause () at ../../src/tbb/scheduler_common.h:322
#8  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0xffffaf2ca600, completion_ref_count=@0xffff4bb90828: 2, isolation=0) at ../../src/tbb/custom_scheduler.h:305
#9  0x0000ffffb012fa20 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffaf2ca600, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#10 0x0000ffffb012ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffb1d34f20 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#11 0x0000ffffb1b0e648 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#12 0x0000ffffb1b7cff8 in edm::eventsetup::EventSetupRecordImpl::doGet(edm::eventsetup::DataKey const&, edm::EventSetupImpl const*, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#13 0x0000ffff55d50c70 in edm::EventSetupRecordDataGetter::doGet(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/pluginFWCoreModules.so
#14 0x0000ffffb1c67300 in edm::stream::EDAnalyzerAdaptorBase::doStreamBeginRun(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#15 0x0000ffffb1c41278 in edm::WorkerT<edm::stream::EDAnalyzerAdaptorBase>::implDoStreamBegin(edm::StreamID, edm::RunTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#16 0x0000ffffb1b52304 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#17 0x0000ffffb1b52524 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#18 0x0000ffffb1b5269c in edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#19 0x0000ffffb1b52770 in edm::FunctorTask<edm::Worker::doWorkNoPrefetchingAsync<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1> >(edm::WaitingTask*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::ServiceToken const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02649/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-10-04-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so

Dr15Jones commented 3 years ago

Interestingly all streams are running EventSetupRecordDataGetter:hltGetConditions.

That isn't a big surprise to me if they are all waiting on a 'long' running ES data product to be produced. What I find surprising is the segmentation fault is reported to come from sched_yield! That's a first for me.

Dr15Jones commented 3 years ago

One item from the crash says we should update EventSetupRecordDataGetter so that it uses ES consumes.

dan131riley commented 3 years ago

I'm not having any success replicating the TFormula crashes. Last year I ran a wf in a loop in gdb overnight with no failures, figured it was probably one of those heisenbugs that go away in gdb. Wrote a simple test program to stress multithreaded TFormula, copying PFRecoTauDiscriminationByIsolationContainer which has a formula that is simply a constant; no crashes. So now I'm running wf 4.43 on techlab-arm64-thunderx-02 in a loop (no gdb); ran overnight with no crashes.

So I'm puzzled. Are all our aarch64 systems the same model of CPU? Can we tell which host a relval ran on (maybe I should add the hostname to the stack trace output)?

(And...shortly after I posted this, it crashed!)

dpiparo commented 3 years ago

Is this something to be turned into a bug report for ROOT? With the effort Dan put into writing the standalone program, it should be easy to hand over...

dan131riley commented 3 years ago

The standalone program never reproduced the problem, and the full WF run manually crashes at a rate that seems inconsistent with the observed rate in the IBs. I suspect it is very timing/load dependent.

dan131riley commented 3 years ago

With PR #32810 I was able to poke around the stack and the segment map of one of these crashes, where the last two stack frames were

#5  0x0000ffff400000ac in ?? ()
#6  0x0000fffffdd9f0f0 in ?? ()

Looking at the segment map of the process, frame 5 was in 64kb of dirty (written to) memory not associated with any shared object, consistent with JITted code, and the area disassembled into reasonable looking ARM64 code. Frame 6 was on the stack, and the stack did look to have been scrambled. Setting the stack pointer to points in the stack that looked like stack frames, I did find somewhat valid stack frames for a call from PFEnergyCalibration::aEndcap(double), consistent with a crash in code generated via PerformancePayloadFromTFormula. This is all looking very suggestive of a data race in the JIT code generation.
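
The segment-map check described above can be sketched as follows. This is a minimal illustration, assuming the map is in the Linux /proc/<pid>/maps text format; the `SAMPLE` entries (address ranges, permissions, and the library path) are hypothetical and were not taken from the crashed process. Only the frame 5 address comes from the backtrace.

```python
from typing import List, NamedTuple, Optional

class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str  # empty string for anonymous mappings (no backing file)

def parse_maps(text: str) -> List[Mapping]:
    """Parse text in /proc/<pid>/maps format into Mapping records."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) == 6 else ""
        mappings.append(Mapping(start, end, fields[1], path))
    return mappings

def find_mapping(mappings: List[Mapping], address: int) -> Optional[Mapping]:
    """Return the mapping containing address, or None if unmapped."""
    for m in mappings:
        if m.start <= address < m.end:
            return m
    return None

# HYPOTHETICAL sample map, for illustration only.
SAMPLE = """\
ffff40000000-ffff40010000 rwxp 00000000 00:00 0
ffff53700000-ffff53800000 r-xp 00000000 00:2d 1234 /path/to/libExample.so
"""

frame5 = 0x0000FFFF400000AC  # frame 5 address from the backtrace
hit = find_mapping(parse_maps(SAMPLE), frame5)
if hit is not None and hit.path == "":
    # No backing file: the address is not in any shared object.
    print(f"{frame5:#x} is in an anonymous {hit.end - hit.start:#x}-byte region")
```

An address that falls in a writable, executable region with no pathname is what one would expect for JIT-generated code, whereas code from a shared object resolves to that library's path.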

dan131riley commented 3 years ago

Continuing to poke at WF 136.776, I have a crash

#5  0x0000ffff40000094 in ?? ()
#6  0x0000ffff537dba74 in PerformancePayloadFromTFormula::getResult(PerformanceResult::ResultType, BinningPointByMap const&) const () from /cvmfs/cms-ib.cern.ch/week0/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libCondFormatsPhysicsToolsObjects.so
#7  0x0000ffff537dba74 in PerformancePayloadFromTFormula::getResult(PerformanceResult::ResultType, BinningPointByMap const&) const () from /cvmfs/cms-ib.cern.ch/week0/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libCondFormatsPhysicsToolsObjects.so
#8  0x0000ffff490fe790 in ?? ()

where the TFormula is apparently very simple; the routine at the crash address disassembles to

   0x0000ffff40000090:  adrp    x8, 0x1000036d00000
   0x0000ffff40000094:  ldr     d0, [x8, #72]                        ; crash
   0x0000ffff40000098:  fmov    d1, xzr
   0x0000ffff4000009c:  ldr     d2, [x0]
   0x0000ffff400000a0:  fmul    d1, d2, d1
   0x0000ffff400000a4:  fadd    d0, d1, d0
   0x0000ffff400000a8:  ret

The adrp instruction calculates a page-aligned PC-relative address, and the following ldr accesses an offset from that page-aligned address, and that's where it crashes. The offset looks suspicious: it's larger than the address range being used, but it isn't negative. It might be 2**32 off of a reasonable address.
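
To make the "2**32 off" suspicion concrete, here is the arithmetic on the addresses from the disassembly above, as a small sketch. The helper names are mine, and the only assumption is the standard adrp semantics (a signed 21-bit immediate counting 4 KiB pages relative to the page containing the instruction). It shows what the numbers look like under the 2**32 hypothesis and says nothing about why the code generator would emit the wrong value.

```python
PAGE_SIZE = 0x1000  # adrp works in units of 4 KiB pages

def adrp_delta(pc: int, target_page: int) -> int:
    """Byte distance from the page containing pc to target_page."""
    return target_page - (pc & ~(PAGE_SIZE - 1))

def imm21(delta: int) -> int:
    """The 21-bit page immediate adrp would carry for this byte delta."""
    return (delta >> 12) & 0x1FFFFF

pc = 0x0000FFFF40000090    # address of the adrp instruction
target = 0x1000036D00000   # page address shown for the adrp
delta = adrp_delta(pc, target)
shifted = delta - 2**32    # hypothesis: the offset is 2**32 too large

print(hex(delta))                          # 0xf6d00000
print(hex(target + 72))                    # 0x1000036d00048 (address the ldr reads)
print(hex(shifted))                        # -0x9300000
print(hex(target - 2**32))                 # 0xffff36d00000
print(hex(imm21(delta) ^ imm21(shifted)))  # 0x100000
```

Under that hypothesis the data would sit at 0xffff36d00000, about 147 MiB below the code and inside the 0xffff... range the process is actually using. The two candidate immediates differ only in bit 20, the sign bit of the 21-bit field.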

I have a similar crash on techlab-arm64-thunderx-02, WF 10809.0. The TFormula is evidently more complicated, as the disassembly is quite a bit larger, but again the crash is in the preamble finding the data:

Thread 5 (Thread 0x3ff487c83c0 (LWP 11028)):
#4  <signal handler called>
#5  0x000003ff400000b8 in ?? ()
#6  0x000003ff487c7330 in ?? ()
#7  0x000c00120003305b in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 4 (Thread 0x3ff491d83c0 (LWP 11027)):
#6  <signal handler called>
#7  0x000003ff400000b8 in ?? ()
#8  0x000003ff491d7330 in ?? ()
#9  0x000c00120003305b in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 3 (Thread 0x3ff49be83c0 (LWP 11026)):
#6  <signal handler called>
#7  0x000003ff400000b8 in ?? ()
#8  0x000003ff49be7330 in ?? ()
#9  0x000c00120003305b in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Thread 1 (Thread 0x3ffaf0a0000 (LWP 10771)):
#11 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

where disassembled we get:

   0x000003ff400000a8:  sub sp, sp, #0x60
   0x000003ff400000ac:  stp x29, x30, [sp, #80]
   0x000003ff400000b0:  add x29, sp, #0x50
   0x000003ff400000b4:  adrp    x8, 0x40036640000
   0x000003ff400000b8:  ldr d0, [x8, #72]                          ;crash

and again the adrp offset looks suspiciously larger than the address range in use. This could be a data race, or it could be an issue with whether or not thread 1 calls TFormula first and how the address space gets laid out. I suspect the latter. TFormula makes sure a fn only gets compiled once, so once it is mis-compiled all the threads that reach the fn segfault at the same place.

Neither of these reproduces in gdb; I can only get at them with the interactive-debug PR. gdb on 10809 on the techlab machine does reproduce crashes in

onnxruntime::SessionState::UpdateMemoryPatternGroupCache()

so it seems there's a real data race there.

dan131riley commented 3 years ago

I still see the crashes after updating to cms-root origin/cms/master/a001679, which has https://github.com/root-project/root/pull/6218 (I had assumed that PR wouldn't fix these particular crashes, but it seemed like time to verify it--this did involve rebuilding ROOT, DD4Hep, and all of CMSSW). I'm now testing cmsRunGlibC, and in several hundred tries have not seen any TFormula crashes, so it seems the problem is related to both multithreading and memory allocation.

I've got the simplest example yet, where the TFormula is simply loading a constant:

   0x0000ffff40000078:  adrp    x8, 0x1000035166000
   0x0000ffff4000007c:  ldr d0, [x8, #72]
   0x0000ffff40000080:  ret

The common theme in all of these is the appearance that somehow a calculation involving a memory address at 0x0000ffff... has overflowed to 0x00010000..., yielding an offset into unmapped space. I think it's time to get the cling experts involved.
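
As a rough check of that pattern, the sketch below takes the three adrp PC/target pairs quoted so far and tests whether subtracting 2**32 from the target brings it back into the same 4 GiB region as the code. The struct and function names are hypothetical, and passing the check only shows the values are consistent with the "2**32 off" guess; it does not establish where the carry comes from.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kFourGiB = 1ULL << 32;

// One adrp instruction: its own address and the target the disassembler prints.
struct AdrpSample {
  std::uint64_t pc;
  std::uint64_t target;
};

constexpr AdrpSample kSamples[] = {
    {0x0000ffff40000090ULL, 0x0001000036d00000ULL},  // WF 136.776
    {0x000003ff400000b4ULL, 0x0000040036640000ULL},  // WF 10809.0 on the thunderx machine
    {0x0000ffff40000078ULL, 0x0001000035166000ULL},  // TFormula that only loads a constant
};

// True if both addresses share the same upper 32 bits.
constexpr bool sameUpper32(std::uint64_t a, std::uint64_t b) {
  return (a >> 32) == (b >> 32);
}

// The target is outside the code's 4 GiB region as printed, but inside it
// once 2**32 is subtracted.
constexpr bool consistentWithFourGiBCarry(const AdrpSample& s) {
  return !sameUpper32(s.pc, s.target) && sameUpper32(s.pc, s.target - kFourGiB);
}

static_assert(consistentWithFourGiBCarry(kSamples[0]), "WF 136.776");
static_assert(consistentWithFourGiBCarry(kSamples[1]), "WF 10809.0");
static_assert(consistentWithFourGiBCarry(kSamples[2]), "constant-only TFormula");
```

All three samples pass, including the thunderx one where the address space tops out at 0x3ff... instead of 0xffff....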

makortel commented 3 years ago

I think it's time to get the cling experts involved.

Thanks Dan, I agree.

Dr15Jones commented 3 years ago

@dan131riley could you make a new issue about race condition in

onnxruntime::SessionState::UpdateMemoryPatternGroupCache()

?

dan131riley commented 3 years ago

Oh hey, bingo! Debug build gets an assertion failure in RuntimeDyldELF::resolveAArch64Relocation()!

https://github.com/root-project/root/blob/f8efb11a51cbe5b5152ebef19a4f7b78744ca2fa/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp#L361-L367

@vgvassilev can you take a look at this stack trace and suggest how to proceed? It's definitely wrapping around the uint64_t:

(gdb) p/x Value
$2 = 0xfffe5e94726c
(gdb) p/x Addend
$3 = 0x0
(gdb) p/x FinalAddress
$4 = 0xffff20001470
(gdb) p/x Result
$5 = 0xffffffff3e945dfc
(gdb) 
Thread 1 (Thread 0xffff86a33010 (LWP 3947823)):
#5  0x0000ffff86f15c1c in raise () from /lib64/libc.so.6
#6  0x0000ffff86f037a8 in abort () from /lib64/libc.so.6
#7  0x0000ffff86f0f2e8 in __assert_fail_base () from /lib64/libc.so.6
#8  0x0000ffff86f0f350 in __assert_fail () from /lib64/libc.so.6
#9  0x0000ffff3b57d8f0 in llvm::RuntimeDyldELF::resolveAArch64Relocation (this=0xffff4ee45400, Section=..., Offset=400, Value=281467973562988, Type=261, Addend=0) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:363
#10 0x0000ffff3b57fe90 in llvm::RuntimeDyldELF::resolveRelocation (this=0xffff4ee45400, Section=..., Offset=400, Value=281467973562988, Type=261, Addend=0, SymOffset=0, SectionID=3) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:895
#11 0x0000ffff3b57fd54 in llvm::RuntimeDyldELF::resolveRelocation (this=0xffff4ee45400, RE=..., Value=281467973562988) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:877
#12 0x0000ffff3b55d5c0 in llvm::RuntimeDyldImpl::resolveRelocationList (this=0xffff4ee45400, Relocs=..., Value=281467973562988) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:957
#13 0x0000ffff3b5596d0 in llvm::RuntimeDyldImpl::resolveRelocations (this=0xffff4ee45400) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:145
#14 0x0000ffff3b55e1f8 in llvm::RuntimeDyld::resolveRelocations (this=0xffffc631ce58) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1140
#15 0x0000ffff3b55e2e4 in llvm::RuntimeDyld::finalizeWithMemoryManagerLocking (this=0xffffc631ce58) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1158
#16 0x0000ffff3a084fc0 in llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}::operator()(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>) const (__closure=0xffff50367460, H=Python Exception <type 'exceptions.ValueError'> Cannot find type llvm::orc::RTDyldObjectLinkingLayerBase::ObjHandleT::_Node: 
, RTDyld=..., ObjToLoad=std::shared_ptr<class llvm::object::OwningBinary<llvm::object::ObjectFile>> (use count 1, weak count 0) = {...}, LOSHandleLoad=...) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:274
#17 0x0000ffff3a0937e8 in llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager>, std::shared_ptr<llvm::JITSymbolResolver>, llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}>::finalize() (this=0xffff4e64ed00) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:143
#18 0x0000ffff3a093870 in llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager>, std::shared_ptr<llvm::JITSymbolResolver>, llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}>::getSymbolMaterializer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda()#1}::operator()() const (this=0xffff4e64ed00) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:158
#19 0x0000ffff3a0940cc in std::_Function_handler<llvm::Expected<unsigned long> (), llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager>, std::shared_ptr<llvm::JITSymbolResolver>, llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}>::getSymbolMaterializer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /cvmfs/cms-ib.cern.ch/nweek-02666/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:286
#20 0x0000ffff3a0781a8 in std::function<llvm::Expected<unsigned long> ()>::operator()() const (this=0xffffc631d000) at /cvmfs/cms-ib.cern.ch/nweek-02666/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:688
#21 0x0000ffff3a0773d4 in llvm::JITSymbol::getAddress (this=0xffffc631d000) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/JITSymbol.h:201
#22 0x0000ffff3a08a200 in llvm::orc::LazyEmittingLayer<llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler> >::EmissionDeferredModule::find(llvm::StringRef, bool, llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler>&)::{lambda()#1}::operator()() const (this=0xffff50367240) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/LazyEmittingLayer.h:75
#23 0x0000ffff3a08e850 in std::_Function_handler<llvm::Expected<unsigned long> (), llvm::orc::LazyEmittingLayer<llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler> >::EmissionDeferredModule::find(llvm::StringRef, bool, llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler>&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /cvmfs/cms-ib.cern.ch/nweek-02666/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:286
#24 0x0000ffff3a0781a8 in std::function<llvm::Expected<unsigned long> ()>::operator()() const (this=0xffffc631d158) at /cvmfs/cms-ib.cern.ch/nweek-02666/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:688
#25 0x0000ffff3a0773d4 in llvm::JITSymbol::getAddress (this=0xffffc631d158) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/JITSymbol.h:201
#26 0x0000ffff3a077780 in cling::IncrementalJIT::getSymbolAddress (this=0xffff40418f00, Name="_GLOBAL__sub_I_cling_module_324", AlsoInProcess=false) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalJIT.h:194
#27 0x0000ffff3a0784ec in cling::IncrementalExecutor::jitInitOrWrapper<void (*)()> (this=0xffff405b1340, funcname=..., fun=@0xffffc631d2a0: 0x0) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalExecutor.h:275
#28 0x0000ffff3a0779d0 in cling::IncrementalExecutor::executeInit (this=0xffff405b1340, function=...) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalExecutor.h:265
#29 0x0000ffff3a076970 in cling::IncrementalExecutor::runStaticInitializersOnce (this=0xffff405b1340, T=...) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:262
#30 0x0000ffff39f6adc8 in cling::Interpreter::executeTransaction (this=0xffff402b1400, T=...) at /home/dsr/root/interpreter/cling/lib/Interpreter/Interpreter.cpp:1691
#31 0x0000ffff3a09657c in cling::IncrementalParser::commitTransaction (this=0xffff40251c00, PRT=..., ClearDiagClient=true) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalParser.cpp:613
#32 0x0000ffff3a096c70 in cling::IncrementalParser::Compile (this=0xffff40251c00, input=..., Opts=...) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalParser.cpp:769
#33 0x0000ffff39f697c4 in cling::Interpreter::DeclareInternal (this=0xffff402b1400, input="\n#define __ROOTCLING__ 1\n#undef ClassDef\n#define ClassDef(name,id) \\\n_ClassDefOutline_(name,id,virtual,) \\\nstatic int DeclFileLine() { return __LINE__; }\n#undef ClassDefNV\n#define ClassDefNV(name, id)"..., CO=..., T=0x0) at /home/dsr/root/interpreter/cling/lib/Interpreter/Interpreter.cpp:1338
#34 0x0000ffff39f68384 in cling::Interpreter::parseForModule (this=0xffff402b1400, input="\n#define __ROOTCLING__ 1\n#undef ClassDef\n#define ClassDef(name,id) \\\n_ClassDefOutline_(name,id,virtual,) \\\nstatic int DeclFileLine() { return __LINE__; }\n#undef ClassDefNV\n#define ClassDefNV(name, id)"...) at /home/dsr/root/interpreter/cling/lib/Interpreter/Interpreter.cpp:922
#35 0x0000ffff39d86254 in ExecAutoParse (what=0xffff80ac0e38 "\n#line 1 \"DataFormatsTrackReco_xr dictionary payload\"\n\n#ifndef CMS_DICT_IMPL\n  #define CMS_DICT_IMPL 1\n#endif\n#ifndef _REENTRANT\n  #define _REENTRANT 1\n#endif\n#ifndef GNUSOURCE\n  #define GNUSOURCE 1\n#"..., header=false, interpreter=0xffff402b1400) at /home/dsr/root/core/metacling/src/TCling.cxx:6232
#36 0x0000ffff39d86944 in TCling::AutoParseImplRecurse (this=0xffff40418b80, cls=0xffff4b3251a0 "vector<reco::Track>", topLevel=false) at /home/dsr/root/core/metacling/src/TCling.cxx:6337
#37 0x0000ffff39d86c30 in TCling::AutoParseImplRecurse (this=0xffff40418b80, cls=0xffff4b3d8400 "edm::refhelper::FindUsingAdvance<vector<reco::Track>,reco::Track>", topLevel=true) at /home/dsr/root/core/metacling/src/TCling.cxx:6373
#38 0x0000ffff39d86f34 in TCling::AutoParse (this=0xffff40418b80, cls=0xffff4b3d8400 "edm::refhelper::FindUsingAdvance<vector<reco::Track>,reco::Track>") at /home/dsr/root/core/metacling/src/TCling.cxx:6422
#39 0x0000ffff39d72034 in TClingLookupHelper__AutoParse (cname=0xffff4b3d8400 "edm::refhelper::FindUsingAdvance<vector<reco::Track>,reco::Track>") at /home/dsr/root/core/metacling/src/TCling.cxx:900
#40 0x0000ffff39c1500c in ROOT::TMetaUtils::TClingLookupHelper::GetPartiallyDesugaredNameWithScopeHandling (this=0xffff44e45740, tname="edm::refhelper::FindUsingAdvance<vector<reco::Track>,reco::Track>", result="", dropstd=true) at /home/dsr/root/core/clingutils/src/TClingUtils.cxx:626
#41 0x0000ffff87cb81d4 in TClassEdit::TSplitType::ShortType (this=0xffffc631e588, answ="", mode=3618) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:437
#42 0x0000ffff87cbb5b0 in TClassEdit::ShortType[abi:cxx11](char const*, int) (typeDesc=0xffff4b39a580 "edm::RefVector<std::vector<reco::Track>,reco::Track,edm::refhelper::FindUsingAdvance<std::vector<reco::Track>,reco::Track> >", mode=3618) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:1292
#43 0x0000ffff87cb80dc in TClassEdit::TSplitType::ShortType (this=0xffffc631e6d8, answ="", mode=3618) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:429
#44 0x0000ffff87cbb5b0 in TClassEdit::ShortType[abi:cxx11](char const*, int) (typeDesc=0xffff4b3d84a0 "std::vector<edm::RefVector<std::vector<reco::Track>,reco::Track,edm::refhelper::FindUsingAdvance<std::vector<reco::Track>,reco::Track> > >", mode=3618) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:1292
#45 0x0000ffff87cb80dc in TClassEdit::TSplitType::ShortType (this=0xffffc631e870, answ="", mode=3618) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:429
#46 0x0000ffff87cb94f0 in TClassEdit::GetNormalizedName (norm_name="", name="edm::AssociationVector<edm::RefToBaseProd<reco::Jet>,std::vector<edm::RefVector<std::vector<reco::Track>,reco::Track,edm::refhelper::FindUsingAdvance<std::vector<reco::Track>,reco::Track> > >,edm::Ref"...) at /home/dsr/root/core/foundation/src/TClassEdit.cxx:851
#47 0x0000ffff87cdae18 in TClass::GetClass (name=0xffff4a5dcc40 "edm::AssociationVector<edm::RefToBaseProd<reco::Jet>,std::vector<edm::RefVector<std::vector<reco::Track>,reco::Track,edm::refhelper::FindUsingAdvance<std::vector<reco::Track>,reco::Track> > >,edm::Ref"..., load=true, silent=false, hint_pair_offset=0, hint_pair_size=0) at /home/dsr/root/core/meta/src/TClass.cxx:3032
#48 0x0000ffff87cdab14 in TClass::GetClass (name=0xffff4a5dcc40 "edm::AssociationVector<edm::RefToBaseProd<reco::Jet>,std::vector<edm::RefVector<std::vector<reco::Track>,reco::Track,edm::refhelper::FindUsingAdvance<std::vector<reco::Track>,reco::Track> > >,edm::Ref"..., load=true, silent=false) at /home/dsr/root/core/meta/src/TClass.cxx:2948
#49 0x0000ffff88e411d4 in edm::TypeWithDict::byName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreReflection.so
#50 0x0000ffff88e3d05c in edm::TypeWithDict::byName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreReflection.so
#51 0x0000ffff88f06e80 in edm::BranchDescription::initFromDictionary() () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libDataFormatsProvenance.so
#52 0x0000ffff88f08168 in edm::BranchDescription::BranchDescription(edm::BranchType const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, edm::Hash<1> const&, edm::TypeWithDict const&, bool, bool, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libDataFormatsProvenance.so
#53 0x0000ffff89440b18 in edm::ProductRegistryHelper::addToRegistry(__gnu_cxx::__normal_iterator<edm::ProductRegistryHelper::TypeLabelItem const*, std::vector<edm::ProductRegistryHelper::TypeLabelItem, std::allocator<edm::ProductRegistryHelper::TypeLabelItem> > > const&, __gnu_cxx::__normal_iterator<edm::ProductRegistryHelper::TypeLabelItem const*, std::vector<edm::ProductRegistryHelper::TypeLabelItem, std::allocator<edm::ProductRegistryHelper::TypeLabelItem> > > const&, edm::ModuleDescription const&, edm::ProductRegistry&, edm::ProductRegistryHelper*, bool) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#54 0x0000ffff8943f074 in edm::ProducerBase::registerProducts(edm::ProducerBase*, edm::ProductRegistry*, edm::ModuleDescription const&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#55 0x0000ffff894e7cf0 in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::registerProductsAndCallbacks(edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase> const*, edm::ProductRegistry*) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#56 0x0000ffff2ae3ed70 in edm::maker::ModuleHolderT<edm::stream::EDProducerAdaptorBase>::registerProductsAndCallbacks(edm::ProductRegistry*) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/pluginRecoBTagCombinedPlugins.so
#57 0x0000ffff894b4ebc in edm::Maker::makeModule(edm::MakeModuleParams const&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&) const () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#58 0x0000ffff893fece0 in edm::Factory::makeModule(edm::MakeModuleParams const&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&) const () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#59 0x0000ffff89410e9c in edm::ModuleRegistry::getModule(edm::MakeModuleParams const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&, edm::signalslot::Signal<void (edm::ModuleDescription const&)>&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#60 0x0000ffff894b7a4c in edm::WorkerRegistry::getWorker(edm::WorkerParams const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#61 0x0000ffff894b57cc in edm::WorkerManager::getWorker(edm::ParameterSet&, edm::ProductRegistry&, edm::PreallocationConfiguration const*, std::shared_ptr<edm::ProcessConfiguration const>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#62 0x0000ffff894b667c in edm::WorkerManager::addToUnscheduledWorkers(edm::ParameterSet&, edm::ProductRegistry&, edm::PreallocationConfiguration const*, std::shared_ptr<edm::ProcessConfiguration>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#63 0x0000ffff89495034 in edm::StreamSchedule::StreamSchedule(std::shared_ptr<edm::TriggerResultInserter>, std::vector<edm::propagate_const<std::shared_ptr<edm::PathStatusInserter> >, std::allocator<edm::propagate_const<std::shared_ptr<edm::PathStatusInserter> > > >&, std::vector<edm::propagate_const<std::shared_ptr<edm::EndPathStatusInserter> >, std::allocator<edm::propagate_const<std::shared_ptr<edm::EndPathStatusInserter> > > >&, std::shared_ptr<edm::ModuleRegistry>, edm::ParameterSet&, edm::service::TriggerNamesService const&, edm::PreallocationConfiguration const&, edm::ProductRegistry&, edm::BranchIDListHelper&, edm::ExceptionToActionTable const&, std::shared_ptr<edm::ActivityRegistry>, std::shared_ptr<edm::ProcessConfiguration>, bool, edm::StreamID, edm::ProcessContext const*) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#64 0x0000ffff89476e34 in edm::Schedule::Schedule(edm::ParameterSet&, edm::service::TriggerNamesService const&, edm::ProductRegistry&, edm::BranchIDListHelper&, edm::ThinnedAssociationsHelper&, edm::SubProcessParentageHelper const*, edm::ExceptionToActionTable const&, std::shared_ptr<edm::ActivityRegistry>, std::shared_ptr<edm::ProcessConfiguration>, bool, edm::PreallocationConfiguration const&, edm::ProcessContext const*) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#65 0x0000ffff89486a24 in edm::ScheduleItems::initSchedule(edm::ParameterSet&, bool, edm::PreallocationConfiguration const&, edm::ProcessContext const*) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#66 0x0000ffff8939e494 in edm::EventProcessor::init(std::shared_ptr<edm::ProcessDesc>&, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#67 0x0000ffff893a00d4 in edm::EventProcessor::EventProcessor(std::shared_ptr<edm::ProcessDesc>, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) () from /home/dsr/CMSSW_11_3_X_2021-02-05-2300/lib/cc8_aarch64_gcc9/libFWCoreFramework.so
#68 0x000000000040f4b8 in tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>::operator()() const ()
#69 0x0000ffff874bbb10 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffc6321320, d=warning: RTTI symbol not found for class 'tbb::interface7::internal::delegated_function<main::{lambda()#1}::operator()() const::{lambda()#1} const, void>'
...) at ../../src/tbb/arena.cpp:1105
#70 0x00000000004103ec in main::{lambda()#1}::operator()() const ()
#71 0x000000000040ee3c in main ()
(gdb) 
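
For reference, the gdb values above can be reproduced by hand. Type=261 is R_AARCH64_PREL32 in the AArch64 ELF ABI, a 32-bit PC-relative relocation. The sketch below recomputes Result and applies a range check modeled on the linked assertion; the exact bounds used here (INT32_MIN to UINT32_MAX) and the function names are assumptions for illustration, not a copy of the LLVM source.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Value + Addend - FinalAddress, computed in uint64_t like the `Result` gdb printed.
constexpr std::uint64_t prelResult(std::uint64_t value, std::int64_t addend,
                                   std::uint64_t finalAddress) {
  return value + static_cast<std::uint64_t>(addend) - finalAddress;
}

// True if the signed interpretation of result fits a 32-bit PC-relative field.
// The accepted range is an assumption modeled on the assertion linked above.
constexpr bool fitsPrel32(std::uint64_t result) {
  return static_cast<std::int64_t>(result) >= std::numeric_limits<std::int32_t>::min() &&
         static_cast<std::int64_t>(result) <=
             static_cast<std::int64_t>(std::numeric_limits<std::uint32_t>::max());
}

constexpr std::uint64_t kValue = 0xfffe5e94726cULL;
constexpr std::uint64_t kFinalAddress = 0xffff20001470ULL;

// Same bit pattern as $5 in the gdb session.
static_assert(prelResult(kValue, 0, kFinalAddress) == 0xffffffff3e945dfcULL, "Result");
// Read as signed, that is -0xc16ba204 (about -3 GiB), below INT32_MIN.
static_assert(!fitsPrel32(prelResult(kValue, 0, kFinalAddress)), "out of range");
```

So the target sits roughly 3 GiB below the fixup, which is more than a signed 32-bit displacement can express.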

vgvassilev commented 3 years ago

@dan131riley, thanks for the ping.

Unfortunately we are not in a very favorable position. We know ROOT has some JIT issues on arm (cc: @axel-naumann). I found this issue submitted here dotnet/runtime#46881 which hints at two things. First, it seems that it is not due to our particular JIT setup and second, this will still persist in ROOT after the llvm-9 upgrade.

We should try fixing it ourselves or try to work around it. Do we need dictionary support for FindUsingAdvance? If not, we can try removing it and see if we live another day.

On Mon I will contact the llvm JIT people to seek more guidance.

Axel-Naumann commented 3 years ago

@vgvassilev :

vgvassilev commented 3 years ago

@Axel-Naumann, indeed worth trying.

@dan131riley, can you test:

diff --git a/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp b/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp
index 43b37154b5..93cabf7073 100644
--- a/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp
+++ b/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp
@@ -57,7 +57,7 @@ CreateHostTargetMachine(const clang::CompilerInstance& CI) {

 // We have to use large code model for PowerPC64 because TOC and text sections
 // can be more than 2GB apart.
-#if defined(__powerpc64__) || defined(__PPC64__)
+#if defined(__powerpc64__) || defined(__PPC64__) || defined(__aarch64__)
   CodeModel::Model CMModel = CodeModel::Large;
 #else
   CodeModel::Model CMModel = CodeModel::JITDefault;

smuzaffar commented 3 years ago

thanks @vgvassilev , I am testing the suggested change here https://github.com/cms-sw/root/pull/150

dan131riley commented 3 years ago

Unfortunately, that doesn't fix it. I still get assertion failures in a debug build (where I verified that the model was set to CodeModel::Large), and crashes in a release build:

#5  0x0000ffff2000009c in ?? ()
#6  0x0000ffff30dd4650 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Dump of assembler code from 0xffff20000090 to 0xffff20000108:
   0x0000ffff20000090:  stp x29, x30, [sp, #-16]!
   0x0000ffff20000094:  mov x29, sp
   0x0000ffff20000098:  adrp    x8, 0x100001f924000
   0x0000ffff2000009c:  ldr d0, [x8, #88]
   0x0000ffff200000a0:  ldr d1, [x0]
   0x0000ffff200000a4:  fneg    d1, d1
   0x0000ffff200000a8:  fdiv    d0, d1, d0
   0x0000ffff200000ac:  bl  0xffff200000d0
   0x0000ffff200000b0:  adrp    x8, 0x100001f924000
   0x0000ffff200000b4:  ldr d1, [x8, #72]
   0x0000ffff200000b8:  adrp    x8, 0x100001f924000
   0x0000ffff200000bc:  ldr d2, [x8, #80]
   0x0000ffff200000c0:  fmul    d0, d0, d2
   0x0000ffff200000c4:  fadd    d0, d0, d1
   0x0000ffff200000c8:  ldp x29, x30, [sp], #16
   0x0000ffff200000cc:  ret

dan131riley commented 3 years ago

@Axel-Naumann, indeed worth trying.

I think that patch doesn't actually change anything...for aarch64, the target description sets CodeModel::Large if it was set to CodeModel::JITDefault here:

https://github.com/root-project/root/blob/6923ab3b730c4bb9c840de87b564654a72603763/interpreter/llvm/src/lib/Target/AArch64/MCTargetDesc/AArch64MCTargetDesc.cpp#L78-L98

Axel-Naumann commented 3 years ago

Thank you for trying this out, Dan - this serves as input to Vassil's discussion with the JIT expert (Lang). (FYI Vassil, my current hypothesis is that this is a relocation that is meant to happen within the same code segment, explaining the smaller reloc size, but where our JIT has split relocation and target into different segments.)

dan131riley commented 3 years ago

All the crashes I've seen follow the same pattern of an adrp/ldr as the first or nearly the first thing in the function, including the simplest example

   0x0000ffff40000078:  adrp    x8, 0x1000035166000
   0x0000ffff4000007c:  ldr d0, [x8, #72]
   0x0000ffff40000080:  ret

where the source for that routine is just a floating point constant.

I'm not very familiar with the LLVM structure, but from what I've looked at I'm guessing these come from AArch64FastISel::materializeFP() when it materializes a floating point constant from the constant pool. If that's correct, then the problem is the placement of the constant pool, which seems to end up a little too far away, or a little too far away and in the wrong direction, while the codegen looks to assume that the constant pool is close by.

vgvassilev commented 3 years ago

@dan131riley, since you seem to be having fun with llvm ;) -- can you also dump the relevant llvm::Module? I wonder if we can convert it into something standalone and run it through lli and reproduce the issue in isolation. That'd make it easier to understand and fix.

lhames commented 3 years ago

@dan131riley

I'm not very familiar with the LLVM structure, but from what I've looked at I'm guessing these come from AArch64FastISel::materializeFP() when it materializes a floating point constant from the constant pool. If that's correct, then the problem is the placement of the constant pool, which seems to end up a little too far away, or a little too far away and in the wrong direction, while the codegen looks to assume that the constant pool is close by.

The ADRP/LDR sequence should be able to reach +/-4GB from the fixup page. If the memory manager were allocating sections independently I would expect occasional crashes when sections connected by fixups happen to be allocated out-of-range. As a stop-gap solution to this the RuntimeDyld::MemoryManager interface provides the needsToReserveAllocationSpace and reserveAllocationSpace methods. @vgvassilev pointed me to https://github.com/root-project/root/commit/a7b0b3e647409c7510b38198b08ff94fd079f857 -- It looks like that was attempting to implement those methods to address a similar problem, but I'm not sure it went far enough:

  void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
                              uintptr_t RODataSize, uint32_t RODataAlign,
                              uintptr_t RWDataSize, uint32_t RWDataAlign) override {
    m_Code.allocate(getExeMM(),CodeSize, CodeAlign, true, false);
    m_ROData.allocate(getExeMM(),RODataSize, RODataAlign, false, true);
    m_RWData.allocate(getExeMM(),RWDataSize, RWDataAlign, false, false);

    m_jit.m_SectionsAllocatedSinceLastLoad.insert(m_Code.m_Start);
    m_jit.m_SectionsAllocatedSinceLastLoad.insert(m_ROData.m_Start);
    m_jit.m_SectionsAllocatedSinceLastLoad.insert(m_RWData.m_Start);
  }

There are independent calls to some 'allocate' function here: Either a slab large enough to accommodate all JIT'd memory was allocated up front (in which case implementing reserveAllocationSpace was redundant), or these are still separate allocation calls under the hood, in which case you probably still risk having them allocated out-of-range.

A canonical reserveAllocationSpace call looks more like:

  void reserveAllocationSpace(uintptr_t CodeSize, uint32_t CodeAlign,
                              uintptr_t RODataSize, uint32_t RODataAlign,
                              uintptr_t RWDataSize, uint32_t RWDataAlign) override {
    size_t TotalSize =
      computeRequiredSize(CodeSize, CodeAlign,
                          RODataSize, RODataAlign,
                          RWDataSize, RWDataAlign);
    CurrentSlab = reserve(TotalSize);
  }

Then in allocateCodeSection / allocateDataSection you would return pointers into CurrentSlab.
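As a rough illustration of the slab idea, the sketch below computes where the three regions would sit inside one contiguous reservation. `SlabLayout`, `layoutSlab`, `alignUp` and `maxOf` are hypothetical names, not LLVM or ROOT APIs, and the sketch assumes power-of-two alignments and a slab base that is aligned to `BaseAlign`. It deliberately ignores page permissions, which a real memory manager would have to handle per region.

```cpp
#include <cstdint>

// Round Value up to a multiple of Align (Align must be a power of two).
constexpr std::uint64_t alignUp(std::uint64_t Value, std::uint64_t Align) {
  return (Value + Align - 1) & ~(Align - 1);
}

constexpr std::uint64_t maxOf(std::uint64_t A, std::uint64_t B) {
  return A > B ? A : B;
}

// Offsets of the three regions inside a single slab (hypothetical type).
struct SlabLayout {
  std::uint64_t BaseAlign;    // alignment the slab base must satisfy
  std::uint64_t CodeOffset;
  std::uint64_t RODataOffset;
  std::uint64_t RWDataOffset;
  std::uint64_t TotalSize;    // bytes to reserve in one allocation
};

// Place code, read-only data and read-write data back to back.
constexpr SlabLayout layoutSlab(std::uint64_t CodeSize, std::uint64_t CodeAlign,
                                std::uint64_t RODataSize, std::uint64_t RODataAlign,
                                std::uint64_t RWDataSize, std::uint64_t RWDataAlign) {
  SlabLayout L{};
  L.BaseAlign = maxOf(CodeAlign, maxOf(RODataAlign, RWDataAlign));
  L.CodeOffset = 0;  // code starts at the (suitably aligned) slab base
  L.RODataOffset = alignUp(L.CodeOffset + CodeSize, RODataAlign);
  L.RWDataOffset = alignUp(L.RODataOffset + RODataSize, RWDataAlign);
  L.TotalSize = L.RWDataOffset + RWDataSize;
  return L;
}
```

With a layout like this, the section allocation callbacks only hand out `Base + Offset` pointers, so every section of one object stays within `TotalSize` bytes of the others.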

FWIW the memory management APIs were redesigned to address this issue in JITLink (LLVM's new JIT linker). The JITLinkMemoryManager interface requires all sections and sizes to be passed in one allocation call, making slab allocation for each object the natural default. JITLink also range checks all allocations and issues runtime errors with clean termination: You would have seen a "relocation target out of range" error with details on the target and fixup location, even in release builds.

There is no JITLink implementation for ELF / aarch64 yet, but we're not far off having one. JITLink for ELF / x86-64 is maturing quickly and aarch64 is the next natural target. This may make life easier in the future.

vgvassilev commented 3 years ago

Thanks @lhames for the detailed explanation!

I am adding @pcanal who implemented this as part of a fix for ROOT-8523.

pcanal commented 3 years ago

@lhames's description makes sense to me too, and the code does seem to be an improvement. I am not sure, though, whether this would be enough to address the current issue (which admittedly I do not understand well). In the issue I addressed (if I recall correctly), the problem was mostly about the contiguity of the code section. ... reading further ... I see "the problem is the placement of the constant pool, which seems to end up a little too far away, or a little too far away and in the wrong direction" ... so indeed @lhames's further improvement would solve this.

dan131riley commented 3 years ago

The routine RPCSimSetUp::setRPCSetUp does a tremendous amount of output formatting which is then never seen because the resulting string is passed to LogDebug. See

https://github.com/cms-sw/cmssw/blob/5b54e3a1fc64b1d9764a31ae73226f8f67428f52/SimMuon/RPCDigitizer/src/RPCSimSetUp.cc#L106

I made PR #33071 to #ifdef EDM_ML_DEBUG all the stringstream operations.
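For readers unfamiliar with the pattern, here is a minimal sketch of guarding debug-only formatting, assuming (as the comment above implies) that the formatted string is only ever consumed when EDM_ML_DEBUG is defined. `formatNoiseDump` is a hypothetical stand-in and is not the code in RPCSimSetUp.cc.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a setup routine that builds a debug-only dump.
// When EDM_ML_DEBUG is not defined, no stringstream work is done at all.
std::string formatNoiseDump(const std::vector<double> &Noise) {
#ifdef EDM_ML_DEBUG
  std::ostringstream Out;
  for (double Value : Noise)
    Out << Value << ' ';
  return Out.str();
#else
  (void)Noise;  // Silence the unused-parameter warning in non-debug builds.
  return std::string();
#endif
}
```

The point of the guard is that the formatting cost disappears from normal builds rather than being paid and then thrown away.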

vgvassilev commented 3 years ago

Would that fix the original problem -- https://github.com/root-project/root/pull/7419

hahnjo commented 3 years ago

Would that fix the original problem -- root-project/root#7419

I don't see why it should: The problem started before the LLVM upgrade and the linked PR disables GlobalISel which became enabled as a by-product of the upgrade. I think the initial issue described here is real and happens due to "circumstances" that make a code section more than 4Gb of virtual address space away from the data. Which is not something we can really avoid unless we statically pre-allocate memory for all sections that are ever going to be emitted by JIT. The nicer solution would be if AArch64 supports a relocation that can reference memory across the entire address space...

dan131riley commented 3 years ago

I agree that it's unlikely to help. What needs to be done is implement the suggestion by @lhames to allocate the data and code in one allocation so they are guaranteed to be close by. I don't know if this would solve every use case, but I'm confident it would fix all the CMS ones that I'm aware of. Is this on anyone's todo list?

vgvassilev commented 3 years ago

Thanks for the explanation @hahnjo!

@dan131riley, I am not aware of this being on anybody’s workplan.

lhames commented 3 years ago

Which is not something we can really avoid unless we statically pre-allocate memory for all sections that are ever going to be emitted by JIT.

In the newer (ORCv2) JIT design there are a couple of ways to approach this problem (and I'm posting here even though you're not on ORCv2 yet, since it touches on relevant topics):

  1. If you can guarantee that there are no JIT'd external references to variables with hidden visibility:

In this case you can safely allocate on a per-object basis even with the small code model. References to variables outside the current object will go via a GOT entry (automatically optimized to direct reference if the external variable ends up being in-range of the JIT'd code), and calls to externals will go via a jump stub (automatically bypassed if the call target ends up being in-range).

  2. If you can't guarantee that there are no JIT'd external references to variables with hidden visibility (the general case):

In this case we're allowed to assume that the variable will be in the same JITDylib as the reference, which has different implications for address ranges depending on the code model:

Small code model: The code generator is allowed to assume that all references between code and data within the JITDylib can be expressed with direct PC-relative addressing. To satisfy this assumption the client must reserve sufficient address space (and you only need to reserve the address space, you can attach actual memory to it later as needed) for all JIT'd code and data on a per-JITDylib basis up-front. On a 64-bit system this is probably practical. On a 32-bit one it's harder and may become a serious constraint, especially if more than one JITDylib is required.

Large code model: The code generator cannot assume anything about the address-range of references between code and data within a JITDylib. All loads go via a GOT (or by splatting an immediate into register and loading from that), and all calls are typically indirect via a register. This saves the client from reserving address space up-front, at the cost of some runtime performance (due to the indirection and its potential impacts on prediction and cache performance), and some link-time performance (more relocations may be required).

Large code model is a requirement for MCJIT / ORCv1, since RuntimeDyld never fully dealt with this problem, or implemented the full set of relocations required to support the small code model on key platforms (e.g. arm64, x86-64).
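The "reserve the address space, attach actual memory later" step from the small code model case can be sketched with POSIX calls. This is only an illustration under assumptions: it targets Linux (MAP_NORESERVE and MAP_ANONYMOUS), the helper names are hypothetical, and it is not how ORCv2 or ROOT actually manage memory.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Reserve Bytes of contiguous address space without backing memory.
// PROT_NONE makes every access fault until a sub-range is committed.
inline void *reserveAddressRange(std::size_t Bytes) {
  void *Base = mmap(nullptr, Bytes, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return Base == MAP_FAILED ? nullptr : Base;
}

// Make a page-aligned sub-range of the reservation readable and writable.
inline bool commitRange(void *Addr, std::size_t Bytes) {
  return mprotect(Addr, Bytes, PROT_READ | PROT_WRITE) == 0;
}

// Give the whole reservation back to the operating system.
inline bool releaseAddressRange(void *Base, std::size_t Bytes) {
  return munmap(Base, Bytes) == 0;
}
```

Because every later allocation is carved out of the one reserved range, code and data for the JITDylib stay within a known distance of each other regardless of what else the process maps in the meantime.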

The nicer solution would be if AArch64 supports a relocation that can reference memory across the entire address space...

In ORCv2 this is satisfied by the large code model above. Unfortunately MCJIT / ORCv1 adds an extra twist: Even in the large code model you can usually assume that local code and data within an object file are within range of one another, but the separation of allocateCodeSection / allocateDataSection in RuntimeDyld's memory manager makes it possible (if you don't pre-reserve space for the whole object) to allocate code and data for a single object out-of-range of one another. Using the reserveAllocationSpace trick fixes this for MCJIT / ORCv1.

hahnjo commented 3 years ago

Unfortunately MCJIT / ORCv1 adds an extra twist: Even in the large code model you can usually assume that local code and data within an object file are within range of one another, but the separation of allocateCodeSection / allocateDataSection in RuntimeDyld's memory manager makes it possible (if you don't pre-reserve space for the whole object) to allocate code and data for a single object out-of-range of one another. Using the reserveAllocationSpace trick fixes this for MCJIT / ORCv1.

But it does not for incremental JITting, right? I'm thinking about declaring a large array in the first module that is referenced by code in the second object. Then we need to support arbitrary relocations into the entire address space (unless I'm missing something here).

Axel-Naumann commented 3 years ago

Thanks a lot @lhames for this explanation!

FYI @dan131riley: @hahnjo will be looking into the object pre-reservation described by @lhames - hoping we can come up with a way to make it work (see his comment above). And IIUC @vgvassilev will tackle the upgrade to ORCv2 this year.

lhames commented 3 years ago

But it does not for incremental JITting, right? I'm thinking about declaring a large array in the first module that is referenced by code in the second object. Then we need to support arbitrary relocations into the entire address space (unless I'm missing something here).

You do need arbitrary relocations into the entire address space to solve this, but it turns out that we already generate them for data references (even in small code model) because regular dynamic linking introduces the same class of problem that you're describing: When you see a declaration like extern int X; in C code there's nothing to tell the compiler/codegen whether X will eventually be part of the same library, or will come from some other dynamic library / shared object. For that reason, even in the small code model, codegen will generate a sequence like this:

movq X@GOT(%rip), %rax    ; Materialize address of X into %rax by loading from a GOT entry
movl (%rax), %eax         ; Indirectly load actual value of X (from address in %rax)

At link time if X turns out to be defined in your library then the static linker can rewrite this sequence to:

leaq X(%rip), %rax        ; PC relative address calculation for X (fast)
movl (%rax), %eax         ; Indirectly load actual value of X (from address in %rax)

On the other hand if X turns out not to be defined in your library then the linker synthesizes a GOT (Global Offset Table) entry pointing to X (and a dynamic fixup to patch that entry up at load time), and then you just load the address from the table entry.
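The linker's choice between the two sequences can be sketched as a simple range test. `chooseAccess` and `AccessKind` are hypothetical names for illustration only; the sketch assumes the x86-64 rule that a RIP-relative operand is a signed 32-bit displacement measured from the address of the next instruction.

```cpp
#include <cstdint>

// How the linker ends up addressing X (hypothetical enum).
enum class AccessKind { Direct, ViaGOT };

// Decide whether X at TargetAddr can be addressed directly from an
// instruction whose successor starts at NextInstrAddr, or whether the
// reference has to stay indirect through a GOT entry.
constexpr AccessKind chooseAccess(std::uint64_t NextInstrAddr,
                                  std::uint64_t TargetAddr) {
  // Signed 32-bit displacement: [-2^31, 2^31 - 1], checked without overflow.
  const bool InRange = TargetAddr >= NextInstrAddr
                           ? (TargetAddr - NextInstrAddr) <= 0x7fffffffULL
                           : (NextInstrAddr - TargetAddr) <= 0x80000000ULL;
  return InRange ? AccessKind::Direct : AccessKind::ViaGOT;
}
```

The GOT entry itself holds a full 64-bit address, which is why the indirect form works for any placement of X.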

The new JIT linker knows all these tricks (both GOT synthesis and how to optimize for in-range targets). The caveat that I drew attention to above is hidden externs. For a hidden extern under the small code model codegen is allowed to generate:

movl X(%rip), %eax        ; Directly load X (PC-relative)

There's simply no way to rewrite that to make it safe if X is out-of-range of a PC-relative reference from the movl instruction. That's why extern hiddens require you to preallocate address space slabs for whole JITDylibs at a time.

In any case the advice in ORCv2 is to pre-reserve address ranges if possible: It makes hidden externals work, but also guarantees that the range-based optimizations will always fire. If pre-reserving ranges is not possible that's ok too, but in that case you can't use hidden externals (If you do and they're allocated out of range you'll at least get a clean "out-of-range" error from the JIT linker, but the danger is that you'll get lucky a lot of the time and things will silently work right up until your luck runs out, probably in the middle of some critical job).

hahnjo commented 3 years ago

Okay, before continuing the technical discussion, I'd like to take a moment to make sure that everybody is talking about the same problem (because it looks like at least I didn't). I was able to produce a crash on AArch64 Linux with the following two lines in the interactive root interpreter:

root [0] void *ptr = malloc(4L << 30);
root [1] ROOT::RDataFrame(1).Define("x0", "42").Define("x1", "42").Count().GetValue()
Backtrace of the crash with debug information

```
#0 0x0000ffffbdd05238 in raise () from /lib64/libc.so.6
#1 0x0000ffffbdd068b0 in abort () from /lib64/libc.so.6
#2 0x0000ffffbdcfe72c in __assert_fail_base () from /lib64/libc.so.6
#3 0x0000ffffbdcfe7e4 in __assert_fail () from /lib64/libc.so.6
#4 0x0000ffffb6fd33a0 in llvm::RuntimeDyldELF::resolveAArch64Relocation (this=0x1fb7270, Section=..., Offset=7120, Value=281473584855448, Type=261, Addend=0) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:400
#5 0x0000ffffb6fd51dc in llvm::RuntimeDyldELF::resolveRelocation (this=0x1fb7270, Section=..., Offset=7120, Value=281473584855448, Type=261, Addend=0, SymOffset=0, SectionID=41) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:937
#6 0x0000ffffb6fd5118 in llvm::RuntimeDyldELF::resolveRelocation (this=0x1fb7270, RE=..., Value=281473584855448) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:920
#7 0x0000ffffb6fb9ea8 in llvm::RuntimeDyldImpl::resolveRelocationList (this=0x1fb7270, Relocs=..., Value=281473584855448) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1052
#8 0x0000ffffb6fb60ac in llvm::RuntimeDyldImpl::resolveLocalRelocations (this=0x1fb7270) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:152
#9 0x0000ffffb6fb5ee0 in llvm::RuntimeDyldImpl::resolveRelocations (this=0x1fb7270) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:135
#10 0x0000ffffb6fbb370 in llvm::RuntimeDyld::resolveRelocations (this=0x10aa540) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1346
#11 0x0000ffffb6fbb450 in llvm::RuntimeDyld::finalizeWithMemoryManagerLocking (this=0x10aa540) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1364
#12 0x0000ffffb5746160 in llvm::orc::LegacyRTDyldObjectLinkingLayer::ConcreteLinkedObject >::finalize (this=0x1c9b8f0) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:255
#13 0x0000ffffb574635c in llvm::orc::LegacyRTDyldObjectLinkingLayer::ConcreteLinkedObject >::getSymbolMaterializer(std::string)::{lambda()#1}::operator()() const (__closure=0x1d9d100) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:276
#14 0x0000ffffb57475a4 in std::_Function_handler (), llvm::orc::LegacyRTDyldObjectLinkingLayer::ConcreteLinkedObject >::getSymbolMaterializer(std::string)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/4.8.2/functional:2057
#15 0x0000ffffb572348c in std::function ()>::operator()() const (this=0xffffffffc228) at /usr/include/c++/4.8.2/functional:2471
#16 0x0000ffffb57225ac in llvm::JITSymbol::getAddress (this=0xffffffffc228) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/include/llvm/ExecutionEngine/JITSymbol.h:297
#17 0x0000ffffb5739d90 in llvm::orc::LazyEmittingLayer >::EmissionDeferredModule::find(llvm::StringRef, bool, llvm::orc::LegacyIRCompileLayer&)::{lambda()#1}::operator()() const (__closure=0x11ae0d0) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/LazyEmittingLayer.h:68
#18 0x0000ffffb5741098 in std::_Function_handler (), llvm::orc::LazyEmittingLayer >::EmissionDeferredModule::find(llvm::StringRef, bool, llvm::orc::LegacyIRCompileLayer&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/4.8.2/functional:2057
#19 0x0000ffffb572348c in std::function ()>::operator()() const (this=0xffffffffc378) at /usr/include/c++/4.8.2/functional:2471
#20 0x0000ffffb57225ac in llvm::JITSymbol::getAddress (this=0xffffffffc378) at /home/sftnight/build/JONAS/root.src/interpreter/llvm/src/include/llvm/ExecutionEngine/JITSymbol.h:297
#21 0x0000ffffb5722c9c in cling::IncrementalJIT::getSymbolAddress (this=0x4cd930, Name="_ZN11__cling_N5115__cling_Un1Qu31EPv", AlsoInProcess=false) at /home/sftnight/build/JONAS/root.src/interpreter/cling/lib/Interpreter/IncrementalJIT.h:191
#22 0x0000ffffb572522c in cling::IncrementalExecutor::jitInitOrWrapper (this=0x4cf3d0, funcname=..., fun=@0xffffffffc488: 0x4b66a0) at /home/sftnight/build/JONAS/root.src/interpreter/cling/lib/Interpreter/IncrementalExecutor.h:275
#23 0x0000ffffb5721ab8 in cling::IncrementalExecutor::executeWrapper (this=0x4cf3d0, function=..., returnValue=0xffffffffc860) at /home/sftnight/build/JONAS/root.src/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:372
#24 0x0000ffffb560a6d0 in cling::Interpreter::RunFunction (this=0x4b65e0, FD=0xce3f78, res=0xffffffffc860) at /home/sftnight/build/JONAS/root.src/interpreter/cling/lib/Interpreter/Interpreter.cpp:1139
#25 0x0000ffffb560b074 in cling::Interpreter::EvaluateInternal (this=0x4b65e0, input="#line 1 \"ROOT_prompt_1\"\nROOT::RDataFrame(1).Define(\"x0\", \"42\").Define(\"x1\", \"42\").Count().GetValue()", CO=..., V=0xffffffffc860, wrapPoint=44) at /home/sftnight/build/JONAS/root.src/interpreter/cling/lib/Interpreter/Interpreter.cpp:1389
#26 0x0000ffffb56095fc in cling::Interpreter::process (this=0x4b65e0, input="#line 1 \"ROOT_prompt_1\"\nROOT::RDataFrame(1).Define(\"x0\", \"42\").Define(\"x1\", \"42\").Count().GetValue()", V=0xffffffffc860, T=0x0, disableValuePrinting=false) at /home/sftnight/build/JONAS/root.src/interpreter/cling/lib/Interpreter/Interpreter.cpp:817
#27 0x0000ffffb581c440 in cling::MetaProcessor::process (this=0x992650, input_line=..., compRes=@0xffffffffcb4c: cling::Interpreter::kSuccess, result=0xffffffffc860, disableValuePrinting=false) at /home/sftnight/build/JONAS/root.src/interpreter/cling/lib/MetaProcessor/MetaProcessor.cpp:342
#28 0x0000ffffb5405024 in HandleInterpreterException (metaProcessor=0x992650, input_line=0xd01b20 "#line 1 \"ROOT_prompt_1\"\nROOT::RDataFrame(1).Define(\"x0\", \"42\").Define(\"x1\", \"42\").Count().GetValue()", compRes=@0xffffffffcb4c: cling::Interpreter::kSuccess, result=0xffffffffc860) at /home/sftnight/build/JONAS/root.src/core/metacling/src/TCling.cxx:2427
#29 0x0000ffffb5405a38 in TCling::ProcessLine (this=0x4b57a0, line=0xd01aa0 "#line 1 \"ROOT_prompt_1\"\nROOT::RDataFrame(1).Define(\"x0\", \"42\").Define(\"x1\", \"42\").Count().GetValue()", error=0xffffffffce3c) at /home/sftnight/build/JONAS/root.src/core/metacling/src/TCling.cxx:2587
#30 0x0000ffffbe39cce4 in TApplication::ProcessLine (this=0x49d020, line=0xd01aa0 "#line 1 \"ROOT_prompt_1\"\nROOT::RDataFrame(1).Define(\"x0\", \"42\").Define(\"x1\", \"42\").Count().GetValue()", sync=false, err=0xffffffffce3c) at /home/sftnight/build/JONAS/root.src/core/base/src/TApplication.cxx:1472
#31 0x0000ffffbe76c178 in TRint::ProcessLineNr (this=0x49d020, filestem=0xffffbe77aa38 "ROOT_prompt_", line=0xcde530 "ROOT::RDataFrame(1).Define(\"x0\", \"42\").Define(\"x1\", \"42\").Count().GetValue()", error=0xffffffffce3c) at /home/sftnight/build/JONAS/root.src/core/rint/src/TRint.cxx:751
#32 0x0000ffffbe76ba64 in TRint::HandleTermInput (this=0x49d020) at /home/sftnight/build/JONAS/root.src/core/rint/src/TRint.cxx:612
#33 0x0000ffffbe7694e4 in TTermInputHandler::Notify (this=0x9922f0) at /home/sftnight/build/JONAS/root.src/core/rint/src/TRint.cxx:132
#34 0x0000ffffbe76db34 in TTermInputHandler::ReadNotify (this=0x9922f0) at /home/sftnight/build/JONAS/root.src/core/rint/src/TRint.cxx:124
#35 0x0000ffffbe51c94c in TUnixSystem::CheckDescriptors (this=0x443960) at /home/sftnight/build/JONAS/root.src/core/unix/src/TUnixSystem.cxx:1322
#36 0x0000ffffbe51bdd4 in TUnixSystem::DispatchOneEvent (this=0x443960, pendingOnly=false) at /home/sftnight/build/JONAS/root.src/core/unix/src/TUnixSystem.cxx:1077
#37 0x0000ffffbe4105ac in TSystem::InnerLoop (this=0x443960) at /home/sftnight/build/JONAS/root.src/core/base/src/TSystem.cxx:404
#38 0x0000ffffbe410360 in TSystem::Run (this=0x443960) at /home/sftnight/build/JONAS/root.src/core/base/src/TSystem.cxx:354
#39 0x0000ffffbe39d6ac in TApplication::Run (this=0x49d020, retrn=false) at /home/sftnight/build/JONAS/root.src/core/base/src/TApplication.cxx:1624
#40 0x0000ffffbe76aeec in TRint::Run (this=0x49d020, retrn=false) at /home/sftnight/build/JONAS/root.src/core/rint/src/TRint.cxx:463
#41 0x0000000000400c24 in main (argc=1, argv=0xfffffffff378) at /home/sftnight/build/JONAS/root.src/main/src/rmain.cxx:30
```

I think this backtrace matches https://github.com/cms-sw/cmssw/issues/31123#issuecomment-778318184 and my understanding is that it happens when we're loading libROOTDataFrame.so and its dependencies. Maybe @Axel-Naumann can confirm (or knows how to)? @dan131riley do you think this matches what is happening in CMSSW? (loading your own libraries, of course)

dan131riley commented 3 years ago

I think this backtrace matches #31123 (comment) and my understanding is that it happens when we're loading libROOTDataFrame.so and its dependencies.

That looks like the same issue, but it isn't during library loading, it's in the process of JITting

ROOT::RDataFrame(1).Define("x0", "42").Define("x1", "42").Count().GetValue()

which is all under

#26 0x0000ffffb56095fc in cling::Interpreter::process (this=0x4b65e0, input="#line 1 \"ROOT_prompt_1\"\nROOT::RDataFrame(1).Define(\"x0\", \"42\").Define(\"x1\", \"42\").Count().GetValue()", V=0xffffffffc860, T=0x0,                     
    disableValuePrinting=false) at /home/sftnight/build/JONAS/root.src/interpreter/cling/lib/Interpreter/Interpreter.cpp:817                                                                                                                

My understanding is that, when that line is JITted, the memory for the constants and the code are allocated separately, while the compiler and runtime loader are assuming that the constants will be near the code--often I believe the constants will be allocated just before the code that references them. Allocating the code and constants independently can violate the assumption of addressability.

lhames commented 3 years ago

My understanding is that, when that line is JITted, the memory for the constants and the code are allocated separately, while the compiler and runtime loader are assuming that the constants will be near the code--often I believe the constants will be allocated just before the code that references them. Allocating the code and constants independently can violate the assumption of addressability.

That sounds right to me, and fits neatly with the backtrace @hahnjo linked. The discussion above is relevant -- the code examples were x86-64, but equivalent sequences and constraints exist for aarch64.

hahnjo commented 3 years ago

Right, I got confused by the auto-loading / -parsing frames in the original backtrace. However, I'm still not sure that the issue is really about constants and global data accesses - all these seem to respect the large code model that makes no assumptions about addressability.

I took a closer look at the crash from my previous comment, and I'm 99% sure that this is coming from a relocation in .eh_frame (allocated as data section) to a code section that is more than 4Gb away. I think there are two roads here:

  1. Find out whether .eh_frame can cope with relocations across the entire address space and if so, why LLVM doesn't use that in the large code model.
  2. Implement Lang's proposal from https://github.com/cms-sw/cmssw/issues/31123#issuecomment-780200564 - however do I understand correctly that this requires Cling to do the dance of setting the page permission bits correctly? (currently handled by SectionMemoryManager)

hahnjo commented 3 years ago

  1. Find out whether .eh_frame can cope with relocations across the entire address space and if so, why LLVM doesn't use that in the large code model.

Well, LLVM does for ppc64 and x86_64 but the cases for aarch64 were missing. This was fixed by https://github.com/llvm/llvm-project/commit/18805ea951be02fcab6e7b11c3c7d929bcf1441a upstream and I've prepared a backport in https://github.com/root-project/root/pull/7563. This at least fixes the case I posted in https://github.com/cms-sw/cmssw/issues/31123#issuecomment-800450831. @dan131riley I would be super grateful if you could apply these two lines and test on your side (edit: or generally after the upgrade to LLVM 9, just to make sure it's not already fixed deeper down in the stack).
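The difference between a 32-bit and a 64-bit PC-relative field in .eh_frame can be illustrated with a small sketch. The helper names are hypothetical, and the accepted 32-bit window (-2^31 <= X < 2^32) is an assumption based on the overflow rule the AArch64 ELF ABI gives for 32-bit PC-relative data relocations; Type=261 in frame #4 of the backtrace appears to correspond to that relocation kind (R_AARCH64_PREL32).

```cpp
#include <cstdint>

// 32-bit PC-relative field: X = Target - Place must fit the assumed window
// -2^31 <= X < 2^32, otherwise the relocation cannot be applied.
constexpr bool fitsPcRel32(std::uint64_t Place, std::uint64_t Target) {
  return Target >= Place ? (Target - Place) < (1ULL << 32)
                         : (Place - Target) <= (1ULL << 31);
}

// 64-bit PC-relative field: the difference is stored modulo 2^64, so any
// pair of addresses can be encoded and decoded again.
constexpr std::uint64_t encodePcRel64(std::uint64_t Place, std::uint64_t Target) {
  return Target - Place;  // unsigned arithmetic wraps by definition
}

constexpr std::uint64_t decodePcRel64(std::uint64_t Place, std::uint64_t Encoded) {
  return Place + Encoded;
}
```

Under these assumptions, a data section placed more than 4 GB away from the code it describes fails `fitsPcRel32`, while the 64-bit form round-trips for any placement.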

mrodozov commented 3 years ago

@hahnjo I can't reproduce your example https://github.com/cms-sw/cmssw/issues/31123#issuecomment-800450831. Can you describe in a bit more detail what machine you used and which ROOT version? I tried 6.22 and master on Arm.

hahnjo commented 3 years ago

@hahnjo I can't reproduce your example #31123 (comment). Can you describe in a bit more detail what machine you used and which ROOT version? I tried 6.22 and master on Arm.

Sure: I was building ROOT master on techlab-arm64-thunderx2-01 in full Debug mode (-DCMAKE_BUILD_TYPE=Debug -DLLVM_BUILD_TYPE=Debug) in order to get all asserts. It doesn't really matter if you can reproduce my example (I mean, I made sure that this one is fixed), but the important data point would be if that also fixes the crash in CMSSW that @dan131riley was able to produce. If the crash is still there, it must be a different problem and I have to dig deeper (with an improved example for testing).

mrodozov commented 3 years ago

Right, thank you for the clarification. I'm using a similar machine (probably almost the same), so I'm going to change the flags and retry, however ...

In CMSSW we are using ROOT 6.22, not master (currently this one, to be specific), and I tried to reproduce your PR from master on 6.22 here: https://github.com/cms-sw/root/commit/fae0f05c92383de4b8d98856444436d1234a8b78 If you confirm this is how the backport should look, I'll merge the change and build an integration build (IB) release so that this change is available in a release for convenient use.

hahnjo commented 3 years ago

@mrodozov the backport seems fine, but do you need a full release to test if the crash is gone? I had hoped that you have local development builds of CMS-SW, plus you really need a Debug build in order to see the assert. (I plan to backport this to v6-22-patches once it lands in ROOT master, but I wanted confirmation that it really fixes the issue)

dan131riley commented 3 years ago

@hahnjo I do have a local dev area with a Debug ROOT that I can hook into a recent CMSSW IB, hopefully will get to trying your patch sometime today.

vgvassilev commented 3 years ago

@hahnjo, the backport you suggest is in any case good to have. Why don’t we just go ahead and merge it and then cmssw can pick up the new master and will get an answer probably by tomorrow?

mrodozov commented 3 years ago

We are building this release anyway; it's just a minor edit for the Arm build and a few hours earlier. It's not only to check whether the example crash is gone: I want to see if it fixes pieces in CMSSW. A release with debug flags for ROOT will also be helpful.

dan131riley commented 3 years ago

It looks like the patch by @hahnjo does fix the problem. I ran around a hundred test jobs using the backport by @mrodozov in a Debug build, with no assertion failures or crashes observed.

hahnjo commented 3 years ago

It looks like the patch by @hahnjo does fix the problem. I ran around a hundred test jobs using the backport by @mrodozov in a Debug build, with no assertion failures or crashes observed.

Thanks for testing! I backported the fix for AArch64 and a similar commit for PowerPC to 6.22 in https://github.com/root-project/root/pull/7607 and the fix will also be included in 6.24. In case you see the issue come back in later testing, please ping me :smiley:

mrodozov commented 3 years ago

@dan131riley would you please share with us which workflows you ran as examples that didn't crash after the last ROOT change, or maybe a workflow that you see in the IBs that doesn't fail anymore?