cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Probably thread related crashes in aarch64 IBs #31123

Closed Dr15Jones closed 3 years ago

Dr15Jones commented 4 years ago

After switching to run the IB RelVals using multiple threads, we are seeing 'random' crashes in the aarch64 builds.

dan131riley commented 3 years ago

@dan131riley would you please share with us which workflows you ran as examples that didn't crash after the last ROOT change, or maybe a workflow that you see in the IBs that doesn't fail anymore?

I tested with 4.62 and 136.776. I don't see any TFormula related crashes in those specific workflows since the update, but we are still seeing lots of crashes in closely adjacent workflows. It isn't immediately obvious to me whether my tests were flawed or those specific workflows were somehow fixed, but it's apparent the overall problem is not fixed. It will likely be a few days before I can take a closer look.

hahnjo commented 3 years ago

@mrodozov @dan131riley if you have another failing test, please ping me on the stack trace and I can take a look.

dan131riley commented 3 years ago

@mrodozov 136.776 hasn't crashed for me at all, and 10809.0 crashes much less frequently, so I believe a bug was fixed, but there still seems to be a problem. I did get 4.62 to crash with a debug build.

@hahnjo with a debug build, we're now getting an assertion failure for a different reloc type in the same routine:

(gdb) up 9
#9  0x0000ffffa4ceca80 in llvm::RuntimeDyldELF::resolveAArch64Relocation (this=0xffff7b056800, Section=..., Offset=56, Value=281471755485400, Type=275, Addend=0) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:400
400     assert(isInt<33>(Result) && "overflow check failed for relocation");
(gdb) p/x Value
$1 = 0xffff400000d8
(gdb) p/x Addend
$2 = 0x0
(gdb) p/x FinalAddress
$3 = 0xfffe389c2980
(gdb) p/x Result
$4 = 0x10763e000
(gdb) up 20
#29 0x0000ffffa3620d98 in TClingCallFunc::compile_wrapper (this=0xfffe17284800, wrapper_name="__cf_102", wrapper="#pragma clang diagnostic push\n#pragma clang diagnostic ignored \"-Wformat-security\"\n__attribute__((used)) extern \"C\" void __cf_102(void* obj, int nargs, void** args, void* ret)\n{\n   if (ret) {\n      ne"..., withAccessControl=true) at /home/dsr/root/core/metacling/src/TClingCallFunc.cxx:267
267    return fInterp->compileFunction(wrapper_name, wrapper, false /*ifUnique*/,
(gdb) print -elements unlimited -- wrapper
$6 = "#pragma clang diagnostic push\n#pragma clang diagnostic ignored \"-Wformat-security\"\n__attribute__((used)) extern \"C\" void __cf_102(void* obj, int nargs, void** args, void* ret)\n{\n   if (ret) {\n      new (ret) (float) (((const reco::GsfElectron*)obj)->closestCtfTrackNormChi2());\n      return;\n   }\n   else {\n      ((const reco::GsfElectron*)obj)->closestCtfTrackNormChi2();\n      return;\n   }\n}\n#pragma clang diagnostic pop"

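For readability, the wrapper string printed above can be expanded into ordinary C++. The following is a minimal stand-alone sketch, not the code Cling actually compiles: `GsfElectronStandIn` is a hypothetical stand-in for `reco::GsfElectron`, the function is renamed `cf_102_sketch`, the null check on `obj` is an addition, and the pragmas and the `used` attribute are dropped so that it builds outside of Cling. It only illustrates the calling convention visible in the string: the object arrives as `void*`, and the result is placement-constructed into `ret` when the caller supplies storage.

```cpp
#include <cassert>
#include <new>

// Hypothetical stand-in for reco::GsfElectron; only the accessor named
// in the wrapper string is modelled.
struct GsfElectronStandIn {
  float chi2;
  float closestCtfTrackNormChi2() const { return chi2; }
};

// Same shape as the generated wrapper: obj is the object, args the (unused)
// argument array, ret the caller-provided storage for the result (may be null).
extern "C" void cf_102_sketch(void* obj, int nargs, void** args, void* ret) {
  (void)nargs;
  (void)args;
  assert(obj != nullptr);  // defensive check added for this sketch only
  if (ret) {
    // Construct the float result directly in the caller's buffer.
    new (ret) float(((const GsfElectronStandIn*)obj)->closestCtfTrackNormChi2());
    return;
  } else {
    // No storage supplied: call the method and discard the value.
    ((const GsfElectronStandIn*)obj)->closestCtfTrackNormChi2();
    return;
  }
}
```

A caller would pass the address of a `float` as `ret` and read the value back afterwards. Nothing in the wrapper itself touches floating-point constants or relocations; the assertion fires later, when the JIT links the compiled object.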
stack trace:

cmsRun: /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:400: void llvm::RuntimeDyldELF::resolveAArch64Relocation(const llvm::SectionEntry&, uint64_t, uint64_t, uint32_t, int64_t): Assertion `isInt<33>(Result) && "overflow check failed for relocation"' failed.

#5  0x0000ffffb395bc1c in raise () from /lib64/libc.so.6
#6  0x0000ffffb39497a8 in abort () from /lib64/libc.so.6
#7  0x0000ffffb39552e8 in __assert_fail_base () from /lib64/libc.so.6
#8  0x0000ffffb3955350 in __assert_fail () from /lib64/libc.so.6
#9  0x0000ffffa4ceca80 in llvm::RuntimeDyldELF::resolveAArch64Relocation (this=0xffff7b056800, Section=..., Offset=56, Value=281471755485400, Type=275, Addend=0) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:400
#10 0x0000ffffa4ceee90 in llvm::RuntimeDyldELF::resolveRelocation (this=0xffff7b056800, Section=..., Offset=56, Value=281471755485400, Type=275, Addend=0, SymOffset=0, SectionID=62) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:895
#11 0x0000ffffa4ceed54 in llvm::RuntimeDyldELF::resolveRelocation (this=0xffff7b056800, RE=..., Value=281471755485400) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:877
#12 0x0000ffffa4ccc5c0 in llvm::RuntimeDyldImpl::resolveRelocationList (this=0xffff7b056800, Relocs=..., Value=281471755485400) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:957
#13 0x0000ffffa4cc86d0 in llvm::RuntimeDyldImpl::resolveRelocations (this=0xffff7b056800) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:145
#14 0x0000ffffa4ccd1f8 in llvm::RuntimeDyld::resolveRelocations (this=0xffffca10ebd8) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1140
#15 0x0000ffffa4ccd2e4 in llvm::RuntimeDyld::finalizeWithMemoryManagerLocking (this=0xffffca10ebd8) at /home/dsr/root/interpreter/llvm/src/lib/ExecutionEngine/RuntimeDyld/RuntimeDyld.cpp:1158
#16 0x0000ffffa37f3fc0 in llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}::operator()(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>) const (__closure=0xffff79f43720, H=Python Exception <type 'exceptions.ValueError'> Cannot find type llvm::orc::RTDyldObjectLinkingLayerBase::ObjHandleT::_Node: 
, RTDyld=..., ObjToLoad=std::shared_ptr<class llvm::object::OwningBinary<llvm::object::ObjectFile>> (use count 1, weak count 0) = {...}, LOSHandleLoad=...) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:274
#17 0x0000ffffa38027e8 in llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager>, std::shared_ptr<llvm::JITSymbolResolver>, llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}>::finalize() (this=0xfffe1896aae0) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:143
#18 0x0000ffffa3802870 in llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager>, std::shared_ptr<llvm::JITSymbolResolver>, llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}>::getSymbolMaterializer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda()#1}::operator()() const (this=0xfffe1896aae0) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h:158
#19 0x0000ffffa38030cc in std::_Function_handler<llvm::Expected<unsigned long> (), llvm::orc::RTDyldObjectLinkingLayer::ConcreteLinkedObject<std::shared_ptr<llvm::RuntimeDyld::MemoryManager>, std::shared_ptr<llvm::JITSymbolResolver>, llvm::orc::RTDyldObjectLinkingLayer::addObject(std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> >, std::shared_ptr<llvm::JITSymbolResolver>)::{lambda(std::_List_iterator<std::unique_ptr<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject, std::default_delete<llvm::orc::RTDyldObjectLinkingLayerBase::LinkedObject> > >, llvm::RuntimeDyld&, std::shared_ptr<llvm::object::OwningBinary<llvm::object::ObjectFile> > const&, std::function<void ()>)#1}>::getSymbolMaterializer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /cvmfs/cms-ib.cern.ch/nweek-02672/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:286
#20 0x0000ffffa37e71a8 in std::function<llvm::Expected<unsigned long> ()>::operator()() const (this=0xffffca10ed80) at /cvmfs/cms-ib.cern.ch/nweek-02672/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:688
#21 0x0000ffffa37e63d4 in llvm::JITSymbol::getAddress (this=0xffffca10ed80) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/JITSymbol.h:201
#22 0x0000ffffa37f9200 in llvm::orc::LazyEmittingLayer<llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler> >::EmissionDeferredModule::find(llvm::StringRef, bool, llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler>&)::{lambda()#1}::operator()() const (this=0xffff79f4a080) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/Orc/LazyEmittingLayer.h:75
#23 0x0000ffffa37fd850 in std::_Function_handler<llvm::Expected<unsigned long> (), llvm::orc::LazyEmittingLayer<llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler> >::EmissionDeferredModule::find(llvm::StringRef, bool, llvm::orc::IRCompileLayer<cling::IncrementalJIT::RemovableObjectLinkingLayer, llvm::orc::SimpleCompiler>&)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /cvmfs/cms-ib.cern.ch/nweek-02672/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:286
#24 0x0000ffffa37e71a8 in std::function<llvm::Expected<unsigned long> ()>::operator()() const (this=0xffffca10eed8) at /cvmfs/cms-ib.cern.ch/nweek-02672/cc8_aarch64_gcc9/external/gcc/9.3.0/include/c++/9.3.0/bits/std_function.h:688
#25 0x0000ffffa37e63d4 in llvm::JITSymbol::getAddress (this=0xffffca10eed8) at /home/dsr/root/interpreter/llvm/src/include/llvm/ExecutionEngine/JITSymbol.h:201
#26 0x0000ffffa37e6780 in cling::IncrementalJIT::getSymbolAddress (this=0xffff60416c00, Name="__cf_102", AlsoInProcess=false) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalJIT.h:194
#27 0x0000ffffa37e5de8 in cling::IncrementalExecutor::getPointerToGlobalFromJIT (this=0xffff605b1420, GV=...) at /home/dsr/root/interpreter/cling/lib/Interpreter/IncrementalExecutor.cpp:379
#28 0x0000ffffa36d8460 in cling::Interpreter::compileFunction (this=0xffff62523600, name=..., code=..., ifUnique=false, withAccessControl=true) at /home/dsr/root/interpreter/cling/lib/Interpreter/Interpreter.cpp:1292
#29 0x0000ffffa3620d98 in TClingCallFunc::compile_wrapper (this=0xfffe17284800, wrapper_name="__cf_102", wrapper="#pragma clang diagnostic push\n#pragma clang diagnostic ignored \"-Wformat-security\"\n__attribute__((used)) extern \"C\" void __cf_102(void* obj, int nargs, void** args, void* ret)\n{\n   if (ret) {\n      ne"..., withAccessControl=true) at /home/dsr/root/core/metacling/src/TClingCallFunc.cxx:267
#30 0x0000ffffa3623728 in TClingCallFunc::make_wrapper (this=0xfffe17284800) at /home/dsr/root/core/metacling/src/TClingCallFunc.cxx:1117
#31 0x0000ffffa36279c4 in TClingCallFunc::IFacePtr (this=0xfffe17284800) at /home/dsr/root/core/metacling/src/TClingCallFunc.cxx:2301
#32 0x0000ffffa34fb568 in TCling::CallFunc_IFacePtr (this=0xffff60416880, func=0xfffe17284800) at /home/dsr/root/core/metacling/src/TCling.cxx:7882
#33 0x0000ffffb587de10 in edm::FunctionWithDict::FunctionWithDict(TMethod*) () from /home/dsr/CMSSW_11_2_1/lib/cc8_aarch64_gcc9/libFWCoreReflection.so
#34 0x0000ffffb5887970 in edm::TypeWithDict::functionMemberByName(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) const () from /home/dsr/CMSSW_11_2_1/lib/cc8_aarch64_gcc9/libFWCoreReflection.so
#35 0x0000ffff5283bc9c in reco::findMethod(edm::TypeWithDict const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::variant<signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, double, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::variant<signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, double, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, std::vector<std::variant<signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, double, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::variant<signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, double, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, char const*, int&) () from /home/dsr/CMSSW_11_2_1/lib/cc8_aarch64_gcc9/libCommonToolsUtils.so
#36 0x0000ffff5281ddd4 in reco::parser::MethodSetter::push(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::variant<signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, double, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::variant<signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, double, float, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, char const*, bool) const () from /home/dsr/CMSSW_11_2_1/lib/cc8_aarch64_gcc9/libCommonToolsUtils.so
#37 0x0000ffff5281f038 in reco::parser::MethodSetter::operator()(char const*, char const*) const () from /home/dsr/CMSSW_11_2_1/lib/cc8_aarch64_gcc9/libCommonToolsUtils.so

hahnjo commented 3 years ago

@dan131riley thanks for the detailed stack trace. I've started to take a look and will report back once I have a better understanding of what's going wrong.

hahnjo commented 3 years ago

Aha, so that's now the floating point constant pool not honoring the large code model. The reason I didn't see this before is that I've been testing with master and 6.24 with LLVM 9, which correctly disables the pool in the large model. Let me dig out the fix that went in between LLVM 5 (ROOT 6.22) and LLVM 9 (ROOT 6.24) and backport that to older versions of ROOT.

hahnjo commented 3 years ago

The last statement wasn't fully accurate: LLVM 9 got better at avoiding this situation by materializing FP constants in code more often, but the fallback path would still emit adrp + ldr, which can only address pools within +/- 4 GB. Furthermore, ROOT v6.24 seems to allocate JITted sections more closely in memory, so it is harder to violate that assumption (not sure if that is due to the upgrade to LLVM 9 or some other change). That said, the following reliably triggers a crash when executed interactively on a full Debug build:

root [0] malloc(4294967296L);
root [1] double f() { ROOT::RDataFrame(1).Define("x0", "42").Define("x1", "42").Count().GetValue(); return 200000.0; }
root [2] f()

For comparison, v6.22 (with LLVM 5) only takes the following to crash:

root [0] void *ptr = malloc(4294967296L)
(void *) 0xfffe90940010
root [1] double f() { return 200000.0; }
root [2] f()

I'm attempting to fix this issue upstream in LLVM and for ROOT master, and I've prepared an early backport for ROOT v6.22. If you still have cycles available, I would appreciate a test from your side in CMSSW (with the continued disclaimer that there might be more issues lurking around...)
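
The +/- 4 GB limit can be checked against the gdb values from the failing relocation quoted earlier in the thread. The sketch below is a stand-alone illustration rather than LLVM code: `pageDelta` and `fitsSigned33` are hypothetical helpers, and reading `Type=275` as the page-relative `adrp` relocation `R_AARCH64_ADR_PREL_PG_HI21` is an assumption based on the AArch64 ELF relocation numbering. It recomputes `Result` as the difference between the 4 KiB page of the target and the 4 KiB page of the instruction, then applies the signed 33-bit range test that the assertion names.

```cpp
#include <cstdint>

// Hypothetical helper: Page(target) - Page(place) with 4 KiB pages,
// i.e. the quantity an adrp instruction has to encode.
constexpr std::uint64_t pageDelta(std::uint64_t target, std::uint64_t place) {
  return (target & ~0xfffULL) - (place & ~0xfffULL);
}

// Hypothetical helper: true if the value, read as signed, lies in [-2^32, 2^32).
constexpr bool fitsSigned33(std::uint64_t raw) {
  return static_cast<std::int64_t>(raw) >= -(std::int64_t{1} << 32) &&
         static_cast<std::int64_t>(raw) < (std::int64_t{1} << 32);
}

// Values printed by gdb in frame #9 (Addend is 0, so the target is Value).
constexpr std::uint64_t kValue = 0xffff400000d8ULL;
constexpr std::uint64_t kFinalAddress = 0xfffe389c2980ULL;

// The recomputed delta matches the Result shown by gdb ...
static_assert(pageDelta(kValue, kFinalAddress) == 0x10763e000ULL,
              "page delta should match gdb's Result");
// ... and it is slightly more than 2^32, so the range check fails.
static_assert(!fitsSigned33(pageDelta(kValue, kFinalAddress)),
              "delta exceeds the adrp range");
// For comparison, the largest positive delta that still fits.
static_assert(fitsSigned33(0xffffffffULL), "2^32 - 1 is in range");
```

The range itself comes from the instruction encoding: `adrp` carries a signed 21-bit page count, which with 4 KiB pages covers 2^20 pages on either side of the instruction. Here the constant pool ended up roughly 0x763e000 bytes (about 118 MiB) beyond that window.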

dan131riley commented 3 years ago

@hahnjo So far I haven't seen any failures with that patch on the previous crashing workflow. @mrodozov can we get this into the cmssw root?

I believe one reason we see this crash so often is because we use jemalloc, which can be very aggressive about using address space. Crashes are much less frequent with the glibc malloc.

dan131riley commented 3 years ago

@mrodozov @smuzaffar Can we get https://github.com/root-project/root/pull/7758 merged into cms/v6-22-00-patches?

smuzaffar commented 3 years ago

@dan131riley, I am testing it here: https://github.com/cms-sw/root/pull/156. If no issues are found during the tests, I will include it in the next IB.

smuzaffar commented 3 years ago

@dan131riley, root-project/root#7758 is now integrated. It should be available in tonight's 23h00 IB.

dan131riley commented 3 years ago

The 23h00 IB seems to be taking a while for aarch64, but so far there are no TFormula crashes; all the crashes are in onnxruntime.

smuzaffar commented 3 years ago

Yes, we have issues with one of the ARM nodes (disk full), which is why the relval jobs crashed. We have restarted the jobs, but as we only have one ARM node now it will take some time.

hahnjo commented 3 years ago

The fix is now merged upstream in LLVM and in ROOT master as well as in the branches for 6.24, 6.22, and 6.20.

all the crashes are in onnxruntime.

@dan131riley does this also involve Cling or is this a separate issue?

dan131riley commented 3 years ago

@hahnjo The aarch64 IBs are still running slow, but it looks like the CMSSW_11_3 2021-04-07-2300 slc7_aarch64_gcc9 IB has finished, and I don't see any Cling-related crashes. There are lots of onnxruntime crashes, but those are unrelated to Cling and ROOT, and there's a separate issue for them at #32899.

The Cling crashes were common enough that one IB is enough to convince me that the problems have all been resolved and we can close this much-too-long ticket. Thanks!

makortel commented 3 years ago

+1

The TCling issue seems to be resolved with the last fix, so let's close this issue (and open new ones for possible other crashes).

slava77 commented 3 years ago

+reconstruction

based on https://github.com/cms-sw/cmssw/issues/31123#issuecomment-822669380

let's close this issue

civanch commented 3 years ago

+1

cmsbuild commented 3 years ago

This issue is fully signed and ready to be closed.