cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Probably thread related crashes in aarch64 IBs #31123

Closed Dr15Jones closed 3 years ago

Dr15Jones commented 4 years ago

After switching to run the IB RelVals using multiple threads, we are seeing 'random' crashes in the aarch64 builds.

cmsbuild commented 4 years ago

A new Issue was created by @Dr15Jones Chris Jones.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.


Dr15Jones commented 4 years ago

One such crash is https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/250207.18_NuGun_UP18+NuGun_UP18INPUT+DIGIPRMXUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25/step2_NuGun_UP18+NuGun_UP18INPUT+DIGIPRMXUP18_PU25+RECOPRMXUP18_PU25+HARVESTUP18_PU25.log#/

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Sun Aug  9 16:23:54 CEST 2020
Thread 12 (Thread 0xffff344f8460 (LWP 119100)):
#2  0x0000ffff978643b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff99ccedd0 in free@plt () from /lib64/libc.so.6
#5  0x0000ffff99cf7a60 in vfprintf () from /lib64/libc.so.6
#6  0x0000ffff99d2431c in vsnprintf () from /lib64/libc.so.6
#7  0x0000ffff9a048014 in std::__convert_from_v (__cloc=@0xffff344f6e38: 0xffff99e2f7a0 <_nl_C_locobj>, __out=__out@entry=0xffff344f6d80 "0.00345574O4\377\377", __size=__size@entry=45, __fmt=__fmt@entry=0xffff344f6e40 "%.*g") at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/aarch64-unknown-linux-gnu/bits/c++locale.h:92
#8  0x0000ffff9a072498 in std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double> (this=0xffff9a0fa510 <(anonymous namespace)::num_put_c>, __s=..., __io=..., __fill=32 ' ', __mod=<optimized out>, __v=0.0034557399339973927) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/ios_base.h:622
#9  0x0000ffff9a07cb98 in std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::put (__v=0.0034557399339973927, __fill=<optimized out>, __io=..., __s=..., this=0xffff9a0fa510 <(anonymous namespace)::num_put_c>) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/locale_facets.h:2437
#10 std::ostream::_M_insert<double> (this=0xffff344f7568, __v=0.0034557399339973927) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/ostream.tcc:73
#11 0x0000ffff354848c4 in RPCSimSetUp::setRPCSetUp(std::vector<RPCStripNoises::NoiseItem, std::allocator<RPCStripNoises::NoiseItem> > const&, std::vector<RPCClusterSize::ClusterSizeItem, std::allocator<RPCClusterSize::ClusterSizeItem> > const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimMuonRPCDigitizer.so
#12 0x0000ffff35467d1c in RPCDigiProducer::beginRun(edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimMuonRPCDigitizer.so
[cut]
Thread 11 (Thread 0xffff34f08460 (LWP 119099)):
#2  0x0000ffff978643b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff99cfd3a8 in __printf_fp_l () from /lib64/libc.so.6
#5  0x0000ffff99cfc998 in vfprintf () from /lib64/libc.so.6
#6  0x0000ffff99d2431c in vsnprintf () from /lib64/libc.so.6
#7  0x0000ffff9a048014 in std::__convert_from_v (__cloc=@0xffff34f06e38: 0xffff99e2f7a0 <_nl_C_locobj>, __out=__out@entry=0xffff34f06d80 "", __size=__size@entry=45, __fmt=__fmt@entry=0xffff34f06e40 "%.*g") at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/aarch64-unknown-linux-gnu/bits/c++locale.h:92
#8  0x0000ffff9a072498 in std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double> (this=0xffff9a0fa510 <(anonymous namespace)::num_put_c>, __s=..., __io=..., __fill=32 ' ', __mod=<optimized out>, __v=0.15190799534320831) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/ios_base.h:622
#9  0x0000ffff9a07cb98 in std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::put (__v=0.15190799534320831, __fill=<optimized out>, __io=..., __s=..., this=0xffff9a0fa510 <(anonymous namespace)::num_put_c>) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/locale_facets.h:2437
#10 std::ostream::_M_insert<double> (this=0xffff34f070d0, __v=0.15190799534320831) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/ostream.tcc:73
#11 0x0000ffff354846dc in RPCSimSetUp::setRPCSetUp(std::vector<RPCStripNoises::NoiseItem, std::allocator<RPCStripNoises::NoiseItem> > const&, std::vector<RPCClusterSize::ClusterSizeItem, std::allocator<RPCClusterSize::ClusterSizeItem> > const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimMuonRPCDigitizer.so
#12 0x0000ffff35467d1c in RPCDigiProducer::beginRun(edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimMuonRPCDigitizer.so
[cut]
Thread 10 (Thread 0xffff3bc78460 (LWP 62581)):
#2  0x0000ffff978643b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff99ccedd0 in free@plt () from /lib64/libc.so.6
#5  0x0000ffff99cf7a60 in vfprintf () from /lib64/libc.so.6
#6  0x0000ffff99d2431c in vsnprintf () from /lib64/libc.so.6
#7  0x0000ffff9a048014 in std::__convert_from_v (__cloc=@0xffff3bc76e38: 0xffff99e2f7a0 <_nl_C_locobj>, __out=__out@entry=0xffff3bc76d80 "0.0249435q\307;\377\377", __size=__size@entry=45, __fmt=__fmt@entry=0xffff3bc76e40 "%.*g") at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/aarch64-unknown-linux-gnu/bits/c++locale.h:92
#8  0x0000ffff9a072498 in std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double> (this=0xffff9a0fa510 <(anonymous namespace)::num_put_c>, __s=..., __io=..., __fill=32 ' ', __mod=<optimized out>, __v=0.024943500757217407) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/ios_base.h:622
#9  0x0000ffff9a07cb98 in std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::put (__v=0.024943500757217407, __fill=<optimized out>, __io=..., __s=..., this=0xffff9a0fa510 <(anonymous namespace)::num_put_c>) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/locale_facets.h:2437
#10 std::ostream::_M_insert<double> (this=0xffff3bc770d0, __v=0.024943500757217407) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/ostream.tcc:73
#11 0x0000ffff354846dc in RPCSimSetUp::setRPCSetUp(std::vector<RPCStripNoises::NoiseItem, std::allocator<RPCStripNoises::NoiseItem> > const&, std::vector<RPCClusterSize::ClusterSizeItem, std::allocator<RPCClusterSize::ClusterSizeItem> > const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimMuonRPCDigitizer.so
#12 0x0000ffff35467d1c in RPCDigiProducer::beginRun(edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimMuonRPCDigitizer.so
[cut]
Thread 1 (Thread 0xffff995d0000 (LWP 52173)):
#3  0x0000ffff978661cc in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000ffff99d2ac94 in _IO_str_init_static_internal () from /lib64/libc.so.6
#6  0x0000ffff99d242f8 in vsnprintf () from /lib64/libc.so.6
#7  0x0000ffff9a048014 in std::__convert_from_v (__cloc=@0xffffe9fc94b8: 0xffff99e2f7a0 <_nl_C_locobj>, __out=__out@entry=0xffffe9fc9400 "", __size=__size@entry=45, __fmt=__fmt@entry=0xffffe9fc94c0 "%.*g") at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/aarch64-unknown-linux-gnu/bits/c++locale.h:92
#8  0x0000ffff9a072498 in std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_float<double> (this=0xffff9a0fa510 <(anonymous namespace)::num_put_c>, __s=..., __io=..., __fill=32 ' ', __mod=<optimized out>, __v=0.032316599041223526) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/ios_base.h:622
#9  0x0000ffff9a07cb98 in std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::put (__v=0.032316599041223526, __fill=<optimized out>, __io=..., __s=..., this=0xffff9a0fa510 <(anonymous namespace)::num_put_c>) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/locale_facets.h:2437
#10 std::ostream::_M_insert<double> (this=0xffffe9fc9be8, __v=0.032316599041223526) at /home/cmsbld/jenkins_a/workspace/auto-builds/CMSSW_11_1_0_pre7-slc7_aarch64_gcc820/build/CMSSW_11_1_0_pre7-build/BUILD/slc7_aarch64_gcc820/external/gcc/8.4.0/gcc-8.4.0/obj/aarch64-unknown-linux-gnu/libstdc++-v3/include/bits/ostream.tcc:73
#11 0x0000ffff3548450c in RPCSimSetUp::setRPCSetUp(std::vector<RPCStripNoises::NoiseItem, std::allocator<RPCStripNoises::NoiseItem> > const&, std::vector<RPCClusterSize::ClusterSizeItem, std::allocator<RPCClusterSize::ClusterSizeItem> > const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimMuonRPCDigitizer.so
#12 0x0000ffff35467d1c in RPCDigiProducer::beginRun(edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimMuonRPCDigitizer.so
[cut]
Current Modules:

Module: RPCDigiProducer:simMuonRPCDigis (crashed)
Module: RPCDigiProducer:simMuonRPCDigis
Module: none
Module: RPCDigiProducer:simMuonRPCDigis
Module: RPCDigiProducer:simMuonRPCDigis

A fatal system signal has occurred: segmentation violation

This one fails while writing a numeric value to an ostream. The libstdc++ implementation calls the underlying vsnprintf, which is where the failure occurs.
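
For reference, here is a minimal sketch of the call path seen in frames #6 to #10: streaming a double with default flags ends in a `vsnprintf`-style call with the `"%.*g"` format. The helper names `viaStream` and `viaSnprintf` are made up for illustration. The 45-byte buffer and the format string are copied from the trace. The precision of 6 is an assumption, being the default stream precision.

```cpp
#include <cstdio>
#include <sstream>
#include <string>

// Format through the stream interface, the way the calling code does.
// (Hypothetical helper, not part of CMSSW.)
std::string viaStream(double value) {
  std::ostringstream os;  // default flags, default precision of 6
  os << value;
  return os.str();
}

// Format with the C call that the libstdc++ frames end up in.
// "%.*g" and the 45-byte buffer match what the trace shows for __fmt and __size.
std::string viaSnprintf(double value, int precision = 6) {
  char buffer[45];
  int written = std::snprintf(buffer, sizeof(buffer), "%.*g", precision, value);
  if (written < 0) {
    return std::string();  // encoding error reported by snprintf
  }
  return std::string(buffer);  // snprintf always null-terminates within the buffer
}
```

Both helpers only touch local state, so this pattern on its own should be safe to run from several threads at once, which suggests the fault lies elsewhere.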

Dr15Jones commented 4 years ago

assign core

cmsbuild commented 4 years ago

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 4 years ago

The routine RPCSimSetUp::setRPCSetUp does a tremendous amount of output formatting which is then never seen because the resulting string is passed to LogDebug. See

https://github.com/cms-sw/cmssw/blob/5b54e3a1fc64b1d9764a31ae73226f8f67428f52/SimMuon/RPCDigitizer/src/RPCSimSetUp.cc#L106
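
One way to avoid paying for formatting that is never shown is to build the message only when the debug output is going to be kept. The sketch below is purely illustrative: `DebugSink` and `describeNoise` are hypothetical names, not the CMSSW MessageLogger API and not the actual code in `RPCSimSetUp.cc`.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a debug sink; NOT the CMSSW MessageLogger.
struct DebugSink {
  bool enabled = false;
  int formatCalls = 0;                 // counts how often a message was built
  std::vector<std::string> messages;   // messages that were actually kept

  // Take a callable so the message is only built when it will be kept.
  template <typename MakeMessage>
  void log(MakeMessage&& makeMessage) {
    if (!enabled) {
      return;  // skip all formatting work when debug output is off
    }
    ++formatCalls;
    messages.push_back(makeMessage());
  }
};

// Hypothetical formatting routine standing in for the per-strip output.
std::string describeNoise(const std::vector<float>& noise) {
  std::ostringstream os;
  for (float value : noise) {
    os << value << ' ';
  }
  return os.str();
}
```

A caller would write something like `sink.log([&] { return describeNoise(noise); });`, so the stream formatting runs only when `sink.enabled` is true.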

Dr15Jones commented 4 years ago

Here is another crash: https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/250202.3_TTbar_13+TTbar_13INPUT+PREMIXUP15_PU25+DIGIPRMXLOCALUP15APVSimu_PU25+RECOPRMXUP15_PU25+HARVESTUP15_PU25/step4_TTbar_13+TTbar_13INPUT+PREMIXUP15_PU25+DIGIPRMXLOCALUP15APVSimu_PU25+RECOPRMXUP15_PU25+HARVESTUP15_PU25.log

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Sun Aug  9 17:42:11 CEST 2020
Thread 5 (Thread 0xffff35228460 (LWP 227755)):
#2  0x0000ffff93df43b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff206ba478 in MuonAssociatorByHitsHelper::getShared(std::vector<std::unique_ptr<std::pair<unsigned int, std::vector<std::pair<unsigned int, EncodedEventId>, std::allocator<std::pair<unsigned int, EncodedEventId> > > >, std::default_delete<std::pair<unsigned int, std::vector<std::pair<unsigned int, EncodedEventId>, std::allocator<std::pair<unsigned int, EncodedEventId> > > > > >, std::allocator<std::unique_ptr<std::pair<unsigned int, std::vector<std::pair<unsigned int, EncodedEventId>, std::allocator<std::pair<unsigned int, EncodedEventId> > > >, std::default_delete<std::pair<unsigned int, std::vector<std::pair<unsigned int, EncodedEventId>, std::allocator<std::pair<unsigned int, EncodedEventId> > > > > > > >&, __gnu_cxx::__normal_iterator<TrackingParticle const*, std::vector<TrackingParticle, std::allocator<TrackingParticle> > >) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimMuonMCTruth.so
#5  0x0000ffff206c0e78 in MuonAssociatorByHitsHelper::associateSimToRecoIndices(std::vector<std::pair<__gnu_cxx::__normal_iterator<TrackingRecHit* const*, std::vector<TrackingRecHit*, std::allocator<TrackingRecHit*> > >, __gnu_cxx::__normal_iterator<TrackingRecHit* const*, std::vector<TrackingRecHit*, std::allocator<TrackingRecHit*> > > >, std::allocator<std::pair<__gnu_cxx::__normal_iterator<TrackingRecHit* const*, std::vector<TrackingRecHit*, std::allocator<TrackingRecHit*> > >, __gnu_cxx::__normal_iterator<TrackingRecHit* const*, std::vector<TrackingRecHit*, std::allocator<TrackingRecHit*> > > > > > const&, edm::RefVector<std::vector<TrackingParticle, std::allocator<TrackingParticle> >, TrackingParticle, edm::refhelper::FindUsingAdvance<std::vector<TrackingParticle, std::allocator<TrackingParticle> >, TrackingParticle> > const&, MuonAssociatorByHitsHelper::Resources const&) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimMuonMCTruth.so
#6  0x0000ffff206af2e4 in MuonAssociatorByHits::associateSimToReco(edm::RefToBaseVector<reco::Track> const&, edm::RefVector<std::vector<TrackingParticle, std::allocator<TrackingParticle> >, TrackingParticle, edm::refhelper::FindUsingAdvance<std::vector<TrackingParticle, std::allocator<TrackingParticle> >, TrackingParticle> > const&, edm::Event const*, edm::EventSetup const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimMuonMCTruth.so
#7  0x0000ffff206b9a4c in MuonAssociatorByHits::associateSimToReco(edm::Handle<edm::View<reco::Track> >&, edm::Handle<std::vector<TrackingParticle, std::allocator<TrackingParticle> > >&, edm::Event const*, edm::EventSetup const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimMuonMCTruth.so
#8  0x0000ffff205840cc in MuonAssociatorEDProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimMuonMCTruthPlugins.so
[cut]
Thread 4 (Thread 0xffff35c38460 (LWP 227754)):
#0  0x0000ffff99264e24 in poll () from /lib64/libc.so.6
#1  0x0000ffff93df4a6c in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x0000ffff93df51cc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x0000ffff93df61cc in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000ffff200000b8 in ?? ()
#6  0x0000ffff35c37380 in ?? ()
#7  0x000c001200033160 in ?? ()
Thread 3 (Thread 0xffff36648460 (LWP 227753)):
#2  0x0000ffff93df43b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff99218e6c in _wordcopy_fwd_aligned () from /lib64/libc.so.6
#5  0x0000ffff99218d94 in memcpy () from /lib64/libc.so.6
#6  0x00000000004163e8 in void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) ()
#7  0x0000ffff9b2b6c30 in edm::Event::commit_aux(std::vector<edm::propagate_const<std::unique_ptr<edm::WrapperBase, std::default_delete<edm::WrapperBase> > >, std::allocator<edm::propagate_const<std::unique_ptr<edm::WrapperBase, std::default_delete<edm::WrapperBase> > > > >&, edm::Hash<5>*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#8  0x0000ffff9b2b6f0c in edm::Event::commit_(std::vector<unsigned int, std::allocator<unsigned int> > const&, edm::Hash<5>*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#9  0x0000ffff9b3f8004 in edm::stream::EDFilterAdaptorBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
[cut]
Thread 1 (Thread 0xffff98ab0000 (LWP 226821)):
#2  0x0000ffff93df43b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff3de26ee0 in Chi2MeasurementEstimator::estimate(TrajectoryStateOnSurface const&, TrackingRecHit const&) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libTrackingToolsKalmanUpdators.so
#5  0x0000ffff3df1d9d4 in TkStripMeasurementDet::simpleRecHits(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, std::vector<SiStripRecHit2D, std::allocator<SiStripRecHit2D> >&) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginRecoTrackerMeasurementDetPlugins.so
#6  0x0000ffff3df0f048 in void TkGluedMeasurementDet::doubleMatch<TkGluedMeasurementDet::HitCollectorForFastMeasurements>(TrajectoryStateOnSurface const&, MeasurementTrackerEvent const&, TkGluedMeasurementDet::HitCollectorForFastMeasurements&) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginRecoTrackerMeasurementDetPlugins.so
#7  0x0000ffff3df0bd2c in TkGluedMeasurementDet::measurements(TrajectoryStateOnSurface const&, MeasurementEstimator const&, MeasurementTrackerEvent const&, tracking::TempMeasurements&) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginRecoTrackerMeasurementDetPlugins.so
#8  0x0000ffff3ddd6548 in LayerMeasurements::groupedMeasurements(DetLayer const&, TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libTrackingToolsMeasurementDet.so
#9  0x0000fffdf15a1634 in TrajectorySegmentBuilder::segments(TrajectoryStateOnSurface) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginRecoTrackerCkfPatternPlugins.so
#10 0x0000fffdf158ddd0 in GroupedCkfTrajectoryBuilder::advanceOneLayer(TrajectorySeed const&, TempTrajectory&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginRecoTrackerCkfPatternPlugins.so
#11 0x0000fffdf158e8dc in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginRecoTrackerCkfPatternPlugins.so
#12 0x0000fffdf158f128 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginRecoTrackerCkfPatternPlugins.so
#13 0x0000fffdf2589464 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libRecoTrackerCkfPattern.so
[cut]
Current Modules:

Module: Type0PFMETcorrInputProducer:patPFMetT0Corr (crashed)
Module: CkfTrackCandidateMaker:convTrackCandidates
Module: none
Module: none
Module: none

The crash happened on thread 4, which has a 'corrupted' stack trace. I saw other RelVals that also crashed on a thread with a 'corrupted' stack trace.

Dr15Jones commented 4 years ago

Another RelVal with a corrupted stack is https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/250408.17_QCD_FlatPt_15_3000HS_13+FS_QCD_FlatPt_15_3000HS_13_PRMXUP17_PU50+HARVESTUP17FS+MINIAODMCUP17FS/step1_QCD_FlatPt_15_3000HS_13+FS_QCD_FlatPt_15_3000HS_13_PRMXUP17_PU50+HARVESTUP17FS+MINIAODMCUP17FS.log

which reports the following modules being run at the time of the crash

Module: PFProducer:particleFlowTmp (crashed)
Module: GsfElectronProducer:gedGsfElectronsTmp
Module: PFBlockProducer:particleFlowBlock
Module: none
Module: none

with the stack being

Thread 8 (Thread 0xfffee4e48460 (LWP 171263)):
#0  0x0000ffff8a5f4e24 in poll () from /lib64/libc.so.6
#1  0x0000ffff857d4a6c in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x0000ffff857d51cc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x0000ffff857d61cc in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000fffee00000ac in ?? ()
#6  0x0000fffee4e46520 in ?? ()
#7  0x0000fffee4e465f0 in ?? ()
#8  0x4019780aea94a1a7 in ?? ()

Dr15Jones commented 4 years ago

Another corrupted stack is from https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/27434.0_TTbar_14TeV+2026D58+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal/step3_TTbar_14TeV+2026D58+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal.log#/158-158

with running modules

Module: PFProducer:particleFlowTmpBarrel (crashed)
Module: TrackstersProducer:ticlTrackstersTrk
Module: none
Module: TrackstersProducer:ticlTrackstersHFNoseMIP
Module: none

Notice that the problem happens again in PFProducer.

Dr15Jones commented 4 years ago

Another corrupted stack is https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/25400.0_ZEE_13+FS_ZEE_13_UP15_PU25+HARVESTUP15FS+MINIAODMCUP15FS/step1_ZEE_13+FS_ZEE_13_UP15_PU25+HARVESTUP15FS+MINIAODMCUP15FS.log#/

with running modules

Module: PFProducer:particleFlowTmp (crashed)
Module: none
Module: TrackAssociatorEDProducer:trackingParticleRecoTrackAsssociation
Module: TrackAssociatorEDProducer:trackingParticleRecoTrackAsssociation
Module: none

Dr15Jones commented 4 years ago

Here is a crash within ROOT https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/23434.99_TTbar_14TeV+2026D49PU_PMXS1S2+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+HARVESTGlobalPU/step2_TTbar_14TeV+2026D49PU_PMXS1S2+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14INPUT+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+HARVESTGlobalPU.log

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Sun Aug  9 16:35:29 CEST 2020
Thread 14 (Thread 0xffff2f868460 (LWP 82656)):
#2  0x0000ffff929543b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff94ded18c in __exp_finite () from /lib64/libm.so.6
#5  0x0000ffff94dfc5dc in exp () from /lib64/libm.so.6
#6  0x0000ffff92289628 in CLHEP::RandPoissonQ::poissonDeviateSmall (mean=1.5101454679374475, e=0xffff51def5d0) at /home/cmsbld/jenkins_b/workspace/auto-builds/CMSSW_11_2_0_pre1-slc7_aarch64_gcc820/build/CMSSW_11_2_0_pre1-build/BUILD/slc7_aarch64_gcc820/external/clhep/2.4.1.3-ghbfee/clhep-2.4.1.3/Random/src/RandPoissonQ.cc:299
#7  CLHEP::RandPoissonQ::poissonDeviateSmall (e=0xffff51def5d0, mean=1.5101454679374475) at /home/cmsbld/jenkins_b/workspace/auto-builds/CMSSW_11_2_0_pre1-slc7_aarch64_gcc820/build/CMSSW_11_2_0_pre1-build/BUILD/slc7_aarch64_gcc820/external/clhep/2.4.1.3-ghbfee/clhep-2.4.1.3/Random/src/RandPoissonQ.cc:257
#8  0x0000ffff319cd3ac in EcalHitResponse::analogSignalAmplitude(DetId const&, double, CLHEP::HepRandomEngine*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimCalorimetryEcalSimAlgos.so
#9  0x0000ffff319cd514 in EcalHitResponse::putAnalogSignal(PCaloHit const&, CLHEP::HepRandomEngine*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimCalorimetryEcalSimAlgos.so
#10 0x0000ffff319cb4e0 in EcalTDigitizer<EBDigitizerTraits>::add(std::vector<PCaloHit, std::allocator<PCaloHit> > const&, int, CLHEP::HepRandomEngine*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimCalorimetryEcalSimAlgos.so
#11 0x0000ffff31a0ef28 in EcalDigiProducer::accumulateCaloHits(edm::Handle<std::vector<PCaloHit, std::allocator<PCaloHit> > > const&, edm::Handle<std::vector<PCaloHit, std::allocator<PCaloHit> > > const&, edm::Handle<std::vector<PCaloHit, std::allocator<PCaloHit> > > const&, int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimCalorimetryEcalSimProducers.so
#12 0x0000ffff31a1191c in EcalDigiProducer::accumulate(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimCalorimetryEcalSimProducers.so
#13 0x0000ffff363fa590 in edm::MixingModule::accumulateEvent(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#14 0x0000ffff363fa6dc in edm::MixingModule::pileAllWorkers(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#15 0x0000ffff36404798 in void edm::PileUp::readPileUp<std::_Bind<bool (edm::MixingModule::*(std::reference_wrapper<edm::MixingModule>, std::_Placeholder<1>, edm::ModuleCallingContext const*, int, std::_Placeholder<2>, int, std::reference_wrapper<edm::EventSetup const>, edm::StreamID))(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&)> >(edm::EventID const&, std::vector<edm::SecondaryEventIDAndFileInfo, std::allocator<edm::SecondaryEventIDAndFileInfo> >&, std::_Bind<bool (edm::MixingModule::*(std::reference_wrapper<edm::MixingModule>, std::_Placeholder<1>, edm::ModuleCallingContext const*, int, std::_Placeholder<2>, int, std::reference_wrapper<edm::EventSetup const>, edm::StreamID))(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&)>, int, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#16 0x0000ffff363fba18 in edm::MixingModule::doPileUp(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#17 0x0000ffff36341bc8 in edm::BMixingModule::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libMixingBase.so
[cut]
Thread 10 (Thread 0xffff32ef8460 (LWP 61441)):
#2  0x0000ffff929543b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff3dcceac0 in void std::vector<HepMC::GenParticle*, std::allocator<HepMC::GenParticle*> >::_M_realloc_insert<HepMC::GenParticle* const&>(__gnu_cxx::__normal_iterator<HepMC::GenParticle**, std::vector<HepMC::GenParticle*, std::allocator<HepMC::GenParticle*> > >, HepMC::GenParticle* const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimDataFormatsGeneratorProducts.so
#5  0x0000ffff3dcceab8 in hepmc_rootio::add_to_particles_in(HepMC::GenVertex*, HepMC::GenParticle*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libSimDataFormatsGeneratorProducts.so
#6  0x0000ffff95f17004 in int TStreamerInfo::ReadBufferArtificial<char**>(TBuffer&, char** const&, TStreamerElement*, int, int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libRIO.so
#7  0x0000ffff95fce9c0 in int TStreamerInfo::ReadBuffer<char**>(TBuffer&, char** const&, TStreamerInfo::TCompInfo* const*, int, int, int, int, int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libRIO.so
[cut]
#191 0x0000ffff964dd768 in TBranchElement::GetEntry(long long, int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libTree.so
#192 0x0000ffff362cf1bc in edm::RootTree::getEntry(TBranch*, long long) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginIOPoolInput.so
#193 0x0000ffff362a7728 in edm::RootDelayedReader::getProduct_(edm::BranchID const&, edm::EDProductGetter const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginIOPoolInput.so
#194 0x0000ffff96cf4744 in edm::DelayedReader::getProduct(edm::BranchID const&, edm::EDProductGetter const*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#195 0x0000ffff96daea38 in edm::InputProductResolver::resolveProduct_(edm::Principal const&, bool, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#196 0x0000ffff96d9df2c in edm::Principal::findProductByLabel(edm::KindOfType, edm::TypeID const&, edm::InputTag const&, edm::EDConsumerBase const*, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#197 0x0000ffff96d9e148 in edm::Principal::getByLabel(edm::KindOfType, edm::TypeID const&, edm::InputTag const&, edm::EDConsumerBase const*, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#198 0x0000ffff3b17d490 in cms::PileupVertexAccumulator::accumulate(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralPileupInformationPlugins.so
#199 0x0000ffff363fa590 in edm::MixingModule::accumulateEvent(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#200 0x0000ffff363fa6dc in edm::MixingModule::pileAllWorkers(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#201 0x0000ffff36404798 in void edm::PileUp::readPileUp<std::_Bind<bool (edm::MixingModule::*(std::reference_wrapper<edm::MixingModule>, std::_Placeholder<1>, edm::ModuleCallingContext const*, int, std::_Placeholder<2>, int, std::reference_wrapper<edm::EventSetup const>, edm::StreamID))(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&)> >(edm::EventID const&, std::vector<edm::SecondaryEventIDAndFileInfo, std::allocator<edm::SecondaryEventIDAndFileInfo> >&, std::_Bind<bool (edm::MixingModule::*(std::reference_wrapper<edm::MixingModule>, std::_Placeholder<1>, edm::ModuleCallingContext const*, int, std::_Placeholder<2>, int, std::reference_wrapper<edm::EventSetup const>, edm::StreamID))(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&)>, int, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#202 0x0000ffff363fba18 in edm::MixingModule::doPileUp(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#203 0x0000ffff36341bc8 in edm::BMixingModule::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libMixingBase.so
[cut]
Thread 9 (Thread 0xffff33908460 (LWP 61439)):
#0  0x0000ffff94c8a9c4 in nanosleep () from /lib64/libc.so.6
#1  0x0000ffff94c8a678 in sleep () from /lib64/libc.so.6
#2  0x0000ffff929543b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff968bded4 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, unsigned long>, std::_Select1st<std::pair<unsigned int const, unsigned long> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned long> > >::_M_erase(std::_Rb_tree_node<std::pair<unsigned int const, unsigned long> >*)@plt () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreMessageLogger.so
#5  0x0000ffff968cc8e8 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, unsigned long>, std::_Select1st<std::pair<unsigned int const, unsigned long> >, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned long> > >::_M_erase(std::_Rb_tree_node<std::pair<unsigned int const, unsigned long> >*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreMessageLogger.so
[cut]
#18 0x0000ffff3175c1f4 in void TrackingTruthAccumulator::accumulateEvent<PileUpEventPrincipal>(PileUpEventPrincipal const&, edm::EventSetup const&, edm::Handle<edm::HepMCProduct> const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralTrackingAnalysisPlugins.so
#19 0x0000ffff31755a20 in TrackingTruthAccumulator::accumulate(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralTrackingAnalysisPlugins.so
#20 0x0000ffff363fa590 in edm::MixingModule::accumulateEvent(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#21 0x0000ffff363fa6dc in edm::MixingModule::pileAllWorkers(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#22 0x0000ffff36404798 in void edm::PileUp::readPileUp<std::_Bind<bool (edm::MixingModule::*(std::reference_wrapper<edm::MixingModule>, std::_Placeholder<1>, edm::ModuleCallingContext const*, int, std::_Placeholder<2>, int, std::reference_wrapper<edm::EventSetup const>, edm::StreamID))(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&)> >(edm::EventID const&, std::vector<edm::SecondaryEventIDAndFileInfo, std::allocator<edm::SecondaryEventIDAndFileInfo> >&, std::_Bind<bool (edm::MixingModule::*(std::reference_wrapper<edm::MixingModule>, std::_Placeholder<1>, edm::ModuleCallingContext const*, int, std::_Placeholder<2>, int, std::reference_wrapper<edm::EventSetup const>, edm::StreamID))(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&)>, int, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#23 0x0000ffff363fba18 in edm::MixingModule::doPileUp(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#24 0x0000ffff36341bc8 in edm::BMixingModule::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libMixingBase.so
[cut]
Thread 1 (Thread 0xffff94500000 (LWP 202701)):
#0  0x0000ffff94cb4e24 in poll () from /lib64/libc.so.6
#1  0x0000ffff92954a6c in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x0000ffff929551cc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x0000ffff929561cc in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000ffff962d6720 in TThread::SelfId()@plt () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libThread.so
#6  0x0000ffff962ee7c0 in TThread::Tsd(void*, int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libThread.so
#7  0x0000ffff95938d18 in TClass::GetCollectionProxy() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libCore.so
#8  0x0000ffff95ed96c0 in int TStreamerInfoActions::ReadSTL<&TStreamerInfoActions::ReadSTLMemberWiseSameClass, &TStreamerInfoActions::ReadSTLObjectWiseFastArray>(TBuffer&, void*, TStreamerInfoActions::TConfiguration const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libRIO.so
#9  0x0000ffff95da63e4 in TBufferFile::ReadClassBuffer(TClass const*, void*, TClass const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libRIO.so
[cut]
/cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libTree.so
#138 0x0000ffff964dd658 in TBranchElement::GetEntry(long long, int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libTree.so
#139 0x0000ffff964dd768 in TBranchElement::GetEntry(long long, int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/external/slc7_aarch64_gcc820/lib/libTree.so
#140 0x0000ffff362cf1bc in edm::RootTree::getEntry(TBranch*, long long) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginIOPoolInput.so
#141 0x0000ffff362a7728 in edm::RootDelayedReader::getProduct_(edm::BranchID const&, edm::EDProductGetter const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginIOPoolInput.so
#142 0x0000ffff96cf4744 in edm::DelayedReader::getProduct(edm::BranchID const&, edm::EDProductGetter const*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#143 0x0000ffff96daea38 in edm::InputProductResolver::resolveProduct_(edm::Principal const&, bool, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#144 0x0000ffff96d9df2c in edm::Principal::findProductByLabel(edm::KindOfType, edm::TypeID const&, edm::InputTag const&, edm::EDConsumerBase const*, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#145 0x0000ffff96d9e148 in edm::Principal::getByLabel(edm::KindOfType, edm::TypeID const&, edm::InputTag const&, edm::EDConsumerBase const*, edm::SharedResourcesAcquirer*, edm::ModuleCallingContext const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#146 0x0000ffff3b17d490 in cms::PileupVertexAccumulator::accumulate(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralPileupInformationPlugins.so
#147 0x0000ffff363fa590 in edm::MixingModule::accumulateEvent(PileUpEventPrincipal const&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#148 0x0000ffff363fa6dc in edm::MixingModule::pileAllWorkers(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#149 0x0000ffff36404798 in void edm::PileUp::readPileUp<std::_Bind<bool (edm::MixingModule::*(std::reference_wrapper<edm::MixingModule>, std::_Placeholder<1>, edm::ModuleCallingContext const*, int, std::_Placeholder<2>, int, std::reference_wrapper<edm::EventSetup const>, edm::StreamID))(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&)> >(edm::EventID const&, std::vector<edm::SecondaryEventIDAndFileInfo, std::allocator<edm::SecondaryEventIDAndFileInfo> >&, std::_Bind<bool (edm::MixingModule::*(std::reference_wrapper<edm::MixingModule>, std::_Placeholder<1>, edm::ModuleCallingContext const*, int, std::_Placeholder<2>, int, std::reference_wrapper<edm::EventSetup const>, edm::StreamID))(edm::EventPrincipal const&, edm::ModuleCallingContext const*, int, int, int&, edm::EventSetup const&, edm::StreamID const&)>, int, edm::StreamID const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#150 0x0000ffff363fba18 in edm::MixingModule::doPileUp(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginSimGeneralMixingModulePlugins.so
#151 0x0000ffff36341bc8 in edm::BMixingModule::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libMixingBase.so

Current Modules:

Module: MixingModule:mix (crashed)
Module: MixingModule:mix
Module: none
Module: none
Module: MixingModule:mix

Here we have some incredibly deep stacks (because of ROOT IO), and the crash is in ROOT's thread-local handling.

Dr15Jones commented 4 years ago

This crash https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/10059.0_QCD_Pt_3000_3500_13+2017+QCD_Pt_3000_3500_13TeV_TuneCUETP8M1_GenSimINPUT+Digi+Reco+HARVEST+ALCA+Nano/step3_QCD_Pt_3000_3500_13+2017+QCD_Pt_3000_3500_13TeV_TuneCUETP8M1_GenSimINPUT+Digi+Reco+HARVEST+ALCA+Nano.log#/

did not generate a traceback, but the running modules were

Module: Type0PFMETcorrInputProducer:patPFMetT0Corr (crashed)
Module: RecoTauProducer:combinatoricRecoTausBoosted
Module: none
Module: LowPtGsfElectronSeedProducer:lowPtGsfElectronSeeds
Module: none

Again we see a crash happening in Type0PFMETcorrInputProducer.

Dr15Jones commented 4 years ago

Here in https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/2018.7_H125GGgluonfusion_13_UP18+H125GGgluonfusionFS_13_UP18+HARVESTUP18FS+MINIAODMCUP18FS/step1_H125GGgluonfusion_13_UP18+H125GGgluonfusionFS_13_UP18+HARVESTUP18FS+MINIAODMCUP18FS.log#/

we have another corrupted stack. This time the only module reported running is

Module: PFProducer:particleFlowTmp (crashed)

although the stack traces for the threads do show 3 other modules running.
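Because each report ends with a `Module: ... (crashed)` line, the reports can be tallied to see which modules recur. The following is a minimal sketch, assuming the log excerpts have been saved as plain text; the function name `tally_crashed_modules` and the sample input are illustrative and not part of any CMSSW tooling.

```python
import re
from collections import Counter

# Matches lines such as "Module: PFProducer:particleFlowTmp (crashed)".
CRASHED = re.compile(r"Module:\s+(\S+)\s+\(crashed\)")


def tally_crashed_modules(log_text: str) -> Counter:
    """Count how often each module label is flagged as crashed."""
    return Counter(CRASHED.findall(log_text))


if __name__ == "__main__":
    # Hypothetical input assembled from "Current Modules" blocks.
    sample = """\
Module: Type0PFMETcorrInputProducer:patPFMetT0Corr (crashed)
Module: RecoTauProducer:combinatoricRecoTausBoosted
Module: none
Module: PFProducer:particleFlowTmp (crashed)
"""
    for module, count in tally_crashed_modules(sample).most_common():
        print(f"{count:3d}  {module}")
```

Only lines carrying the `(crashed)` marker are counted, so modules that were merely running on other streams do not inflate the tally.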

Dr15Jones commented 4 years ago

Here in https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.898_RunParkingBPH2018B+RunParkingBPH2018B+HLTDR2_2018+RECODR2_2018reHLT_skimParkingBPH_Offline+HARVEST2018/step3_RunParkingBPH2018B+RunParkingBPH2018B+HLTDR2_2018+RECODR2_2018reHLT_skimParkingBPH_Offline+HARVEST2018.log#/

we have another corrupted stack with modules running:

Module: PFProducer:particleFlowTmp (crashed)
Module: TrackingRecoMaterialAnalyser:materialDumperAnalyzer
Module: TrackingMonitor:TrackerCollisionSelectedTrackMonCommonhighPurityPtRange0to1
Module: none
Module: TrackingMonitor:TrackerCollisionSelectedTrackMonCommonhighPurityPtRange0to1

Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.878_RunMuonEG2018C+RunMuonEG2018C+HLTDR2_2018+RECODR2_2018reHLT_skimMuonEG_Offline+HARVEST2018/step3_RunMuonEG2018C+RunMuonEG2018C+HLTDR2_2018+RECODR2_2018reHLT_skimMuonEG_Offline+HARVEST2018.log#/ also shows a corrupted stack with crash in

Module: PFProducer:particleFlowTmp (crashed)

Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.873_RunHLTPhy2018C+RunHLTPhy2018C+HLTDR2_2018+RECODR2_2018reHLT_Offline+HARVEST2018/step3_RunHLTPhy2018C+RunHLTPhy2018C+HLTDR2_2018+RECODR2_2018reHLT_Offline+HARVEST2018.log#/ doesn't have a stacktrace (it timed out) but shows the crash happened in

Module: PFProducer:particleFlowTmp (crashed)

Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.852_RunJetHT2018A+RunJetHT2018A+HLTDR2_2018+RECODR2_2018reHLT_skimJetHT_Offline+HARVEST2018/step3_RunJetHT2018A+RunJetHT2018A+HLTDR2_2018+RECODR2_2018reHLT_skimJetHT_Offline+HARVEST2018.log#/

didn't have a traceback (it says it timed out) and shows running modules as

Module: ShiftedParticleProducer:shiftedPatTauEnUp (crashed)
Module: RecoTauProducer:pfTausCombiner
Module: none
Module: none
Module: none

Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.842_RunDisplacedJet2017E+RunDisplacedJet2017E+HLTDR2_2017+RECODR2_2017reHLT_skimDisplacedJet_Prompt+HARVEST2017/step3_RunDisplacedJet2017E+RunDisplacedJet2017E+HLTDR2_2017+RECODR2_2017reHLT_skimDisplacedJet_Prompt+HARVEST2017.log#/ shows a corrupted stack with the crash in

Module: PFProducer:particleFlowTmp (crashed)

Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.826_RunMuOnia2017E+RunMuOnia2017E+HLTDR2_2017+RECODR2_2017reHLT_skimMuOnia_Prompt+HARVEST2017/step3_RunMuOnia2017E+RunMuOnia2017E+HLTDR2_2017+RECODR2_2017reHLT_skimMuOnia_Prompt+HARVEST2017.log#/ has a corrupted stack trace with crash in

Module: Type0PFMETcorrInputProducer:patPFMetT0Corr (crashed)
Module: BTagPerformanceAnalyzerOnData:bTagAnalysis
Module: MuonIdProducer:muons1stStep
Module: none
Module: PATTauProducer:patTaus

Dr15Jones commented 4 years ago

The crashes almost invariably happen during the first 4 events, so they are most likely related to something that happens the first time a piece of code is called.
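To illustrate what a "first call" problem can look like in general, here is a minimal sketch of state that is built lazily on first use and reached by several threads at once. It is not taken from CMSSW or ROOT and makes no claim about the actual cause of these crashes; `LazyTable` and `run_first_events` are hypothetical names. In C++ the equivalent protection would typically be `std::call_once` or a function-local static.

```python
import threading


class LazyTable:
    """Hypothetical stand-in for any state that is built on first use."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table = None
        self.build_count = 0  # how many times the expensive build ran

    def _build(self):
        self.build_count += 1
        return {i: i * i for i in range(1000)}

    def get(self):
        # Without the lock, two threads arriving during the first event could
        # both observe `_table is None` and each build the table.
        with self._lock:
            if self._table is None:
                self._table = self._build()
            return self._table


def run_first_events(n_threads: int = 8) -> int:
    """Call `get()` from several threads at once and return the build count."""
    shared = LazyTable()
    barrier = threading.Barrier(n_threads)  # release all threads together

    def worker():
        barrier.wait()
        shared.get()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared.build_count
```

With the lock in place the table is built exactly once no matter how many threads hit `get()` on their first event. A bug of this class only shows up while the state is still uninitialized, which would be consistent with failures clustering in the first few events.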

Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.746_RunMuOnia2016C+RunMuOnia2016C+HLTDR2_2016+RECODR2_2016reHLT_skimMuOnia_HIPM+HARVESTDR2/step3_RunMuOnia2016C+RunMuOnia2016C+HLTDR2_2016+RECODR2_2016reHLT_skimMuOnia_HIPM+HARVESTDR2.log#/

shows a corrupted stack trace with crash in

Module: PFProducer:particleFlowTmp (crashed)

Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.738_RunDoubleMuon2016C+RunDoubleMuon2016C+HLTDR2_2016+RECODR2_2016reHLT_HIPM+HARVESTDR2/step2_RunDoubleMuon2016C+RunDoubleMuon2016C+HLTDR2_2016+RECODR2_2016reHLT_HIPM+HARVESTDR2.log#/144-144

has a crash in TBB's internals

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Sun Aug  9 16:33:01 CEST 2020
Thread 12 (Thread 0xffff594f8460 (LWP 94166)):
#2  0x0000ffffba0343b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  std::_Rb_tree_insert_and_rebalance (__insert_left=false, __x=0xfffcc0699930, __p=0xfffcc0699900, __header=...) at ../../../../../libstdc++-v3/src/c++98/tree.cc:203
#5  0x0000ffffac4dfc70 in std::pair<std::_Rb_tree_iterator<std::pair<DDName const, DDI::rep_type<DDName, std::unique_ptr<ROOT::Math::Rotation3D, std::default_delete<ROOT::Math::Rotation3D> > >*> >, bool> std::_Rb_tree<DDName, std::pair<DDName const, DDI::rep_type<DDName, std::unique_ptr<ROOT::Math::Rotation3D, std::default_delete<ROOT::Math::Rotation3D> > >*>, std::_Select1st<std::pair<DDName const, DDI::rep_type<DDName, std::unique_ptr<ROOT::Math::Rotation3D, std::default_delete<ROOT::Math::Rotation3D> > >*> >, std::less<DDName>, std::allocator<std::pair<DDName const, DDI::rep_type<DDName, std::unique_ptr<ROOT::Math::Rotation3D, std::default_delete<ROOT::Math::Rotation3D> > >*> > >::_M_emplace_unique<DDName const&, DDI::rep_type<DDName, std::unique_ptr<ROOT::Math::Rotation3D, std::default_delete<ROOT::Math::Rotation3D> > >*&>(DDName const&, DDI::rep_type<DDName, std::unique_ptr<ROOT::Math::Rotation3D, std::default_delete<ROOT::Math::Rotation3D> > >*&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libDetectorDescriptionCore.so
#6  0x0000ffffac4dfe98 in DDI::Store<DDName, std::unique_ptr<ROOT::Math::Rotation3D, std::default_delete<ROOT::Math::Rotation3D> >, std::unique_ptr<ROOT::Math::Rotation3D, std::default_delete<ROOT::Math::Rotation3D> > >::create(DDName const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libDetectorDescriptionCore.so
#7  0x0000ffffac4de5f8 in DDRotation::DDRotation(DDName const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libDetectorDescriptionCore.so
#8  0x0000ffffa455155c in DDLPosPart::processElement(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, DDCompactView&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libDetectorDescriptionParser.so
#9  0x0000ffffa4558424 in DDLSAX2FileHandler::endElement(unsigned short const*, unsigned short const*, unsigned short const*) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libDetectorDescriptionParser.so
#10 0x0000ffffab9d2718 in xercesc_3_1::SAX2XMLReaderImpl::endElement (this=0xfffcc05b0788, elemDecl=..., uriId=1, isRoot=false, elemPrefix=0xfffcc0762800) at xercesc/parsers/SAX2XMLReaderImpl.cpp:889
#11 0x0000ffffab97eb80 in xercesc_3_1::IGXMLScanner::scanEndTag (this=this@entry=0xfffcc0641e08, gotData=@0xffff594f6427: true) at ./xercesc/framework/XMLBuffer.hpp:171
#12 0x0000ffffab982e58 in xercesc_3_1::IGXMLScanner::scanContent (this=this@entry=0xfffcc0641e08) at xercesc/internal/IGXMLScanner.cpp:881
#13 0x0000ffffab982fa8 in xercesc_3_1::IGXMLScanner::scanDocument (this=0xfffcc0641e08, src=...) at xercesc/internal/IGXMLScanner.cpp:217
#14 0x0000ffffab9d344c in xercesc_3_1::SAX2XMLReaderImpl::parse (this=0xfffcc05b0788, source=...) at xercesc/parsers/SAX2XMLReaderImpl.cpp:409
#15 0x0000ffffa454cfb0 in DDLParser::parse(std::vector<unsigned char, std::allocator<unsigned char> > const&, unsigned int) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libDetectorDescriptionParser.so
#16 0x0000ffffa472386c in magneticfield::VolumeBasedMagneticFieldESProducerFromDB::produce(IdealMagneticFieldRecord const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginMagneticFieldGeomBuilderPlugins.so
#17 0x0000ffffa472d16c in decltype ({parm#1}()) edm::convertException::wrap<edm::eventsetup::Callback<magneticfield::VolumeBasedMagneticFieldESProducerFromDB, std::unique_ptr<MagneticField, std::default_delete<MagneticField> >, IdealMagneticFieldRecord, edm::eventsetup::CallbackSimpleDecorator<IdealMagneticFieldRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}>(edm::eventsetup::Callback<magneticfield::VolumeBasedMagneticFieldESProducerFromDB, std::unique_ptr<MagneticField, std::default_delete<MagneticField> >, IdealMagneticFieldRecord, edm::eventsetup::CallbackSimpleDecorator<IdealMagneticFieldRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginMagneticFieldGeomBuilderPlugins.so
#18 0x0000ffffa472d358 in edm::eventsetup::Callback<magneticfield::VolumeBasedMagneticFieldESProducerFromDB, std::unique_ptr<MagneticField, std::default_delete<MagneticField> >, IdealMagneticFieldRecord, edm::eventsetup::CallbackSimpleDecorator<IdealMagneticFieldRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginMagneticFieldGeomBuilderPlugins.so
#19 0x0000ffffa472e158 in void edm::SerialTaskQueueChain::actionToRun<edm::eventsetup::Callback<magneticfield::VolumeBasedMagneticFieldESProducerFromDB, std::unique_ptr<MagneticField, std::default_delete<MagneticField> >, IdealMagneticFieldRecord, edm::eventsetup::CallbackSimpleDecorator<IdealMagneticFieldRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&>(edm::eventsetup::Callback<magneticfield::VolumeBasedMagneticFieldESProducerFromDB, std::unique_ptr<MagneticField, std::default_delete<MagneticField> >, IdealMagneticFieldRecord, edm::eventsetup::CallbackSimpleDecorator<IdealMagneticFieldRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginMagneticFieldGeomBuilderPlugins.so
#20 0x0000ffffa472e1fc in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::eventsetup::Callback<magneticfield::VolumeBasedMagneticFieldESProducerFromDB, std::unique_ptr<MagneticField, std::default_delete<MagneticField> >, IdealMagneticFieldRecord, edm::eventsetup::CallbackSimpleDecorator<IdealMagneticFieldRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}>(edm::eventsetup::Callback<magneticfield::VolumeBasedMagneticFieldESProducerFromDB, std::unique_ptr<MagneticField, std::default_delete<MagneticField> >, IdealMagneticFieldRecord, edm::eventsetup::CallbackSimpleDecorator<IdealMagneticFieldRecord> >::runProducerAsync(std::__exception_ptr::exception_ptr const*, edm::eventsetup::EventSetupRecordImpl const*, edm::EventSetupImpl const*, edm::ServiceToken const&)::{lambda()#1}&&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginMagneticFieldGeomBuilderPlugins.so
#21 0x0000ffffbc93f648 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0xffffa038fe00, context_guard=..., t=t@entry=0xffffa033c340, isolation=isolation@entry=0) at ../../include/tbb/machine/gcc_generic.h:101
#22 0x0000ffffbc93f89c in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffa038fe00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#23 0x0000ffffbc93ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffbe604e60 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#24 0x0000ffffbe3e02f0 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#25 0x0000ffffbe44d4bc in edm::eventsetup::EventSetupRecordImpl::getFromProxy(edm::eventsetup::DataKey const&, edm::eventsetup::ComponentDescription const*&, bool, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#26 0x0000ffff5c8d0e54 in L1TMuon::GeometryTranslator::checkAndUpdateGeometry(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuon.so
#27 0x0000ffff5c951218 in EMTFSetup::reload(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonEndCap.so
#28 0x0000ffff5c990cc0 in TrackFinder::process(edm::Event const&, edm::EventSetup const&, std::vector<l1t::EMTFHit, std::allocator<l1t::EMTFHit> >&, std::vector<l1t::EMTFTrack, std::allocator<l1t::EMTFTrack> >&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonEndCap.so
#29 0x0000ffff5a5c7330 in L1TMuonEndCapTrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginL1TriggerL1TMuonEndCapPlugins.so
[cut]
Thread 11 (Thread 0xffff59f08460 (LWP 94165)):
#2  0x0000ffffba0343b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffffbcbc6374 in bitmap_sfu (binfo=<optimized out>, bitmap=<optimized out>) at include/jemalloc/internal/bit_util.h:22
#5  arena_slab_reg_alloc_batch (ptrs=<optimized out>, cnt=25, bin_info=<optimized out>, slab=0xfffca002b940) at src/arena.c:296
#6  je_arena_tcache_fill_small (tsdn=0xffff59f0df60, arena=0xfffca0000c80, tcache=<optimized out>, tbin=0xffff59f0e1c8, binind=3, prof_accumbytes=<optimized out>) at src/arena.c:1402
#7  0x0000ffffbcc077fc in je_tcache_alloc_small_hard (tsdn=tsdn@entry=0xffff59f0df60, arena=arena@entry=0xfffca0000c80, tcache=tcache@entry=0xffff59f0e170, tbin=tbin@entry=0xffff59f0e1c8, binind=<optimized out>, tcache_success=tcache_success@entry=0xffff59f06b28) at src/tcache.c:94
#8  0x0000ffffbcbbcca0 in tcache_alloc_small (slow_path=false, zero=false, binind=<optimized out>, size=<optimized out>, tcache=0xffff59f0e170, arena=0xfffca0000c80, tsd=0xffff59f0df60) at include/jemalloc/internal/tsd.h:228
#9  arena_malloc (slow_path=false, tcache=0xffff59f0e170, zero=false, ind=<optimized out>, size=<optimized out>, arena=0x0, tsdn=<optimized out>) at include/jemalloc/internal/arena_inlines_b.h:165
#10 iallocztm (slow_path=false, arena=0x0, is_internal=false, tcache=0xffff59f0e170, zero=false, ind=<optimized out>, size=<optimized out>, tsdn=<optimized out>) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:53
#11 imalloc_no_sample (ind=<optimized out>, usize=48, size=<optimized out>, tsd=<optimized out>, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:1949
#12 imalloc_body (tsd=<optimized out>, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2149
#13 imalloc (dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2260
#14 je_malloc_default (size=<optimized out>) at src/jemalloc.c:2291
#15 0x0000ffffbcbbd254 in malloc (size=size@entry=40) at src/jemalloc.c:2390
#16 0x0000ffffbcc0be3c in newImpl<false> (size=40) at src/jemalloc_cpp.cpp:77
#17 operator new (size=40) at src/jemalloc_cpp.cpp:87
#18 0x0000ffffbddaddec in std::_Rb_tree_node<std::pair<short const, short> >* std::_Rb_tree<short, std::pair<short const, short>, std::_Select1st<std::pair<short const, short> >, std::less<short>, std::allocator<std::pair<short const, short> > >::_M_clone_node<std::_Rb_tree<short, std::pair<short const, short>, std::_Select1st<std::pair<short const, short> >, std::less<short>, std::allocator<std::pair<short const, short> > >::_Alloc_node>(std::_Rb_tree_node<std::pair<short const, short> > const*, std::_Rb_tree<short, std::pair<short const, short>, std::_Select1st<std::pair<short const, short> >, std::less<short>, std::allocator<std::pair<short const, short> > >::_Alloc_node&) [clone .isra.1496] () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libDataFormatsStdDictionaries.so
[cut]
#25 0x0000ffff5aac8724 in std::vector<std::map<short, short, std::less<short>, std::allocator<std::pair<short const, short> > >, std::allocator<std::map<short, short, std::less<short>, std::allocator<std::pair<short const, short> > > > >::operator=(std::vector<std::map<short, short, std::less<short>, std::allocator<std::pair<short const, short> > >, std::allocator<std::map<short, short, std::less<short>, std::allocator<std::pair<short const, short> > > > > const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonBarrel.so
#26 0x0000ffff5aad46ec in L1MuBMLUTHandler::L1MuBMLUTHandler(L1TMuonBarrelParams const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonBarrel.so
#27 0x0000ffff5aaca000 in L1MuBMEUX::run(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonBarrel.so
#28 0x0000ffff5aad6c10 in L1MuBMSEU::run(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonBarrel.so
#29 0x0000ffff5aacf7ec in L1MuBMExtrapolationUnit::run(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonBarrel.so
#30 0x0000ffff5aad7818 in L1MuBMSectorProcessor::run(int, edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonBarrel.so
#31 0x0000ffff5aae7c68 in L1MuBMTrackFinder::run(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonBarrel.so
#32 0x0000ffff5ab7bbbc in L1TMuonBarrelTrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginL1TriggerL1TMuonBarrelPlugins.so
[cut]
Thread 10 (Thread 0xffff5e3c8460 (LWP 78072)):
#0  0x0000ffffbc3a4e24 in poll () from /lib64/libc.so.6
#1  0x0000ffffba034a6c in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x0000ffffba0351cc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x0000ffffba0361cc in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000ffffbc396908 in sched_yield () from /lib64/libc.so.6
#6  0x0000ffffbc93e9fc in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0xffffbba3be00, completion_ref_count=@0xffffa03a3828: 2, isolation=0) at ../../src/tbb/mailbox.h:225
#7  0x0000ffffbc93fa20 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffbba3be00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#8  0x0000ffffbc93ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffbe604e60 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#9  0x0000ffffbe3e02f0 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#10 0x0000ffffbe44d4bc in edm::eventsetup::EventSetupRecordImpl::getFromProxy(edm::eventsetup::DataKey const&, edm::eventsetup::ComponentDescription const*&, bool, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#11 0x0000ffff5c8d0e54 in L1TMuon::GeometryTranslator::checkAndUpdateGeometry(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuon.so
#12 0x0000ffff5c951218 in EMTFSetup::reload(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonEndCap.so
#13 0x0000ffff5c990cc0 in TrackFinder::process(edm::Event const&, edm::EventSetup const&, std::vector<l1t::EMTFHit, std::allocator<l1t::EMTFHit> >&, std::vector<l1t::EMTFTrack, std::allocator<l1t::EMTFTrack> >&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonEndCap.so
#14 0x0000ffff5a5c7330 in L1TMuonEndCapTrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginL1TriggerL1TMuonEndCapPlugins.so
[cut]
Thread 9 (Thread 0xffff5edd8460 (LWP 78070)):
Thread 1 (Thread 0xffffbbbf0000 (LWP 232142)):
#2  0x0000ffffba0343b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffffbc396908 in sched_yield () from /lib64/libc.so.6
#5  0x0000ffffbc93e8d4 in __TBB_Pause () at ../../include/tbb/tbb_machine.h:332
#6  tbb::internal::prolonged_pause () at ../../src/tbb/scheduler_common.h:322
#7  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task (this=0xffffbbae2600, completion_ref_count=@0xffffa03ce828: 2, isolation=0) at ../../src/tbb/custom_scheduler.h:305
#8  0x0000ffffbc93fa20 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0xffffbbae2600, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#9  0x0000ffffbc93ac58 in tbb::interface7::internal::task_arena_base::internal_execute (this=0xffffbe604e60 <edm::esTaskArena()::s_arena>, d=...) at ../../src/tbb/arena.cpp:1105
#10 0x0000ffffbe3e02f0 in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecordImpl const&, edm::eventsetup::DataKey const&, bool, edm::ActivityRegistry const*, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#11 0x0000ffffbe44d4bc in edm::eventsetup::EventSetupRecordImpl::getFromProxy(edm::eventsetup::DataKey const&, edm::eventsetup::ComponentDescription const*&, bool, edm::EventSetupImpl const*) const () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libFWCoreFramework.so
#12 0x0000ffff5c8d0e54 in L1TMuon::GeometryTranslator::checkAndUpdateGeometry(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuon.so
#13 0x0000ffff5c951218 in EMTFSetup::reload(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonEndCap.so
#14 0x0000ffff5c990cc0 in TrackFinder::process(edm::Event const&, edm::EventSetup const&, std::vector<l1t::EMTFHit, std::allocator<l1t::EMTFHit> >&, std::vector<l1t::EMTFTrack, std::allocator<l1t::EMTFTrack> >&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/libL1TriggerL1TMuonEndCap.so
#15 0x0000ffff5a5c7330 in L1TMuonEndCapTrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-09-0000/lib/slc7_aarch64_gcc820/pluginL1TriggerL1TMuonEndCapPlugins.so

Current Modules:

Module: L1TMuonEndCapTrackProducer:simEmtfDigis (crashed)
Module: L1TMuonEndCapTrackProducer:simEmtfDigis
Module: L1TMuonEndCapTrackProducer:simEmtfDigis
Module: none
Module: none

A fatal system signal has occurred: segmentation violation
Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/136.722_RunDoubleEG2016B+RunDoubleEG2016B+HLTDR2_2016+RECODR2_2016reHLT_skimDoubleEG_HIPM+HARVESTDR2/step3_RunDoubleEG2016B+RunDoubleEG2016B+HLTDR2_2016+RECODR2_2016reHLT_skimDoubleEG_HIPM+HARVESTDR2.log

has no stack trace and shows the running modules as

Module: Type0PFMETcorrInputProducer:patPFMetT0Corr (crashed)
Module: BTagPerformanceAnalyzerOnData:bTagAnalysis
Module: none
Module: none
Module: PFCand_AssoMap:pfCandidateToVertexAssociation
Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/134.99603_RunSingleMu2015HLHS+RunSingleMu2015HLHS+HLTDR2_25ns+RECODR2_25nsreHLT_HIPM+HARVESTDR2/step3_RunSingleMu2015HLHS+RunSingleMu2015HLHS+HLTDR2_25ns+RECODR2_25nsreHLT_HIPM+HARVESTDR2.log#/

has no stack trace and only reports one module running

Module: Type0PFMETcorrInputProducer:patPFMetT0Corr (crashed)

The same holds for https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/134.801_RunHLTPhy2015C+RunHLTPhy2015C+HLTDR2_25ns+RECODR2_25nsreHLT_HIPM+HARVESTDR2/step3_RunHLTPhy2015C+RunHLTPhy2015C+HLTDR2_25ns+RECODR2_25nsreHLT_HIPM+HARVESTDR2.log#/

Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/134.706_RunMuonEG2015B+RunMuonEG2015B+HLTDR2_50ns+RECODR2_50nsreHLT_HIPM+HARVESTDR2/step3_RunMuonEG2015B+RunMuonEG2015B+HLTDR2_50ns+RECODR2_50nsreHLT_HIPM+HARVESTDR2.log#/

Has no stack trace and shows the running modules as


Module: PFProducer:particleFlowTmp (crashed)
Module: PFDisplacedVertexCandidateProducer:particleFlowDisplacedVertexCandidate
Module: none
Module: TrackingMonitor:TrackerCollisionSelectedTrackMonCommongeneralTracks
Module: none
Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/129.0_SinglePiPt1+SinglePiPt1+DIGI+RECO/step3_SinglePiPt1+SinglePiPt1+DIGI+RECO.log#/

Seems to contain multiple simultaneous crash reports. No stack traces are given.

A fatal system signal has occurred: 

A fatal system signal has occurred: segmentation violationsegmentation violation
The following is the call stack containing the origin of the signal.

The following is the call stack containing the origin of the signal.

Mon Aug 10 01:23:11 CEST 2020

Current Modules:

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Mon Aug 10 01:23:12 CEST 2020

Current Modules:

Module: Type0PFMETcorrInputProducer:patPFMetT0Corr (crashed)
Module: Type0PFMETcorrInputProducer:patPFMetT0Corr (crashed)Mon Aug 10 01:23:12 CEST 2020

Module: GlobalRecHitsAnalyzer:globalrechitsanalyze
Module: Type0PFMETcorrInputProducer:patPFMetT0Corr
Module: Type0PFMETcorrInputProducer:patPFMetT0Corr
Module: none

A fatal system signal has occurred: segmentation violation

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/130.0_SinglePiPt10+SinglePiPt10+DIGI+RECO/step3_SinglePiPt10+SinglePiPt10+DIGI+RECO.log#/ seems to have the same sort of behavior.

Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/43.0_ZpMM_2250_8TeV+ZpMM_2250_8TeVINPUT+DIGI+RECO+HARVEST/step3_ZpMM_2250_8TeV+ZpMM_2250_8TeVINPUT+DIGI+RECO+HARVEST.log#/

Doesn't have a traceback and shows the running modules as

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Sun Aug  9 22:12:36 CEST 2020

Current Modules:

Module: DeepDoubleXONNXJetTagsProducer:pfMassIndependentDeepDoubleCvBJetTagsSlimmedAK8DeepTags (crashed)
Module: none
Module: DeepDoubleXONNXJetTagsProducer:pfMassIndependentDeepDoubleCvBJetTagsSlimmedAK8DeepTags
Module: DeepDoubleXONNXJetTagsProducer:pfMassIndependentDeepDoubleCvBJetTagsSlimmedAK8DeepTags
Module: none
Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/23.0_JpsiMM+JpsiMMINPUT+DIGI+RECO+HARVEST/step3_JpsiMM+JpsiMMINPUT+DIGI+RECO+HARVEST.log#/

Did not have a stack trace and seems to have concurrent crashes

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Sun Aug  9 22:34:18 CEST 2020

Current Modules:

Module: PFRecoTauDiscriminationByIsolationContainer:hpsPFTauBasicDiscriminators (crashed)

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Sun Aug  9 22:34:19 CEST 2020

Current Modules:

Module: PFRecoTauDiscriminationByIsolationContainer:hpsPFTauBasicDiscriminators (crashed)timeout: the monitored command dumped core

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/4.24_WMuSkim2011A+WMuSkim2011A+HLTDSKIM+RECODR1reHLT+HARVESTDR1reHLT/step3_WMuSkim2011A+WMuSkim2011A+HLTDSKIM+RECODR1reHLT+HARVESTDR1reHLT.log#/

is similar, with the only information being

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Mon Aug 10 00:07:24 CEST 2020

Current Modules:

Module: PFRecoTauDiscriminationByIsolationContainer:hpsPFTauBasicDiscriminatorsdR03 (crashed)timeout: the monitored command dumped core
Dr15Jones commented 4 years ago

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-09-0000/pyRelValMatrixLogs/run/16.0_SingleElectronPt1000+SingleElectronPt1000INPUT+DIGI+RECO+HARVEST/step3_SingleElectronPt1000+SingleElectronPt1000INPUT+DIGI+RECO+HARVEST.log#/

has no stack trace and has running modules

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Sun Aug  9 22:58:54 CEST 2020

Current Modules:

Module: DeepDoubleXONNXJetTagsProducer:pfDeepDoubleBvLJetTagsSlimmedAK8DeepTags (crashed)
Module: DeepDoubleXONNXJetTagsProducer:pfDeepDoubleBvLJetTagsSlimmedAK8DeepTags
Module: DeepDoubleXONNXJetTagsProducer:pfDeepDoubleBvLJetTagsSlimmedAK8DeepTags
Module: none
Module: none
Dr15Jones commented 4 years ago

https://github.com/cms-sw/cmssw/issues/31123#issuecomment-672892406

Looking at step1 for workflow 250408.17 in the most recent UBSAN build https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_amd64_gcc820/CMSSW_11_2_UBSAN_X_2020-08-07-2300/pyRelValMatrixLogs/run/250408.17_QCD_FlatPt_15_3000HS_13+FS_QCD_FlatPt_15_3000HS_13_PRMXUP17_PU50+HARVESTUP17FS+MINIAODMCUP17FS/step1_QCD_FlatPt_15_3000HS_13+FS_QCD_FlatPt_15_3000HS_13_PRMXUP17_PU50+HARVESTUP17FS+MINIAODMCUP17FS.log

I found a report of an error in PFProducer

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/29588d36ea071640a9faf10b3c7a6322/opt/cmssw/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_UBSAN_X_2020-08-07-2300/src/RecoParticleFlow/PFProducer/src/PFEGammaAlgo.cc:99:41: runtime error: member call on address 0x2ab7709cea00 which does not point to an object of type 'PFBlockElementGsfTrack'
0x2ab7709cea00: note: object is of type 'reco::PFBlockElementTrack'
 b7 2a 00 00  f0 88 ef 37 b6 2a 00 00  01 00 00 00 01 2a 00 00  11 00 00 00 00 00 00 00  01 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'reco::PFBlockElementTrack'

Perhaps this is related to the problems on ARM64.
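The sanitizer message above describes a downcast to a type the object does not actually have. A minimal sketch of that pattern and of a checked alternative follows. The classes `Element`, `TrackElement`, and `GsfTrackElement` are hypothetical stand-ins, not the real PF block element types, and this is not a proposed patch for PFEGammaAlgo.

```cpp
#include <cassert>

// Hypothetical class hierarchy for illustration; not the real reco::PFBlockElement types.
struct Element {
  virtual ~Element() = default;
};
struct TrackElement : Element {};
struct GsfTrackElement : Element {
  int gsfId() const { return 42; }
};

// A static_cast to the wrong dynamic type is undefined behaviour;
// dynamic_cast returns nullptr instead, so the mismatch can be handled.
const GsfTrackElement* asGsf(const Element& e) {
  return dynamic_cast<const GsfTrackElement*>(&e);
}

// Returns the GSF id, or -1 when the element is not a GsfTrackElement.
int gsfIdOrMinusOne(const Element& e) {
  const GsfTrackElement* gsf = asGsf(e);
  return gsf ? gsf->gsfId() : -1;
}
```

Passing a `TrackElement` to `gsfIdOrMinusOne` yields -1 rather than a member call on an object of the wrong type.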

Dr15Jones commented 4 years ago

@guitargeek you may want to look at the previous comment in this issue related to PFEGammaAlgo

mrodozov commented 4 years ago

The routine RPCSimSetUp::setRPCSetUp does a tremendous amount of output formatting which is then never seen because the resulting string is passed to LogDebug. See

https://github.com/cms-sw/cmssw/blob/5b54e3a1fc64b1d9764a31ae73226f8f67428f52/SimMuon/RPCDigitizer/src/RPCSimSetUp.cc#L106
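One generic way to avoid paying for formatting that nobody reads is to build the message only when debug output is enabled. The sketch below assumes a hypothetical `DebugSink` type with an `enabled` flag. It is not the CMSSW MessageLogger API and says nothing about how LogDebug itself is implemented.

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for illustration only; not the CMSSW MessageLogger API.
struct DebugSink {
  bool enabled = false;
  std::vector<std::string> lines;
  int formatCalls = 0;

  // The message comes from a callable, so the formatting work
  // runs only when debug output is enabled.
  void log(const std::function<std::string()>& makeMessage) {
    assert(static_cast<bool>(makeMessage));  // an empty callable is a caller error
    if (!enabled) return;
    ++formatCalls;
    lines.push_back(makeMessage());
  }
};

// Formats a list of values; this is the expensive part we want to skip.
std::string formatTable(const std::vector<double>& values) {
  std::ostringstream out;
  for (double v : values) out << v << ' ';
  return out.str();
}

// Logs the table through the sink and returns how many times formatting actually ran.
int dumpTable(DebugSink& sink, const std::vector<double>& values) {
  sink.log([&values] { return formatTable(values); });
  return sink.formatCalls;
}
```

With `enabled` left at `false`, `dumpTable` never calls `formatTable`, so no number-to-string conversion happens at all.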

@mileva

guitargeek commented 4 years ago

Hi, there is already an open issue for the bad cast in PFEGammaAlgo: https://github.com/cms-sw/cmssw/issues/27553

perrotta commented 4 years ago

assign reconstruction, simulation

cmsbuild commented 4 years ago

New categories assigned: reconstruction,simulation

@mdhildreth,@slava77,@perrotta,@jpata,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

dan131riley commented 4 years ago

I see several with corrupted stack traces that nevertheless include TFormula in the crashed stack, e.g.

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-13-1100/pyRelValMatrixLogs/run/136.865_RunMET2018B+RunMET2018B+HLTDR2_2018+RECODR2_2018reHLT_skimMET_Offline+HARVEST2018/step3_RunMET2018B+RunMET2018B+HLTDR2_2018+RECODR2_2018reHLT_skimMET_Offline+HARVEST2018.log

Thread 5 (Thread 0xffff56448460 (LWP 170333)):
#0  0x0000ffffba664e24 in poll () from /lib64/libc.so.6
#1  0x0000ffffb5cb4a6c in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02641/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-12-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x0000ffffb5cb51cc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02641/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-12-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x0000ffffb5cb61cc in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02641/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-12-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000ffff4000007c in ?? ()
#6  0x0000ffffb25dd5e8 in TFormula::Eval(double) const () from /cvmfs/cms-ib.cern.ch/nweek-02641/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-13-1100/external/slc7_aarch64_gcc820/lib/libHist.so
#7  0x0000ffffb25dd5e8 in TFormula::Eval(double) const () from /cvmfs/cms-ib.cern.ch/nweek-02641/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-13-1100/external/slc7_aarch64_gcc820/lib/libHist.so
#8  0x0000fffe23ba3600 in ?? ()

That claims to have crashed in PFRecoTauDiscriminationByIsolationContainer, which does use TFormula. So does Type0PFMETcorrInputProducer, which shows up several times as crashed. We've also previously seen extremely rare crashes in TFormula on AMD64, for example in PFProducer:particleFlowTmp, see issue #22300. That producer also shows up in several of the aarch64 crashes. I'm guessing that we're seeing a threading issue in TFormula, possibly initialization, which happens very rarely on AMD64 but much more frequently on aarch64.

dan131riley commented 4 years ago

There may be a potential race condition here,

https://github.com/root-project/root/blob/084292ab638923f9260d3a9f813227ada0728565/hist/hist/src/TFormula.cxx#L470

where it looks like the TFormula is inserted into the global list before it is completely constructed, opening the possibility for another thread to use an incompletely constructed object (an anti-pattern we’ve seen before). If that happened while the formula was being compiled it could explain the funny stack traces.

The problem with this theory is that typically TFormulas are constructed in module constructors, and I don’t think we can get the race condition in that case?
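To make the suspected anti-pattern concrete, here is a minimal sketch of the safe ordering, where an object is added to a shared list only after its constructor has finished. The `Formula` and `Registry` types are hypothetical illustrations, not ROOT's TFormula or the gROOT function list, and the sketch makes no claim about how ROOT should be changed.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not ROOT's TFormula or its global list.
class Formula {
public:
  explicit Formula(std::string expr) : expr_(std::move(expr)) {
    // Stand-in for the slow compilation step that must finish before use.
    compiled_ = true;
  }
  bool ready() const { return compiled_; }
  const std::string& expr() const { return expr_; }

private:
  std::string expr_;
  bool compiled_ = false;
};

class Registry {
public:
  void add(std::shared_ptr<const Formula> f) {
    std::lock_guard<std::mutex> lock(mutex_);
    items_.push_back(std::move(f));
  }
  std::shared_ptr<const Formula> find(const std::string& expr) const {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const auto& f : items_)
      if (f->expr() == expr) return f;
    return nullptr;
  }

private:
  mutable std::mutex mutex_;
  std::vector<std::shared_ptr<const Formula>> items_;
};

// Registration happens after the constructor returns, so other threads
// looking the formula up can only ever see a finished object.
std::shared_ptr<const Formula> makeAndRegister(Registry& registry, const std::string& expr) {
  auto formula = std::make_shared<const Formula>(expr);
  assert(formula->ready());
  registry.add(formula);
  return formula;
}
```

The risky variant would call `registry.add(this)` from inside the `Formula` constructor before the compilation step, which is the ordering described above.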

dan131riley commented 4 years ago

I've opened https://sft.its.cern.ch/jira/browse/ROOT-10995 for inserting an incomplete TFormula in the global list, and https://sft.its.cern.ch/jira/browse/ROOT-10994 for the issue in #31097 with thread safety of a TFormula read from a file.

makortel commented 4 years ago

Only for the record, https://github.com/cms-sw/cmssw/issues/31101 has one additional case with a crash in

Module: PFProducer:particleFlowTmp (crashed)
Dr15Jones commented 4 years ago

Here is an interesting one https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-17-2300/pyRelValMatrixLogs/run/18.0_SingleGammaPt10+SingleGammaPt10INPUT+DIGI+RECO+HARVEST/step3_SingleGammaPt10+SingleGammaPt10INPUT+DIGI+RECO+HARVEST.log#/

with the traceback being

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Tue Aug 18 04:32:17 CEST 2020
[irrelevant threads cut]
Thread 5 (Thread 0xffff54f58460 (LWP 90767)):
#0  0x0000ffffb8d7a9c4 in nanosleep () from /lib64/libc.so.6
#1  0x0000ffffb8d7a678 in sleep () from /lib64/libc.so.6
#2  0x0000ffffb35343b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-17-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffffb8e6d02c in pthread_kill () from /lib64/libpthread.so.0
#5  0x0000ffffb3536834 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-17-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#6  <signal handler called>
#7  0x0000ffff400000b8 in ?? ()
#8  0x0000ffff54f57380 in ?? ()
#9  0x000c001200033160 in ?? ()
Thread 4 (Thread 0xffff55968460 (LWP 90763)):
#0  0x0000ffffb8da4e24 in poll () from /lib64/libc.so.6
#1  0x0000ffffb3534a6c in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-17-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x0000ffffb35351cc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-17-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x0000ffffb35361cc in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-17-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000ffff400000b8 in ?? ()
#6  0x0000ffff55967380 in ?? ()
#7  0x000c001200033160 in ?? ()
Thread 3 (Thread 0xffff56378460 (LWP 90762)):
#0  0x0000ffffb8d7a9c4 in nanosleep () from /lib64/libc.so.6
#1  0x0000ffffb8d7a678 in sleep () from /lib64/libc.so.6
#2  0x0000ffffb35343b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-17-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffffb8e6d02c in pthread_kill () from /lib64/libpthread.so.0
#5  0x0000ffffb3536834 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-17-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#6  <signal handler called>
#7  0x0000ffff400000b8 in ?? ()
#8  0x0000ffff56377380 in ?? ()
#9  0x000c001200033160 in ?? ()
[cut Thread 2]
Thread 1 (Thread 0xffffb85f0000 (LWP 87305)):
#0  0x0000ffffb8d7a9c4 in nanosleep () from /lib64/libc.so.6
#1  0x0000ffffb8d7a678 in sleep () from /lib64/libc.so.6
#2  0x0000ffffb35343b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-17-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffffb8e6d02c in pthread_kill () from /lib64/libpthread.so.0
#5  0x0000ffffb3536834 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-08-17-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#6  <signal handler called>
#7  0x0000ffff400000b8 in ?? ()
#8  0x0000ffffc5013ff0 in ?? ()

Current Modules:

Module: Type0PFMETcorrInputProducer:patPFMetT0Corr (crashed)
Module: Type0PFMETcorrInputProducer:patPFMetT0Corr
Module: Type0PFMETcorrInputProducer:patPFMetT0Corr
Module: none
Module: Type0PFMETcorrInputProducer:patPFMetT0Corr

A fatal system signal has occurred: segmentation violation

We have 4 simultaneous crashes in Type0PFMETcorrInputProducer, all with unreadable stack values. Of interest: the address of the function in which the segmentation fault occurred is identical for all 4 (0x0000ffff400000b8), while the penultimate addresses on the stack differ: 0x0000ffffc5013ff0, 0x0000ffff56377380, 0x0000ffff55967380, and 0x0000ffff54f57380. Where given, the antepenultimate addresses are all the same: 0x000c001200033160.

dan131riley commented 4 years ago

If this were due to an object being used before it was fully constructed, I'd expect at least one thread (the one constructing the object) to succeed. To get all four threads to crash I think we need overlapping initializations, which leads back to a missing atomic or mutex that happens to be OK on AMD64 but breaks with the more aggressive ARM64 memory (in)consistency model. As for how they all end up at the same address, the two obvious suspects are the global list of functions from gROOT->GetListOfFunctions() and a local mapping:

static std::unordered_map<std::string,  void *> gClingFunctions = std::unordered_map<std::string,  void * >();

Access to both appears to be protected by appropriate mutexes.
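For reference, this is the kind of pattern where a missing atomic matters more on ARM64 than on AMD64: a pointer to lazily initialized data is published to other threads. The sketch below uses a release store paired with an acquire load. `Table` and `LazyTable` are hypothetical names, not code from ROOT or CMSSW. With a plain (non-atomic) pointer the same code is a data race, and a weakly ordered CPU may let a reader see the pointer before the payload write.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical lazily initialised value; not code from ROOT or CMSSW.
struct Table {
  int payload = 0;
};

class LazyTable {
public:
  // The release ordering makes the payload write visible before the pointer is.
  void publish(int value) {
    Table* t = new Table;
    t->payload = value;
    Table* expected = nullptr;
    if (!ptr_.compare_exchange_strong(expected, t, std::memory_order_release,
                                      std::memory_order_relaxed)) {
      delete t;  // another thread published first; keep its table
    }
  }
  // The acquire load pairs with the release store above, so a non-null
  // pointer implies a fully written payload.
  const Table* get() const { return ptr_.load(std::memory_order_acquire); }
  ~LazyTable() { delete ptr_.load(std::memory_order_relaxed); }

private:
  std::atomic<Table*> ptr_{nullptr};
};

// Runs one writer and one reader; returns the payload the reader
// observed once the pointer became visible.
int publishAndRead(int value) {
  LazyTable lazy;
  int seen = -1;
  std::thread reader([&] {
    const Table* t = nullptr;
    while ((t = lazy.get()) == nullptr) std::this_thread::yield();
    seen = t->payload;
  });
  std::thread writer([&] { lazy.publish(value); });
  writer.join();
  reader.join();
  return seen;
}
```

Only the first `publish` wins; later calls free their own table and leave the published one in place, so overlapping initializations cannot replace an object another thread is already using.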

p.s. how long have you been waiting to use "antepenultimate"?

Dr15Jones commented 4 years ago

I found another interesting one this morning https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-19-2300/pyRelValMatrixLogs/run/136.767_RunZeroBias2016E+RunZeroBias2016E+HLTDR2_2016+RECODR2_2016reHLT_ZB_HIPM+HARVESTDR2ZB/step3_RunZeroBias2016E+RunZeroBias2016E+HLTDR2_2016+RECODR2_2016reHLT_ZB_HIPM+HARVESTDR2ZB.log

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Thu Aug 20 06:32:41 CEST 2020
[Thread 6-11 cut]
Thread 5 (Thread 0xffff37b78460 (LWP 45714)):
#0  0x0000ffff9b35a9c4 in nanosleep () from /lib64/libc.so.6
#1  0x0000ffff9b35a678 in sleep () from /lib64/libc.so.6
#2  0x0000ffff968543b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff3ae86090 in LinearGridInterpolator3D::interpolate(float, float, float) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libMagneticFieldInterpolation.so
#5  0x0000ffff3ae88a4c in RectangularCylindricalMFGrid::uncheckedValueInTesla(Point3DBase<float, LocalTag> const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libMagneticFieldInterpolation.so
#6  0x0000ffff3ae861d8 in MFGrid3D::valueInTesla(Point3DBase<float, LocalTag> const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libMagneticFieldInterpolation.so
#7  0x0000ffff3fd85238 in MagVolume::fieldInTesla(Point3DBase<float, GlobalTag> const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libMagneticFieldVolumeGeometry.so
#8  0x0000ffff3fd84bcc in MagVolume::inTesla(Point3DBase<float, GlobalTag> const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libMagneticFieldVolumeGeometry.so
#9  0x0000ffff3f909344 in SteppingHelixPropagator::makeAtomStep(SteppingHelixStateInfo&, SteppingHelixStateInfo&, double, PropagationDirection, SteppingHelixPropagator::Fancy) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackPropagationSteppingHelixPropagator.so
#10 0x0000ffff3f90bef0 in SteppingHelixPropagator::propagate(SteppingHelixStateInfo (&) [8], int&, SteppingHelixPropagator::DestType, double const*, double) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackPropagationSteppingHelixPropagator.so
#11 0x0000ffff3f90d6f4 in SteppingHelixPropagator::propagate(SteppingHelixStateInfo const&, Plane const&, SteppingHelixStateInfo&) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackPropagationSteppingHelixPropagator.so
#12 0x0000ffff3f945934 in CachedTrajectory::propagate(SteppingHelixStateInfo&, Plane const&) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsTrackAssociator.so
#13 0x0000ffff3f946044 in CachedTrajectory::propagateForward(SteppingHelixStateInfo&, float) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsTrackAssociator.so
#14 0x0000ffff3f946494 in CachedTrajectory::propagateAll(SteppingHelixStateInfo const&) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsTrackAssociator.so
#15 0x0000ffff3f959c9c in TrackDetectorAssociator::associate(edm::Event const&, edm::EventSetup const&, TrackAssociatorParameters const&, FreeTrajectoryState const*, FreeTrajectoryState const*) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsTrackAssociator.so
#16 0x0000ffff3f95a310 in TrackDetectorAssociator::associate(edm::Event const&, edm::EventSetup const&, reco::Track const&, TrackAssociatorParameters const&, TrackDetectorAssociator::Direction) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsTrackAssociator.so
#17 0x0000ffff24e8d8cc in MuonIdProducer::fillMuonId(edm::Event&, edm::EventSetup const&, reco::Muon&, TrackDetectorAssociator::Direction) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginRecoMuonMuonIdentificationPlugins.so
#18 0x0000ffff24e91644 in MuonIdProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginRecoMuonMuonIdentificationPlugins.so
[cut]
Thread 4 (Thread 0xffff38788460 (LWP 45713)):
#0  0x0000ffff9b35a9c4 in nanosleep () from /lib64/libc.so.6
#1  0x0000ffff9b35a678 in sleep () from /lib64/libc.so.6
#2  0x0000ffff968543b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff830e42f8 in MultiTrajectoryStateAssembler::addStateVector(std::vector<TrajectoryStateOnSurface, std::allocator<TrajectoryStateOnSurface> > const&) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsGsfTools.so
#5  0x0000ffff830e4ce4 in MultiTrajectoryStateAssembler::addState(TrajectoryStateOnSurface) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsGsfTools.so
#6  0x0000ffff3eb523f8 in GsfMaterialEffectsUpdator::updateState(TrajectoryStateOnSurface const&, PropagationDirection) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsGsfTracking.so
#7  0x0000ffff3eb490c0 in FullConvolutionWithMaterial::operator()(TrajectoryStateOnSurface const&, PropagationDirection) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsGsfTracking.so
#8  0x0000ffff3eb53d1c in GsfPropagatorWithMaterial::convoluteWithMaterial(std::pair<TrajectoryStateOnSurface, double> const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsGsfTracking.so
#9  0x0000ffff3eb541f8 in GsfPropagatorWithMaterial::propagateWithPath(TrajectoryStateOnSurface const&, Plane const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsGsfTracking.so
#10 0x0000ffff3eb56f80 in GsfTrajectoryFitter::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libTrackingToolsGsfTracking.so
#11 0x0000ffff228cee28 in LowPtGsfElectronSeedProducer::lightGsfTracking(reco::PreId&, edm::Ref<std::vector<reco::Track, std::allocator<reco::Track> >, reco::Track, edm::refhelper::FindUsingAdvance<std::vector<reco::Track, std::allocator<reco::Track> >, reco::Track> > const&, reco::ElectronSeed const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginRecoEgammaEgammaElectronProducersPlugins.so
#12 0x0000ffff228da820 in void LowPtGsfElectronSeedProducer::loop<reco::PFRecTrack>(edm::Handle<std::vector<reco::PFRecTrack, std::allocator<reco::PFRecTrack> > > const&, edm::Handle<std::vector<reco::PFCluster, std::allocator<reco::PFCluster> > >&, std::vector<reco::ElectronSeed, std::allocator<reco::ElectronSeed> >&, std::vector<reco::PreId, std::allocator<reco::PreId> >&, std::vector<reco::PreId, std::allocator<reco::PreId> >&, std::unordered_map<unsigned int, unsigned long, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, unsigned long> > >&, edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginRecoEgammaEgammaElectronProducersPlugins.so
#13 0x0000ffff228d1164 in LowPtGsfElectronSeedProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginRecoEgammaEgammaElectronProducersPlugins.so
[cut]

Thread 3 (Thread 0xffff39198460 (LWP 45712)):
#0  0x0000ffff9b384e24 in poll () from /lib64/libc.so.6
#1  0x0000ffff96854a6c in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x0000ffff968551cc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x0000ffff968561cc in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000ffff200000a0 in ?? ()
#6  0x0000ffff391971e0 in ?? ()
#7  0x0000ffff9347d480 in TFormula::DoEval(double const*, double const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/external/slc7_aarch64_gcc820/lib/libHist.so
#8  0x0000ffff9347d480 in TFormula::DoEval(double const*, double const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/external/slc7_aarch64_gcc820/lib/libHist.so
#9  0x0000fffe044debe0 in ?? ()
[Thread 2 cut]
Thread 1 (Thread 0xffff9abd0000 (LWP 45492)):
#0  0x0000ffff9b35a9c4 in nanosleep () from /lib64/libc.so.6
#1  0x0000ffff9b35a678 in sleep () from /lib64/libc.so.6
#2  0x0000ffff968543b8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x0000ffff24ffe7a0 in GroupedCkfTrajectoryBuilder::groupedLimitedCandidates(TrajectorySeed const&, TempTrajectory const&, TrajectoryFilter const*, Propagator const*, bool, std::vector<TempTrajectory, std::allocator<TempTrajectory> >&) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginRecoTrackerCkfPatternPlugins.so
#5  0x0000ffff24fff148 in GroupedCkfTrajectoryBuilder::buildTrajectories(TrajectorySeed const&, std::vector<Trajectory, std::allocator<Trajectory> >&, unsigned int&, TrajectoryFilter const*) const () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/pluginRecoTrackerCkfPatternPlugins.so
#6  0x0000ffff25ff94b4 in cms::CkfTrackCandidateMakerBase::produceBase(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02642/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-19-2300/lib/slc7_aarch64_gcc820/libRecoTrackerCkfPattern.so

Current Modules:

Module: ShiftedParticleProducer:shiftedPatUnclusteredEnUp (crashed)
Module: CkfTrackCandidateMaker:pixelPairStepTrackCandidates
Module: none
Module: MuonIdProducer:earlyDisplacedMuons
Module: none

This one definitely shows that we are crashing in TFormula::DoEval. The interesting thing to note is that the two TFormula::DoEval calls shown on the stack have exactly the same address! I have a very hard time being able to explain how that would reasonably be true.

p.s. how long have you been waiting to use "antepenultimate"?

I knew such a word existed, but I actually had to look it up :).

Dr15Jones commented 4 years ago

One thing that argues against our hypothesis of a race condition in some sort of initialization: when we've seen that in the past, the majority of the tracebacks showed more than one thread around the region of the crash. In this case that does not seem to be the tendency. I'm wondering if the problem could be related to thread local storage.

Dr15Jones commented 4 years ago

Here's a crash from one of the unit tests

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-20-1100/unitTestLogs/DQMServices/Demo

With the error being

+ cmsRun /home/cmsbld/jenkins_c/workspace/ib-run-qa/CMSSW_11_2_X_2020-08-20-1100/src/DQMServices/Demo/test/run_analyzers_cfg.py outfile=huge.root numberEventsInRun=300 numberEventsInLuminosityBlock=100 nEvents=600 nThreads=10 nConcurrent=2 howmany=1000 nolegacy=True
%MSG-i ThreadStreamSetup:  (NoModuleName) 20-Aug-2020 13:17:09 CEST pre-events
setting # threads 10
setting # streams 10
%MSG

A fatal system signal has occurred: 

A fatal system signal has occurred: segmentation violationsegmentation violation
The following is the call stack containing the origin of the signal.

thread_monitor Resource temporarily unavailable in pthread_create

The following is the call stack containing the origin of the signal.

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

terminate called after throwing an instance of 'edm::Exception'
  what():  An exception of category 'FatalRootError' occurred.
   Additional Info:
      [a] Fatal Root Error: @SUB=TClass::LoadClassInfo
no interpreter information for class TVirtualStreamerInfo is available even though it has a TClass initialization routine.

/home/cmsbld/jenkins_c/workspace/ib-run-qa/CMSSW_11_2_X_2020-08-20-1100/src/DQMServices/Demo/test/runtests.sh: line 87: 190547 Aborted                 cmsRun $LOCAL_TEST_DIR/run_analyzers_cfg.py outfile=huge.root numberEventsInRun=300 numberEventsInLuminosityBlock=100 nEvents=600 nThreads=10 nConcurrent=2 howmany=1000 nolegacy=True
status = 256
Thu Aug 20 13:17:12 CEST 2020

---> test TestDQMServicesDemo had ERRORS

It looks like the previous successful unit tests in that package were only using 1 thread. Another thing to note is that we didn't even make it to the first events to be processed before the crash happened (since there are no 'Begin processing' messages).

dan131riley commented 4 years ago

Here's an assertion failure in PFProducer:particleFlowTmp:

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-24-1100/pyRelValMatrixLogs/run/136.727_RunDoubleEGPrpt2016B+RunDoubleEGPrpt2016B+HLTDR2_2016+RECODR2_2016reHLT_HIPM+HARVESTDR2/step3_RunDoubleEGPrpt2016B+RunDoubleEGPrpt2016B+HLTDR2_2016+RECODR2_2016reHLT_HIPM+HARVESTDR2.log

cmsRun: /home/cmsbld/jenkins_b/workspace/build-any-ib/w/tmp/BUILDROOT/a66bd61c43fd711ce3ac358fe068d8e6/opt/cmssw/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-08-23-0000/src/RecoParticleFlow/PFProducer/src/PFAlgo.cc:2175: void PFAlgo::createCandidatesHCAL(const reco::PFBlock&, reco::PFBlock::LinkData&, const edm::OwnVector<reco::PFBlockElement>&, std::vector<bool>&, const PFBlockRef&, ElementIndices&, std::vector<bool>&): Assertion `caloEnergy >= 0' failed.

PFProducer appears to use a TFormula from here:

https://github.com/cms-sw/cmssw/blob/ddc97a523f79f3dc7983c8d9bcd501d534c1aeea/RecoParticleFlow/PFProducer/plugins/PFProducer.cc#L214-L225

That's in PFProducer::beginRun(), so it could be subject to the use-before-constructed bug I opened a ROOT ticket for, and pfEnergyCalibration_ is used in the calculation of caloEnergy. So apparently the TFormula concurrency issues can cause incorrect results as well as segmentation faults.

dan131riley commented 4 years ago

I found two logs with non-informative stack traces that claim to have crashed in TrackSelector:esSelectedTracks, which uses StringCutObjectSelector, not TFormula. In both cases, AlignmentTrackSelectorModule:ALCARECOMuAlOverlapsGeneralTracks is also running.

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-31-2300/pyRelValMatrixLogs/run/12834.0_TTbar_14TeV+2024+TTbar_14TeV_TuneCP5_GenSimINPUT+Digi+Reco+HARVEST+ALCA/step5_TTbar_14TeV+2024+TTbar_14TeV_TuneCP5_GenSimINPUT+Digi+Reco+HARVEST+ALCA.log

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-31-2300/pyRelValMatrixLogs/run/10001.0_SingleElectronPt10+2017+SingleElectronPt10_pythia8_GenSimINPUT+Digi+Reco+HARVEST+ALCA+Nano/step5_SingleElectronPt10+2017+SingleElectronPt10_pythia8_GenSimINPUT+Digi+Reco+HARVEST+ALCA+Nano.log

mrodozov commented 4 years ago

Here's a crash from one of the unit tests

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-08-20-1100/unitTestLogs/DQMServices/Demo

With the error being

+ cmsRun /home/cmsbld/jenkins_c/workspace/ib-run-qa/CMSSW_11_2_X_2020-08-20-1100/src/DQMServices/Demo/test/run_analyzers_cfg.py outfile=huge.root numberEventsInRun=300 numberEventsInLuminosityBlock=100 nEvents=600 nThreads=10 nConcurrent=2 howmany=1000 nolegacy=True
%MSG-i ThreadStreamSetup:  (NoModuleName) 20-Aug-2020 13:17:09 CEST pre-events
setting # threads 10
setting # streams 10
%MSG

A fatal system signal has occurred: 

A fatal system signal has occurred: segmentation violationsegmentation violation
The following is the call stack containing the origin of the signal.

thread_monitor Resource temporarily unavailable in pthread_create

The following is the call stack containing the origin of the signal.

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

terminate called after throwing an instance of 'edm::Exception'
  what():  An exception of category 'FatalRootError' occurred.
   Additional Info:
      [a] Fatal Root Error: @SUB=TClass::LoadClassInfo
no interpreter information for class TVirtualStreamerInfo is available even though it has a TClass initialization routine.

/home/cmsbld/jenkins_c/workspace/ib-run-qa/CMSSW_11_2_X_2020-08-20-1100/src/DQMServices/Demo/test/runtests.sh: line 87: 190547 Aborted                 cmsRun $LOCAL_TEST_DIR/run_analyzers_cfg.py outfile=huge.root numberEventsInRun=300 numberEventsInLuminosityBlock=100 nEvents=600 nThreads=10 nConcurrent=2 howmany=1000 nolegacy=True
status = 256
Thu Aug 20 13:17:12 CEST 2020

---> test TestDQMServicesDemo had ERRORS

It looks like the previous successful unit tests in that package were only using 1 thread. Another thing to note is that we didn't even make it to the first events to be processed before the crash happened (since there are no 'Begin processing' messages).

This one is failing because of the limit on virtual memory: https://github.com/cms-sw/cmssw/blob/master/DQMServices/Demo/test/runtests.sh#L85 and the ASAN build reports it: https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc820/CMSSW_11_2_ASAN_X_2020-09-16-2300/unitTestLogs/DQMServices/Demo#/5729 with errno: 12 (cannot allocate memory). As we now know, virtual memory use on ARM is higher than on AMD and IBM, and that's why it hits the limit and fails. I removed the ulimit and the test ran. I also ran with cmsrunglibc and it ran as well. This limit was added here: https://github.com/cms-sw/cmssw/commit/27dd907088652012ffdcec14599446561c334331 but with no explanation of why it was set.

makortel commented 4 years ago

This one is failing because of the limit on virtual memory: https://github.com/cms-sw/cmssw/blob/master/DQMServices/Demo/test/runtests.sh#L85 and the ASAN build reports it: https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc820/CMSSW_11_2_ASAN_X_2020-09-16-2300/unitTestLogs/DQMServices/Demo#/5729 with errno: 12 (cannot allocate memory). As we now know, virtual memory use on ARM is higher than on AMD and IBM, and that's why it hits the limit and fails. I removed the ulimit and the test ran. I also ran with cmsrunglibc and it ran as well. This limit was added here: 27dd907 but with no explanation of why it was set.

@cms-sw/dqm-l2 @schneiml Can you comment on why the test added in https://github.com/cms-sw/cmssw/commit/27dd907088652012ffdcec14599446561c334331 (#28612) explicitly limits virtual memory?

jfernan2 commented 4 years ago

Hi @makortel, @schneiml has not been in CMS since 31st July. I can try to contact him privately if needed, but my understanding from his PR description is that he introduced this check to keep the test from blowing up in IB testing.

makortel commented 4 years ago

Should the limit be changed to be on RSS then?

jfernan2 commented 4 years ago

Yes, I believe it makes sense