cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.32k forks source link

Segfault on file close in materialBudgetTrackerPlots test #30863

Closed mrodozov closed 4 years ago

mrodozov commented 4 years ago

The test failing in https://cmssdt.cern.ch/SDT/cgi-bin/logreader/cc8_aarch64_gcc8/CMSSW_11_2_X_2020-07-20-2300/unitTestLogs/Validation/Geometry#/ and https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-07-20-2300/unitTestLogs/Validation/Geometry#/

has the following log:


21-Jul-2020 21:00:01 CEST  Closed file file:single_neutrino_random.root

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Tue Jul 21 21:00:07 CEST 2020
Thread 257 (Thread 0xfffc3abaf200 (LWP 49931)):
#0  0x0000ffff9f13a9c4 in nanosleep () from /lib64/libc.so.6
#1  0x0000ffff9f13a678 in sleep () from /lib64/libc.so.6
#2  0x0000ffff907768d8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-2300/lib/slc7$
#3  <signal handler called>
#4  0x0000ffff9f16a6e4 in syscall () from /lib64/libc.so.6
#5  0x0000ffff9c6b40e8 in tbb::internal::futex_wait (comparand=2, futex=0xffff8e8a852c) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0xffff8e8a852c) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0xffff8e8a8520) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0xffff8e8a8500) at ../../src/tbb/private_server.cpp:273
#9  0x0000ffff9c6b410c in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x0000ffff9f347d38 in start_thread () from /lib64/libpthread.so.0
#11 0x0000ffff9f16f5f0 in thread_start () from /lib64/libc.so.6
Thread 256 (Thread 0xfffc3afbf200 (LWP 49930)):
#0  0x0000ffff9f13a9c4 in nanosleep () from /lib64/libc.so.6
#1  0x0000ffff9f13a678 in sleep () from /lib64/libc.so.6
#2  0x0000ffff907768d8 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-2300/lib/slc7$
#3  <signal handler called>
#4  0x0000ffff9f16a6e4 in syscall () from /lib64/libc.so.6
#5  0x0000ffff9c6b40e8 in tbb::internal::futex_wait (comparand=2, futex=0xffff8e8a842c) at ../../include/tbb/machine/linux_common.h:81
#6  tbb::internal::binary_semaphore::P (this=0xffff8e8a842c) at ../../src/tbb/semaphore.h:205
#7  rml::internal::thread_monitor::commit_wait (c=..., this=0xffff8e8a8420) at ../../src/tbb/../rml/server/thread_monitor.h:255
#8  tbb::internal::rml::private_worker::run (this=0xffff8e8a8400) at ../../src/tbb/private_server.cpp:273
#9  0x0000ffff9c6b410c in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#10 0x0000ffff9f347d38 in start_thread () from /lib64/libpthread.so.0
#11 0x0000ffff9f16f5f0 in thread_start () from /lib64/libc.so.6

It's very consistent in the tests but not when ran by hand.

cmsbuild commented 4 years ago

A new Issue was created by @mrodozov Mircho Rodozov.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 4 years ago

assign dqm, geometry

cmsbuild commented 4 years ago

New categories assigned: geometry,dqm

@Dr15Jones,@cvuosalo,@mdhildreth,@jfernan2,@andrius-k,@schneiml,@fioriNTU,@makortel,@kmaeshima,@ianna,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 4 years ago

@mrodozov Is that the full log? I'd guess from the thread numbers that something could be missing from there.

mrodozov commented 4 years ago

runP_Tracker_cfg1.log here is the full log

makortel commented 4 years ago

Thanks. So of the 257 threads, 255 are in tbb::internal::futex_wait (!?).

smuzaffar commented 4 years ago

strange, why cmsRun is using 256 threads? What is default thread value for cmsRun?

makortel commented 4 years ago

The default should be 1...

makortel commented 4 years ago

Here is a relevant stack trace I got when running myself

Thread 1 (Thread 0x7f15c4e93740 (LWP 29425)):
#0  0x00007f15c3e7fc3d in poll () from /lib64/libc.so.6
#1  0x00007f15acb7eb2f in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#2  0x00007f15acb7f26c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  0x00007f15acb80149 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000000007626070 in ?? ()
#6  0x00007f15662f97fb in MaterialBudgetTree::endOfRun() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#7  0x00007f1566115330 in RunAction::EndOfRunAction(G4Run const*) () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#8  0x00007f156611ef6b in RunManagerMTWorker::terminateRun() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#9  0x00007f1566124de0 in RunManagerMTWorker::~RunManagerMTWorker() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#10 0x00007f15660fa2d8 in OscarMTProducer::~OscarMTProducer() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#11 0x00007f15660fa32e in OscarMTProducer::~OscarMTProducer() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#12 0x00007f15c2416ad6 in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::~ProducingModuleAdaptorBase() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#13 0x00007f156610182c in edm::stream::ProducingModuleAdaptor<OscarMTProducer, edm::stream::EDProducerBase, edm::stream::EDProducerAdaptorBase>::~ProducingModuleAdaptor() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#14 0x00007f156606d7a1 in std::_Sp_counted_ptr_inplace<edm::maker::ModuleHolderT<edm::stream::EDProducerAdaptorBase>, std::allocator<edm::maker::ModuleHolderT<edm::stream::EDProducerAdaptorBase> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#15 0x00007f15c23a9368 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#16 0x00007f15c23a93c6 in std::_Sp_counted_ptr<edm::ModuleRegistry*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#17 0x00007f15c25f9207 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCorePythonFramework.so
#18 0x00007f15c22d5b5f in std::default_delete<edm::Schedule>::operator()(edm::Schedule*) const [clone .isra.1201] [clone .part.1202] () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#19 0x00007f15c22dc298 in edm::EventProcessor::~EventProcessor() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#20 0x00007f15c25fbed8 in pybind11::class_<PythonEventProcessor>::dealloc(pybind11::detail::value_and_holder&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCorePythonFramework.so
#21 0x00007f15c260701c in pybind11::detail::clear_instance(_object*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCorePythonFramework.so
#22 0x00007f15c2607ffe in pybind11_object_dealloc () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCorePythonFramework.so
#23 0x00007f15c4b0a9ec in dict_dealloc (mp=0x7f15ad0995c8) at Objects/dictobject.c:1086
#24 0x00007f15c4b29ae6 in subtype_dealloc (self=0x7f15ad10bf90) at Objects/typeobject.c:1035
#25 0x00007f15c4b09e7f in insertdict_by_entry (mp=0x7f15c4e2a168, key=0x7f15c4d077e0, hash=768398212998346496, ep=0x16edcf0, value=0x7f15c4c3de00 <_Py_NoneStruct>) at Objects/dictobject.c:519
#26 0x00007f15c4b0c9ef in insertdict (value=0x7f15c4c3de00 <_Py_NoneStruct>, hash=768398212998346496, key=0x7f15c4d077e0, mp=0x7f15c4e2a168) at Objects/dictobject.c:556
#27 dict_set_item_by_hash_or_entry (value=0x7f15c4c3de00 <_Py_NoneStruct>, ep=0x0, hash=768398212998346496, key=0x7f15c4d077e0, op=0x7f15c4e2a168) at Objects/dictobject.c:795
#28 PyDict_SetItem (op=op@entry=0x7f15c4e2a168, key=0x7f15c4d077e0, value=value@entry=0x7f15c4c3de00 <_Py_NoneStruct>) at Objects/dictobject.c:848
#29 0x00007f15c4b10c2c in _PyModule_Clear (m=<optimized out>) at Objects/moduleobject.c:139
#30 0x00007f15c4b9171b in PyImport_Cleanup () at Python/import.c:483
#31 0x00007f15c4ba2d94 in Py_Finalize () at Python/pythonrun.c:465
#32 Py_Finalize () at Python/pythonrun.c:411
#33 0x00007f15c4bb6996 in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:670
#34 0x00007f15c3dae555 in __libc_start_main () from /lib64/libc.so.6
#35 0x000000000040108e in _start ()

Total number of threads was 17. Threads 3-17 were in tbb::internal::futex_wait, and 2 in edm::service::InitRootHandlers::stacktraceHelperThread().

makortel commented 4 years ago

I think MaterialBudgetTree::endOfRun() expects to be run once https://github.com/cms-sw/cmssw/blob/3031b83072310f0ce4053f45a3a01e91768ac13d/Validation/Geometry/src/MaterialBudgetTree.cc#L212-L228

makortel commented 4 years ago

I'm wondering if at least the amount of TBB threads could be related to the thread-local storage cleanup trick in https://github.com/cms-sw/cmssw/blob/3031b83072310f0ce4053f45a3a01e91768ac13d/SimG4Core/Application/src/RunManagerMTWorker.cc#L188-L212 It does call tbb::task::enqueue() which, at least with only 1 thread for TBB, does add additional thread(s?) to guarantee progress.

Dr15Jones commented 4 years ago

I just took a look and for the python module CmsRun I failed to update it to match a change I made a while ago where I moved the setting of the number of TBB threads. So right now CmsRun will use TBB's default (not cmsRun's default) which is to use all threads. I will work on a fix.

makortel commented 4 years ago

Building with debug symbols and disabling the simulation big library the segfault occurs (when it occurs) on this line https://github.com/cms-sw/cmssw/blob/3031b83072310f0ce4053f45a3a01e91768ac13d/Validation/Geometry/src/MaterialBudgetTree.cc#L220

makortel commented 4 years ago

I added a printout

   TFile *outFile = new TFile(fname.c_str(), "RECREATE");
   outFile->mkdir("TEST");

+  edm::LogPrint("foo") << "MaterialBudgetTree::endOfRun() theTree " << theTree.get() << " outFile " << outFile;
   if (theTree) {
     TTree *t1 = theTree->CloneTree();
     t1->Write();

and indeed when it is crashing it is called a second time

%MSG
MaterialBudgetTree::endOfRun() theTree 0x7ff420548e70 outFile 0x7ff4208f5220
%MSG-w MaterialBudget: 
TTree Written 0x7ff4208f6f40
%MSG
22-Jul-2020 16:33:24 CEST  Closed file file:single_neutrino_random.root
MaterialBudgetTree::endOfRun() theTree 0xa9d3690 outFile 0x2253540

with garbage-looking values for the pointers.

makortel commented 4 years ago

assign simulation

cmsbuild commented 4 years ago

New categories assigned: simulation

@mdhildreth,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 4 years ago

The second call to EndOfRun() can come from https://github.com/cms-sw/cmssw/blob/3031b83072310f0ce4053f45a3a01e91768ac13d/SimG4Core/Application/src/RunManagerMTWorker.cc#L169-L172 when the destructor runs in a different thread than the RunManagerMTWorker::endRun().

@civanch Can you comment how necessary it is to call the code in https://github.com/cms-sw/cmssw/blob/3031b83072310f0ce4053f45a3a01e91768ac13d/SimG4Core/Application/src/RunManagerMTWorker.cc#L433-L449 when destructing OscarMTProducer for the thread(s) where the RunManagerMTWorker::endRun() was not called in?

civanch commented 4 years ago

@makortel , may be it is not what destructor should do. Better to left something not finished if terminateRun() is not called. I will try to the fix into pending PR #30652 or better make a new one.

civanch commented 4 years ago

30887 should fix the issue.

makortel commented 4 years ago

Thanks @civanch !

civanch commented 4 years ago

@mrodozov , can you, please, confirm that it is fixed now?

makortel commented 4 years ago

Yes, the Validation/Geometry unit test has been succeeding since CMSSW_11_2_X_2020-07-25-1100, which was the first one to include #30887.

civanch commented 4 years ago

+1

mrodozov commented 4 years ago

I'm closing this it's resolved