Closed mrodozov closed 4 years ago
A new Issue was created by @mrodozov Mircho Rodozov.
@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign dqm, geometry
New categories assigned: geometry,dqm
@Dr15Jones,@cvuosalo,@mdhildreth,@jfernan2,@andrius-k,@schneiml,@fioriNTU,@makortel,@kmaeshima,@ianna,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks
@mrodozov Is that the full log? I'd guess from the thread numbers that something could be missing from there.
runP_Tracker_cfg1.log here is the full log
Thanks. So of the 257 threads, 255 are in tbb::internal::futex_wait
(!?).
strange, why cmsRun is using 256 threads? What is default thread value for cmsRun?
The default should be 1...
Here is a relevant stack trace I got when running myself
Thread 1 (Thread 0x7f15c4e93740 (LWP 29425)):
#0 0x00007f15c3e7fc3d in poll () from /lib64/libc.so.6
#1 0x00007f15acb7eb2f in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#2 0x00007f15acb7f26c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3 0x00007f15acb80149 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x0000000007626070 in ?? ()
#6 0x00007f15662f97fb in MaterialBudgetTree::endOfRun() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#7 0x00007f1566115330 in RunAction::EndOfRunAction(G4Run const*) () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#8 0x00007f156611ef6b in RunManagerMTWorker::terminateRun() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#9 0x00007f1566124de0 in RunManagerMTWorker::~RunManagerMTWorker() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#10 0x00007f15660fa2d8 in OscarMTProducer::~OscarMTProducer() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#11 0x00007f15660fa32e in OscarMTProducer::~OscarMTProducer() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#12 0x00007f15c2416ad6 in edm::stream::ProducingModuleAdaptorBase<edm::stream::EDProducerBase>::~ProducingModuleAdaptorBase() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#13 0x00007f156610182c in edm::stream::ProducingModuleAdaptor<OscarMTProducer, edm::stream::EDProducerBase, edm::stream::EDProducerAdaptorBase>::~ProducingModuleAdaptor() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#14 0x00007f156606d7a1 in std::_Sp_counted_ptr_inplace<edm::maker::ModuleHolderT<edm::stream::EDProducerAdaptorBase>, std::allocator<edm::maker::ModuleHolderT<edm::stream::EDProducerAdaptorBase> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/biglib/slc7_amd64_gcc820/pluginSimulation.so
#15 0x00007f15c23a9368 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#16 0x00007f15c23a93c6 in std::_Sp_counted_ptr<edm::ModuleRegistry*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#17 0x00007f15c25f9207 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCorePythonFramework.so
#18 0x00007f15c22d5b5f in std::default_delete<edm::Schedule>::operator()(edm::Schedule*) const [clone .isra.1201] [clone .part.1202] () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#19 0x00007f15c22dc298 in edm::EventProcessor::~EventProcessor() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#20 0x00007f15c25fbed8 in pybind11::class_<PythonEventProcessor>::dealloc(pybind11::detail::value_and_holder&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCorePythonFramework.so
#21 0x00007f15c260701c in pybind11::detail::clear_instance(_object*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCorePythonFramework.so
#22 0x00007f15c2607ffe in pybind11_object_dealloc () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCorePythonFramework.so
#23 0x00007f15c4b0a9ec in dict_dealloc (mp=0x7f15ad0995c8) at Objects/dictobject.c:1086
#24 0x00007f15c4b29ae6 in subtype_dealloc (self=0x7f15ad10bf90) at Objects/typeobject.c:1035
#25 0x00007f15c4b09e7f in insertdict_by_entry (mp=0x7f15c4e2a168, key=0x7f15c4d077e0, hash=768398212998346496, ep=0x16edcf0, value=0x7f15c4c3de00 <_Py_NoneStruct>) at Objects/dictobject.c:519
#26 0x00007f15c4b0c9ef in insertdict (value=0x7f15c4c3de00 <_Py_NoneStruct>, hash=768398212998346496, key=0x7f15c4d077e0, mp=0x7f15c4e2a168) at Objects/dictobject.c:556
#27 dict_set_item_by_hash_or_entry (value=0x7f15c4c3de00 <_Py_NoneStruct>, ep=0x0, hash=768398212998346496, key=0x7f15c4d077e0, op=0x7f15c4e2a168) at Objects/dictobject.c:795
#28 PyDict_SetItem (op=op@entry=0x7f15c4e2a168, key=0x7f15c4d077e0, value=value@entry=0x7f15c4c3de00 <_Py_NoneStruct>) at Objects/dictobject.c:848
#29 0x00007f15c4b10c2c in _PyModule_Clear (m=<optimized out>) at Objects/moduleobject.c:139
#30 0x00007f15c4b9171b in PyImport_Cleanup () at Python/import.c:483
#31 0x00007f15c4ba2d94 in Py_Finalize () at Python/pythonrun.c:465
#32 Py_Finalize () at Python/pythonrun.c:411
#33 0x00007f15c4bb6996 in Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:670
#34 0x00007f15c3dae555 in __libc_start_main () from /lib64/libc.so.6
#35 0x000000000040108e in _start ()
Total number of threads was 17. Threads 3-17 were in tbb::internal::futex_wait
, and 2 in edm::service::InitRootHandlers::stacktraceHelperThread()
.
I think MaterialBudgetTree::endOfRun()
expects to be run once
https://github.com/cms-sw/cmssw/blob/3031b83072310f0ce4053f45a3a01e91768ac13d/Validation/Geometry/src/MaterialBudgetTree.cc#L212-L228
I'm wondering if at least the amount of TBB threads could be related to the thread-local storage cleanup trick in
https://github.com/cms-sw/cmssw/blob/3031b83072310f0ce4053f45a3a01e91768ac13d/SimG4Core/Application/src/RunManagerMTWorker.cc#L188-L212
It does call tbb::task::enqueue()
which, at least with only 1 thread for TBB, does add additional thread(s?) to guarantee progress.
I just took a look and for the python module CmsRun I failed to update it to match a change I made a while ago where I moved the setting of the number of TBB threads. So right now CmsRun will use TBB's default (not cmsRun's default) which is to use all threads. I will work on a fix.
Building with debug symbols and disabling the simulation big library the segfault occurs (when it occurs) on this line https://github.com/cms-sw/cmssw/blob/3031b83072310f0ce4053f45a3a01e91768ac13d/Validation/Geometry/src/MaterialBudgetTree.cc#L220
I added a printout
TFile *outFile = new TFile(fname.c_str(), "RECREATE");
outFile->mkdir("TEST");
+ edm::LogPrint("foo") << "MaterialBudgetTree::endOfRun() theTree " << theTree.get() << " outFile " << outFile;
if (theTree) {
TTree *t1 = theTree->CloneTree();
t1->Write();
and indeed when it is crashing it is called a second time
%MSG
MaterialBudgetTree::endOfRun() theTree 0x7ff420548e70 outFile 0x7ff4208f5220
%MSG-w MaterialBudget:
TTree Written 0x7ff4208f6f40
%MSG
22-Jul-2020 16:33:24 CEST Closed file file:single_neutrino_random.root
MaterialBudgetTree::endOfRun() theTree 0xa9d3690 outFile 0x2253540
with garbage-looking values for the pointers.
assign simulation
New categories assigned: simulation
@mdhildreth,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks
The second call to EndOfRun()
can come from
https://github.com/cms-sw/cmssw/blob/3031b83072310f0ce4053f45a3a01e91768ac13d/SimG4Core/Application/src/RunManagerMTWorker.cc#L169-L172
when the destructor runs in a different thread than the RunManagerMTWorker::endRun()
.
@civanch Can you comment how necessary it is to call the code in
https://github.com/cms-sw/cmssw/blob/3031b83072310f0ce4053f45a3a01e91768ac13d/SimG4Core/Application/src/RunManagerMTWorker.cc#L433-L449
when destructing OscarMTProducer
for the thread(s) where the RunManagerMTWorker::endRun()
was not called in?
@makortel , may be it is not what destructor should do. Better to left something not finished if terminateRun() is not called. I will try to the fix into pending PR #30652 or better make a new one.
Thanks @civanch !
@mrodozov , can you, please, confirm that it is fixed now?
Yes, the Validation/Geometry
unit test has been succeeding since CMSSW_11_2_X_2020-07-25-1100, which was the first one to include #30887.
+1
I'm closing this it's resolved
The test failing in https://cmssdt.cern.ch/SDT/cgi-bin/logreader/cc8_aarch64_gcc8/CMSSW_11_2_X_2020-07-20-2300/unitTestLogs/Validation/Geometry#/ and https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc820/CMSSW_11_2_X_2020-07-20-2300/unitTestLogs/Validation/Geometry#/
has the following log:
It's very consistent in the tests but not when ran by hand.