Open makortel opened 4 years ago
A new Issue was created by @makortel Matti Kortelainen.
@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
The Rivet update (that shows errors in the logs) https://github.com/cms-sw/cmsdist/pull/6028 was included in that IB, could that be the cause?
@intrepid42 does this has something to do with the update itself ot its the FileReadError exception indeed ?
@makortel I don't know whats happening but I just ran
1302.18_ProdTTbar_13UP18+ProdTTbar_13UP18+DIGIUP18PROD1+RECOPRODUP18+MINIAODMCUP18+NANOPRODUP18 Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED Step4-PASSED - time date Mon Jul 20 16:00:20 2020-date Mon Jul 20 15:48:04 2020; exit: 0 0 0 0 0 1 1 1 1 1 tests passed, 0 0 0 0 0 failed
wihtout an error :/
@mrodozov how many threads did you use when you ran your test?
It (1302.18) fails for me in CMSSW_11_2_X_2020-07-20-1100 already on a single thread using two threads and streams (which appears to be the default even if the rest of the workflow is run with single thread).
In my singletwo-thread test above, I get the failure on the second event, and nothing on file read
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 1 at 20-Jul-2020 16:23:16.599 CEST
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 20-Jul-2020 16:23:16.599 CEST
Rivet.AnalysisHandler: INFO Using named weights
==============================================================
======== HiggsTemplateCrossSections Initialization =========
==============================================================
Rivet.Analysis.HiggsTemplateCrossSections: WARN No HIGGSPRODMODE shell variable found. Needed when running Rivet stand-alone.
==============================================================
======== Higgs prod mode 0 =========
======== Sucessful Initialization =========
==============================================================
Rivet.Analysis.HiggsTemplateCrossSections: WARN Unkown Higgs production mechanism. Cannot classify event. Classification for all events will most likely fail.
Rivet.AnalysisHandler: INFO Using named weights
Error in RivetAnalysis::init method: No projections registered for parent 0x7fbab1a55780
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
...
Thread 3 (Thread 0x7fbb001ff700 (LWP 20541)):
#0 0x00007fbb36fdf85d in nanosleep () from /lib64/libc.so.6
#1 0x00007fbb36fdf6f4 in sleep () from /lib64/libc.so.6
#2 0x00007fbb2ff6c510 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007fbb3700fbdd in fdatasync () from /lib64/libc.so.6
#5 0x00007fbb32661c86 in File::flush() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libUtilitiesStorageFactory.so
#6 0x00007fbb3265ab5e in StorageAccountProxy::flush() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libUtilitiesStorageFactory.so
#7 0x00007fbb2f7d07cc in TStorageFactoryFile::SysSync(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libIOPoolTFileAdaptor.so
#8 0x00007fbb38f081e7 in TFile::Flush() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libRIO.so
#9 0x00007fbb38f143e5 in TFile::WriteHeader() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libRIO.so
#10 0x00007fbb38f13599 in TFile::Close(char const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libRIO.so
#11 0x00007fbb389b4601 in (anonymous namespace)::R__ListSlowClose(TList*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libCore.so
#12 0x00007fbb389b4f14 in TROOT::CloseFiles() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libCore.so
#13 0x00007fbb36f53ce9 in __run_exit_handlers () from /lib64/libc.so.6
#14 0x00007fbb36f53d37 in exit () from /lib64/libc.so.6
#15 0x00007fbaf2597df7 in Rivet::AnalysisHandler::init(HepMC::GenEvent const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) [clone .cold.1549] () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libRivet.so
#16 0x00007fbaf2617fbc in Rivet::AnalysisHandler::analyze(HepMC::GenEvent const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libRivet.so
#17 0x00007fbafe9b3c95 in ParticleLevelProducer::produce(edm::Event&, edm::EventSetup const&) () at /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/src/GeneratorInterface/RivetInterface/plugins/ParticleLevelProducer.cc:119
#18 0x00007fbb39bf5309 in edm::one::EDProducerBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#19 0x00007fbb39bd7e83 in edm::WorkerT<edm::one::EDProducerBase>::implDo(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#20 0x00007fbb39b3ec9a in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#21 0x00007fbb39b3ee65 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#22 0x00007fbb39b3f16b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#23 0x00007fbb39b411b5 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#24 0x00007fbb39b412d1 in edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#25 0x00007fbb382c1bfd in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7fbb33cfbe00, context_guard=..., t=t@entry=0x7fbafea81c40, isolation=isolation@entry=0) at ../../src/tbb/custom_scheduler.h:393
#26 0x00007fbb382c1ef5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7fbb33cfbe00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#27 0x00007fbb382bb9ff in tbb::internal::arena::process (this=0x7fbb33dbba00, s=...) at ../../src/tbb/arena.cpp:196
#28 0x00007fbb382ba3d3 in tbb::internal::market::process (this=0x7fbb33dc3580, j=...) at ../../src/tbb/market.cpp:667
#29 0x00007fbb382b67dc in tbb::internal::rml::private_worker::run (this=0x7fbb3526e180) at ../../src/tbb/private_server.cpp:266
#30 0x00007fbb382b69e9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#31 0x00007fbb372efea5 in start_thread () from /lib64/libpthread.so.0
#32 0x00007fbb370188dd in clone () from /lib64/libc.so.6
assign generators
New categories assigned: generators
@alberto-sanchez,@SiewYan,@qliphy,@GurpreetSinghChahal,@mkirsano,@agrohsje you have been requested to review this Pull request/Issue and eventually sign? Thanks
It's callng exit?!
It (1302.18) fails for me in CMSSW_11_2_X_2020-07-20-1100 already on a single thread. ... In my single-thread test above
I was wrong, the NANO step appears to run on two threads even if the rest of the workflow runs with single thread.
It's callng exit?!
Nice catch, that is also visible in the log I linked to in the issue description
#20 0x00002b0f20515426 in TCling::~TCling() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-07-19-2300/external/slc7_amd64_gcc820/lib/libCling.so
#21 0x00002b0f205156e9 in TCling::~TCling() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-07-19-2300/external/slc7_amd64_gcc820/lib/libCling.so
#22 0x00002b0f158dce01 in TROOT::~TROOT() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-07-19-2300/external/slc7_amd64_gcc820/lib/libCore.so
#23 0x00002b0f17124ce9 in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00002b0f17124d37 in exit () from /lib64/libc.so.6
#25 0x00002b0f51661df7 in Rivet::AnalysisHandler::init(HepMC::GenEvent const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) [clone .cold.1549] () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-07-19-2300/external/slc7_amd64_gcc820/lib/libRivet.so
#26 0x00002b0f516e1fbc in Rivet::AnalysisHandler::analyze(HepMC::GenEvent const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-07-19-2300/external/slc7_amd64_gcc820/lib/libRivet.so
#27 0x00002b0f50976c95 in ParticleLevelProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02638/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-19-0000/lib/slc7_amd64_gcc820/pluginGeneratorInterfaceRivetInterface_plugins.so
I seems to me that the culprit call to exit()
is here
https://gitlab.com/hepcedar/rivet/-/blob/rivet-3.1.2/src/Core/AnalysisHandler.cc#L131-139
There is also another call to exit()
https://gitlab.com/hepcedar/rivet/-/blob/rivet-3.1.2/src/Core/AnalysisHandler.cc#L110-114
The calls to exit()
are bad, but, on the other hand, they do report an actual error that presumably should have failed the job anyway. I suppose to options are to either revert https://github.com/cms-sw/cmsdist/pull/6028, try to understand the actual error better, or disable Rivet in the NANO step. (further ideas?)
But we should act rather quickly because the errors affect PR testing. So maybe we should revert https://github.com/cms-sw/cmsdist/pull/6028 until the actual error is understood?
@intrepid42 I'm temporary reverting https://github.com/cms-sw/cmsdist/pull/6028 and https://github.com/cms-sw/cmssw/pull/30590 looking forward a fix of this error (see https://github.com/cms-sw/cmsdist/pull/6103 and https://github.com/cms-sw/cmssw/pull/30838)
Ok, I can investigate later this week.
Hi all,
I think the problem is the rudimentary thread support new in Rivet 3.1.2: https://gitlab.com/hepcedar/rivet/-/commit/8fd4a0a12e0e630d1e127d806883502e6a95b72c#fd462cb6867ecfc08d0e1566dcf449b15c31ae5e_97_108
What seems to happen is that the AnalysisHandler+ProjectionHandler initiliazation is done only for thread 1 (and the AnalysisHandler gets marked as initalized globally), and afterwards thread 2 tries to access it's own ProjectionHandler which was uninitialized and thus does not contain any projections.
I converted the producer to edm::stream but this does not change the situation.
It works when I create the AnalysisHandler within produce
-> it is initiliazed and used only for one event at a time and then discarded. Unfortunately, this adds some printout for every event.
Is there any way to avoid keep multiple streams completely separate, without sharing any global objects?
Or is there a way to fill a map<edm::StreamID,AnalysisHandler>
in beginStream
and then run something like analysishandler_[edm::StreamID]->doSomething()
in the produce
function?
Cheers, Markus
Thanks @intrepid42 for the details.
Or is there a way to fill a
map<edm::StreamID,AnalysisHandler>
inbeginStream
and then run something likeanalysishandler_[edm::StreamID]->doSomething()
in theproduce
function?
That happens automatically with edm::stream
modules. Each edm::stream
module instance gets called only by a single stream, so the class member data acts already like map<edm::StreamID, T>::at(edm::StreamID)
.
Is there any way to avoid keep multiple streams completely separate, without sharing any global objects?
By construction in C++ global objects are global, i.e. shared by all the threads, within a process.
I think the problem is the rudimentary thread support new in Rivet 3.1.2: ... What seems to happen is that the AnalysisHandler+ProjectionHandler initiliazation is done only for thread 1 (and the AnalysisHandler gets marked as initalized globally), and afterwards thread 2 tries to access it's own ProjectionHandler which was uninitialized and thus does not contain any projections.
Does ProjectionHandler
need some special initialization? Your link
https://gitlab.com/hepcedar/rivet/-/commit/8fd4a0a12e0e630d1e127d806883502e6a95b72c#fd462cb6867ecfc08d0e1566dcf449b15c31ae5e_97_108
would suggest to me that it would be initialized on the fly whenever it's asked first time by a thread. Or is the intended pattern perhaps that somewhere else all threads call ProjectionHandller::getInstance()
and do some post-construction initialization before going into the "event loop"?
Based on a quick look (and zero earlier Rivet knowledge) this pattern (singleton changed to a thread-local singleton) has caused us headache every time we encounter one. If I understood correctly what the comment
// @todo the following is a temporary fix to allow for basic
// threading. The proper fix will involve the AnalysisHandler
// having it's own ProjectionHandler object.
indicates, that solution would be much more preferred. Do you know what would be their timescale to get there?
I asked the Rivet authors for ideas: https://gitlab.com/hepcedar/rivet/-/issues/84
But, it seems that with a bit different arrangment of initialization things are working fine, even as stream modules now: https://github.com/cms-sw/cmssw/pull/30888
I asked the Rivet authors for ideas: https://gitlab.com/hepcedar/rivet/-/issues/84
Thanks!
But, it seems that with a bit different arrangment of initialization things are working fine, even as stream modules now: #30888
I can see that it might seem to work, but I suspect that approach is not compatible in general with the way we use TBB. I gave more detailed comments in the PR.
IB CMSSW_11_2_X_2020-07-19-2300 shows many failures in NANO step along
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc820/CMSSW_11_2_X_2020-07-19-2300/pyRelValMatrixLogs/run/1302.18_ProdTTbar_13UP18+ProdTTbar_13UP18+DIGIUP18PROD1+RECOPRODUP18+MINIAODMCUP18+NANOPRODUP18/step5_ProdTTbar_13UP18+ProdTTbar_13UP18+DIGIUP18PROD1+RECOPRODUP18+MINIAODMCUP18+NANOPRODUP18.log#/ (the actual branch to be read varies between workflows)