cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.09k stars 4.33k forks source link

Job failures in NANO step #30836

Open makortel opened 4 years ago

makortel commented 4 years ago

IB CMSSW_11_2_X_2020-07-19-2300 shows many failures in NANO step along

Rivet.AnalysisHandler: INFO  Using named weights
==============================================================
========     HiggsTemplateCrossSections Initialization     =========
==============================================================
Rivet.Analysis.HiggsTemplateCrossSections: WARN  No HIGGSPRODMODE shell variable found. Needed when running Rivet stand-alone.
==============================================================
========             Higgs prod mode 0              =========
========          Sucessful Initialization           =========
==============================================================
Rivet.Analysis.HiggsTemplateCrossSections: WARN  Unkown Higgs production mechanism. Cannot classify event. Classification for all events will most likely fail.
Rivet.AnalysisHandler: INFO  Using named weights
Error in RivetAnalysis::init method: No projections registered for parent 0x2b0f50c7a600
----- Begin Fatal Exception 20-Jul-2020 05:16:37 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Rethrowing an exception that happened on a different thread.
   [1] Reading branch edmTriggerResults_TriggerResults__HLT.
   Additional Info:
      [a] Fatal Root Error: @SUB=TStorageFactoryFile::ReadBufferAsync
Cannot read from a file that is not open

----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc820/CMSSW_11_2_X_2020-07-19-2300/pyRelValMatrixLogs/run/1302.18_ProdTTbar_13UP18+ProdTTbar_13UP18+DIGIUP18PROD1+RECOPRODUP18+MINIAODMCUP18+NANOPRODUP18/step5_ProdTTbar_13UP18+ProdTTbar_13UP18+DIGIUP18PROD1+RECOPRODUP18+MINIAODMCUP18+NANOPRODUP18.log#/ (the actual branch to be read varies between workflows)

cmsbuild commented 4 years ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 4 years ago

The Rivet update (that shows errors in the logs) https://github.com/cms-sw/cmsdist/pull/6028 was included in that IB, could that be the cause?

mrodozov commented 4 years ago

@intrepid42 does this has something to do with the update itself ot its the FileReadError exception indeed ?

mrodozov commented 4 years ago

@makortel I don't know whats happening but I just ran

1302.18_ProdTTbar_13UP18+ProdTTbar_13UP18+DIGIUP18PROD1+RECOPRODUP18+MINIAODMCUP18+NANOPRODUP18 Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED Step4-PASSED - time date Mon Jul 20 16:00:20 2020-date Mon Jul 20 15:48:04 2020; exit: 0 0 0 0 0 1 1 1 1 1 tests passed, 0 0 0 0 0 failed

wihtout an error :/

Dr15Jones commented 4 years ago

@mrodozov how many threads did you use when you ran your test?

makortel commented 4 years ago

It (1302.18) fails for me in CMSSW_11_2_X_2020-07-20-1100 already on a single thread using two threads and streams (which appears to be the default even if the rest of the workflow is run with single thread).

makortel commented 4 years ago

In my singletwo-thread test above, I get the failure on the second event, and nothing on file read

Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 1 at 20-Jul-2020 16:23:16.599 CEST
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 0 at 20-Jul-2020 16:23:16.599 CEST
Rivet.AnalysisHandler: INFO  Using named weights
==============================================================
========     HiggsTemplateCrossSections Initialization     =========
==============================================================
Rivet.Analysis.HiggsTemplateCrossSections: WARN  No HIGGSPRODMODE shell variable found. Needed when running Rivet stand-alone.
==============================================================
========             Higgs prod mode 0              =========
========          Sucessful Initialization           =========
==============================================================
Rivet.Analysis.HiggsTemplateCrossSections: WARN  Unkown Higgs production mechanism. Cannot classify event. Classification for all events will most likely fail.
Rivet.AnalysisHandler: INFO  Using named weights
Error in RivetAnalysis::init method: No projections registered for parent 0x7fbab1a55780

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

...
Thread 3 (Thread 0x7fbb001ff700 (LWP 20541)):
#0  0x00007fbb36fdf85d in nanosleep () from /lib64/libc.so.6
#1  0x00007fbb36fdf6f4 in sleep () from /lib64/libc.so.6
#2  0x00007fbb2ff6c510 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007fbb3700fbdd in fdatasync () from /lib64/libc.so.6
#5  0x00007fbb32661c86 in File::flush() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libUtilitiesStorageFactory.so
#6  0x00007fbb3265ab5e in StorageAccountProxy::flush() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libUtilitiesStorageFactory.so
#7  0x00007fbb2f7d07cc in TStorageFactoryFile::SysSync(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libIOPoolTFileAdaptor.so
#8  0x00007fbb38f081e7 in TFile::Flush() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libRIO.so
#9  0x00007fbb38f143e5 in TFile::WriteHeader() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libRIO.so
#10 0x00007fbb38f13599 in TFile::Close(char const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libRIO.so
#11 0x00007fbb389b4601 in (anonymous namespace)::R__ListSlowClose(TList*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libCore.so
#12 0x00007fbb389b4f14 in TROOT::CloseFiles() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libCore.so
#13 0x00007fbb36f53ce9 in __run_exit_handlers () from /lib64/libc.so.6
#14 0x00007fbb36f53d37 in exit () from /lib64/libc.so.6
#15 0x00007fbaf2597df7 in Rivet::AnalysisHandler::init(HepMC::GenEvent const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) [clone .cold.1549] () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libRivet.so
#16 0x00007fbaf2617fbc in Rivet::AnalysisHandler::analyze(HepMC::GenEvent const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/external/slc7_amd64_gcc820/lib/libRivet.so
#17 0x00007fbafe9b3c95 in ParticleLevelProducer::produce(edm::Event&, edm::EventSetup const&) () at /build/mkortela/debug/CMSSW_11_2_X_2020-07-20-1100/src/GeneratorInterface/RivetInterface/plugins/ParticleLevelProducer.cc:119
#18 0x00007fbb39bf5309 in edm::one::EDProducerBase::doEvent(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#19 0x00007fbb39bd7e83 in edm::WorkerT<edm::one::EDProducerBase>::implDo(edm::EventPrincipal const&, edm::EventSetupImpl const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#20 0x00007fbb39b3ec9a in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#21 0x00007fbb39b3ee65 in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#22 0x00007fbb39b3f16b in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetupImpl const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#23 0x00007fbb39b411b5 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#24 0x00007fbb39b412d1 in edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-20-1100/lib/slc7_amd64_gcc820/libFWCoreFramework.so
#25 0x00007fbb382c1bfd in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::process_bypass_loop (this=this@entry=0x7fbb33cfbe00, context_guard=..., t=t@entry=0x7fbafea81c40, isolation=isolation@entry=0) at ../../src/tbb/custom_scheduler.h:393
#26 0x00007fbb382c1ef5 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7fbb33cfbe00, parent=..., child=<optimized out>) at ../../include/tbb/task.h:1003
#27 0x00007fbb382bb9ff in tbb::internal::arena::process (this=0x7fbb33dbba00, s=...) at ../../src/tbb/arena.cpp:196
#28 0x00007fbb382ba3d3 in tbb::internal::market::process (this=0x7fbb33dc3580, j=...) at ../../src/tbb/market.cpp:667
#29 0x00007fbb382b67dc in tbb::internal::rml::private_worker::run (this=0x7fbb3526e180) at ../../src/tbb/private_server.cpp:266
#30 0x00007fbb382b69e9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:219
#31 0x00007fbb372efea5 in start_thread () from /lib64/libpthread.so.0
#32 0x00007fbb370188dd in clone () from /lib64/libc.so.6
makortel commented 4 years ago

assign generators

cmsbuild commented 4 years ago

New categories assigned: generators

@alberto-sanchez,@SiewYan,@qliphy,@GurpreetSinghChahal,@mkirsano,@agrohsje you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 4 years ago

It's callng exit?!

makortel commented 4 years ago

It (1302.18) fails for me in CMSSW_11_2_X_2020-07-20-1100 already on a single thread. ... In my single-thread test above

I was wrong, the NANO step appears to run on two threads even if the rest of the workflow runs with single thread.

makortel commented 4 years ago

It's callng exit?!

Nice catch, that is also visible in the log I linked to in the issue description

#20 0x00002b0f20515426 in TCling::~TCling() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-07-19-2300/external/slc7_amd64_gcc820/lib/libCling.so
#21 0x00002b0f205156e9 in TCling::~TCling() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-07-19-2300/external/slc7_amd64_gcc820/lib/libCling.so
#22 0x00002b0f158dce01 in TROOT::~TROOT() () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-07-19-2300/external/slc7_amd64_gcc820/lib/libCore.so
#23 0x00002b0f17124ce9 in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00002b0f17124d37 in exit () from /lib64/libc.so.6
#25 0x00002b0f51661df7 in Rivet::AnalysisHandler::init(HepMC::GenEvent const&amp, std::vector&lt;std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt;, std::allocator&lt;std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt; &gt; &gt; const&amp) [clone .cold.1549] () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-07-19-2300/external/slc7_amd64_gcc820/lib/libRivet.so
#26 0x00002b0f516e1fbc in Rivet::AnalysisHandler::analyze(HepMC::GenEvent const&amp) () from /cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_2_X_2020-07-19-2300/external/slc7_amd64_gcc820/lib/libRivet.so
#27 0x00002b0f50976c95 in ParticleLevelProducer::produce(edm::Event&amp, edm::EventSetup const&amp) () from /cvmfs/cms-ib.cern.ch/nweek-02638/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-07-19-0000/lib/slc7_amd64_gcc820/pluginGeneratorInterfaceRivetInterface_plugins.so
makortel commented 4 years ago

I seems to me that the culprit call to exit() is here https://gitlab.com/hepcedar/rivet/-/blob/rivet-3.1.2/src/Core/AnalysisHandler.cc#L131-139

makortel commented 4 years ago

There is also another call to exit() https://gitlab.com/hepcedar/rivet/-/blob/rivet-3.1.2/src/Core/AnalysisHandler.cc#L110-114

makortel commented 4 years ago

The calls to exit() are bad, but, on the other hand, they do report an actual error that presumably should have failed the job anyway. I suppose to options are to either revert https://github.com/cms-sw/cmsdist/pull/6028, try to understand the actual error better, or disable Rivet in the NANO step. (further ideas?)

But we should act rather quickly because the errors affect PR testing. So maybe we should revert https://github.com/cms-sw/cmsdist/pull/6028 until the actual error is understood?

silviodonato commented 4 years ago

@intrepid42 I'm temporary reverting https://github.com/cms-sw/cmsdist/pull/6028 and https://github.com/cms-sw/cmssw/pull/30590 looking forward a fix of this error (see https://github.com/cms-sw/cmsdist/pull/6103 and https://github.com/cms-sw/cmssw/pull/30838)

mseidel42 commented 4 years ago

Ok, I can investigate later this week.

mseidel42 commented 4 years ago

Hi all,

I think the problem is the rudimentary thread support new in Rivet 3.1.2: https://gitlab.com/hepcedar/rivet/-/commit/8fd4a0a12e0e630d1e127d806883502e6a95b72c#fd462cb6867ecfc08d0e1566dcf449b15c31ae5e_97_108

What seems to happen is that the AnalysisHandler+ProjectionHandler initiliazation is done only for thread 1 (and the AnalysisHandler gets marked as initalized globally), and afterwards thread 2 tries to access it's own ProjectionHandler which was uninitialized and thus does not contain any projections.

I converted the producer to edm::stream but this does not change the situation. It works when I create the AnalysisHandler within produce -> it is initiliazed and used only for one event at a time and then discarded. Unfortunately, this adds some printout for every event.

Is there any way to avoid keep multiple streams completely separate, without sharing any global objects? Or is there a way to fill a map<edm::StreamID,AnalysisHandler> in beginStream and then run something like analysishandler_[edm::StreamID]->doSomething() in the produce function?

Cheers, Markus

makortel commented 4 years ago

Thanks @intrepid42 for the details.

Or is there a way to fill a map<edm::StreamID,AnalysisHandler> in beginStream and then run something like analysishandler_[edm::StreamID]->doSomething() in the produce function?

That happens automatically with edm::stream modules. Each edm::stream module instance gets called only by a single stream, so the class member data acts already like map<edm::StreamID, T>::at(edm::StreamID).

Is there any way to avoid keep multiple streams completely separate, without sharing any global objects?

By construction in C++ global objects are global, i.e. shared by all the threads, within a process.

I think the problem is the rudimentary thread support new in Rivet 3.1.2: ... What seems to happen is that the AnalysisHandler+ProjectionHandler initiliazation is done only for thread 1 (and the AnalysisHandler gets marked as initalized globally), and afterwards thread 2 tries to access it's own ProjectionHandler which was uninitialized and thus does not contain any projections.

Does ProjectionHandler need some special initialization? Your link https://gitlab.com/hepcedar/rivet/-/commit/8fd4a0a12e0e630d1e127d806883502e6a95b72c#fd462cb6867ecfc08d0e1566dcf449b15c31ae5e_97_108 would suggest to me that it would be initialized on the fly whenever it's asked first time by a thread. Or is the intended pattern perhaps that somewhere else all threads call ProjectionHandller::getInstance() and do some post-construction initialization before going into the "event loop"?

Based on a quick look (and zero earlier Rivet knowledge) this pattern (singleton changed to a thread-local singleton) has caused us headache every time we encounter one. If I understood correctly what the comment

    // @todo the following is a temporary fix to allow for basic
    // threading. The proper fix will involve the AnalysisHandler
    // having it's own ProjectionHandler object.

indicates, that solution would be much more preferred. Do you know what would be their timescale to get there?

mseidel42 commented 4 years ago

I asked the Rivet authors for ideas: https://gitlab.com/hepcedar/rivet/-/issues/84

But, it seems that with a bit different arrangment of initialization things are working fine, even as stream modules now: https://github.com/cms-sw/cmssw/pull/30888

makortel commented 4 years ago

I asked the Rivet authors for ideas: https://gitlab.com/hepcedar/rivet/-/issues/84

Thanks!

But, it seems that with a bit different arrangment of initialization things are working fine, even as stream modules now: #30888

I can see that it might seem to work, but I suspect that approach is not compatible in general with the way we use TBB. I gave more detailed comments in the PR.