cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

LumiProducer call to frontier is not thread safe #18501

Closed Dr15Jones closed 7 years ago

Dr15Jones commented 7 years ago

The framework now uses multiple threads to process the global begin Run transition. During that transition LumiProducer::beginRun is called, and it directly calls coral::FrontierAccess::Query::execute(). Unfortunately, modules which request data from the conditions database also trigger calls to frontier, which leads to race conditions. See https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc7_amd64_gcc630/CMSSW_9_1_X_2017-04-26-2300/pyRelValMatrixLogs/run/1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step2_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log

cmsbuild commented 7 years ago

A new Issue was created by @Dr15Jones Chris Jones.

@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

Dr15Jones commented 7 years ago

If the call to coral::FrontierAccess::Query::execute() were moved out of LumiProducer::beginRun and into an ESProducer, the EventSetup lock would protect against concurrent access.
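
A minimal sketch of why that move would help, assuming a simplified model in which every data provider runs under one store-wide lock. `SetupStore`, `registerProvider`, `get` and `runStoreDemo` are hypothetical names for illustration only and are not the CMSSW EventSetup API; the unsynchronized counter stands in for a client library that is not thread safe.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical miniature of a setup-data store: every provider runs under one lock,
// so code placed inside a provider is serialized without locking on its own.
class SetupStore {
public:
  using Provider = std::function<int()>;

  void registerProvider(const std::string& key, Provider p) { providers_[key] = std::move(p); }

  int get(const std::string& key) {
    std::lock_guard<std::mutex> guard(lock_);  // the single store-wide lock
    return providers_.at(key)();
  }

private:
  std::mutex lock_;
  std::map<std::string, Provider> providers_;
};

// Threads request two different products concurrently; both providers touch the
// same unsynchronized counter. Returns the final value of that counter.
int runStoreDemo(int nThreads, int callsPerThread) {
  int unsafeClientCalls = 0;  // only ever touched from inside a provider
  SetupStore store;
  store.registerProvider("lumi", [&] { return ++unsafeClientCalls; });
  store.registerProvider("conditions", [&] { return ++unsafeClientCalls; });

  std::vector<std::thread> workers;
  for (int i = 0; i < nThreads; ++i) {
    workers.emplace_back([&store, i, callsPerThread] {
      const std::string key = (i % 2 == 0) ? "lumi" : "conditions";
      for (int j = 0; j < callsPerThread; ++j) store.get(key);
    });
  }
  for (auto& w : workers) w.join();
  return unsafeClientCalls;
}
```

The point of the design is that the module itself carries no locking code: anything that talks to the unsafe client only has to live behind the store.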

Dr15Jones commented 7 years ago

assign core

cmsbuild commented 7 years ago

New categories assigned: core

@Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 7 years ago

assign db

cmsbuild commented 7 years ago

New categories assigned: db

@ggovi you have been requested to review this Pull request/Issue and eventually sign? Thanks

Dr15Jones commented 7 years ago

As an intermediate step, we could lock access to the lumi::service::DBService by getting access to the mutex used to protect the conditions system.

Dr15Jones commented 7 years ago

@DrDaveD this has to do with concurrent access to the frontier client.

Dr15Jones commented 7 years ago

I took a look at the coral interface to Frontier. The code uses a per-Connection lock. However, the underlying assumption is that the third-party code being called is at least thread-friendly. This is not the case with the frontier client interface, where multiple connections touch the same memory structures.

I think the best way to fix this problem is to patch coral::FrontierAccess::Connection to use a mutex shared across all Connection instances (rather than having one per instance, as is the case now).
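
To illustrate the idea in isolation, here is a small sketch with a class-wide mutex. It uses `std::mutex` in place of `boost::mutex` so that it is self-contained, and `fake_client`, `touchSharedState` and `runSharedLockDemo` are made-up stand-ins; the real frontier client internals are not shown in this thread.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for state inside a third-party client that every connection touches.
// (Hypothetical; not the real frontier client.)
namespace fake_client {
  int sharedCounter = 0;
  void touchSharedState() { ++sharedCounter; }  // not thread safe on its own
}

class Connection {
public:
  void execute() {
    // One lock for the whole class, so calls made through different
    // Connection objects are serialized as well.
    std::lock_guard<std::mutex> guard(s_lock);
    fake_client::touchSharedState();
  }

private:
  static std::mutex s_lock;  // shared by all instances
};

std::mutex Connection::s_lock;

// Each thread owns a separate Connection; returns the final shared count.
int runSharedLockDemo(int nThreads, int callsPerThread) {
  fake_client::sharedCounter = 0;
  std::vector<std::thread> workers;
  for (int i = 0; i < nThreads; ++i) {
    workers.emplace_back([callsPerThread] {
      Connection c;
      for (int j = 0; j < callsPerThread; ++j) c.execute();
    });
  }
  for (auto& w : workers) w.join();
  return fake_client::sharedCounter;
}
```

With a per-instance mutex each thread here would lock only its own object, and the increments on the shared counter would be unprotected.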

Do others agree?

ggovi commented 7 years ago

@Dr15Jones quoting @xiezhen "the LumiProducer is no more supported. You can just remove all the code from inside, leaving empty structures, so that they can still call."

ggovi commented 7 years ago

@Dr15Jones on the coral Frontier fix, I guess it would be good to have it anyway...

DrDaveD commented 7 years ago

Having a global lock in CORAL would prevent ever having concurrent threads using the frontier client. We have a long term plan to put a lock inside of frontier client and release the lock while the client is waiting on I/O. I think that's a better solution because it would allow interleaving I/O.
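
A rough sketch of that pattern, assuming the client guards its shared structures with one internal mutex and drops it around the blocking wait. All names (`sketch::clientLock`, `waitForServer`, `query`, `runInterleavedDemo`) are hypothetical, and the sleep is only a placeholder for network I/O.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

namespace sketch {
  std::mutex clientLock;     // guards requestsPrepared and any other shared state
  int requestsPrepared = 0;

  // Placeholder for a blocking network wait.
  void waitForServer() { std::this_thread::sleep_for(std::chrono::milliseconds(20)); }

  void query() {
    std::unique_lock<std::mutex> lock(clientLock);
    ++requestsPrepared;      // touch shared state while holding the lock
    lock.unlock();           // let other threads proceed during the wait
    waitForServer();
    lock.lock();             // re-acquire before touching shared state again
    // ... a real client would decode the response into shared structures here ...
  }

  // Starts nThreads concurrent queries and returns how many were prepared.
  int runInterleavedDemo(int nThreads) {
    requestsPrepared = 0;
    std::vector<std::thread> workers;
    for (int i = 0; i < nThreads; ++i) workers.emplace_back(query);
    for (auto& w : workers) w.join();
    return requestsPrepared;
  }
}
```

Compared with a global lock held for the whole call, the waits of different threads can overlap, while the shared state is still only touched under the lock.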

Dr15Jones commented 7 years ago

@DrDaveD what is the time scale for making the frontier client thread safe? I was hoping to get this patch (which can later be removed) in today since the problem already caused jobs in the IB to crash.

Dr15Jones commented 7 years ago

@ggovi Only the LumiProducer and ExpressLumiProducer put LumiSummary and LumiDetails objects into the edm::LuminosityBlock, and we definitely have code in CMSSW which reads those objects:

[cdj@cmslpc26 src]$ git grep 'LumiSummary' | grep Handle
DQM/TrackingMonitor/src/GetLumi.cc:  edm::Handle<LumiSummary> lumiSummary_;
DQMServices/Components/plugins/DQMLumiMonitor.cc:  edm::Handle<LumiSummary> lumiSummary_;
DataFormats/FWLite/scripts/edmLumisInFiles.py:        handle = Handle ('LumiSummary')
HLTrigger/HLTanalyzers/src/EventHeader.cc:  edm::Handle<LumiSummary> lumiSummary; 
L1Trigger/CSCTrackFinder/test/analysis/LCTOccupancies.cc:   edm::Handle<LumiSummary> lumiSummary;
PhysicsTools/FWLite/bin/FWLiteLumiAccess.cc:      fwlite::Handle<LumiSummary> summary;
RecoLocalTracker/SiPixelClusterizer/test/TestClusters.cc:  edm::Handle<LumiSummary> lumi;
RecoLocalTracker/SiPixelClusterizer/test/TestPixTracks.cc:  edm::Handle<LumiSummary> lumi;
RecoLocalTracker/SiPixelClusterizer/test/TestWithTracks.cc:  edm::Handle<LumiSummary> lumi;
RecoLuminosity/LumiProducer/plugins/LumiCalculator.cc:  edm::Handle<LumiSummary> lumiSummary;
RecoLuminosity/LumiProducer/plugins/LumiCalculator.cc:  edm::Handle<LumiSummaryRunHeader> lumiSummaryRH;
RecoLuminosity/LumiProducer/test/TestExpressLumiProducer.cc:    Handle<LumiSummary> lumiSummary;
RecoLuminosity/LumiProducer/test/TestLumiCorrectionSource.cc:    edm::Handle<LumiSummary> lumisummary;
RecoLuminosity/LumiProducer/test/TestLumiProducer.cc:    Handle<LumiSummary> lumiSummary;
[cdj@cmslpc26 src]$ git grep 'LumiDetails' | grep Handle
Calibration/IsolatedParticles/plugins/IsoTrig.cc:  edm::Handle<LumiDetails> Lumid;
Calibration/IsolatedParticles/plugins/StudyHLT.cc:  edm::Handle<LumiDetails> Lumid;
DPGAnalysis/SiStripTools/plugins/TrackCount.cc:  edm::Handle<LumiDetails> ld;
DPGAnalysis/SiStripTools/src/DigiLumiCorrHistogramMaker.cc:  edm::Handle<LumiDetails> ld;
DQM/TrackingMonitor/src/GetLumi.cc:  edm::Handle<LumiDetails> lumi;
DQMServices/Components/plugins/DQMLumiMonitor.cc:  Handle<LumiDetails> lumiDetails;
RecoLocalTracker/SiPixelClusterizer/test/TestClusters.cc:  edm::Handle<LumiDetails> ld;
RecoLuminosity/LumiProducer/test/TestExpressLumiProducer.cc:    Handle<LumiDetails> lumiDetails;
RecoLuminosity/LumiProducer/test/TestLumiProducer.cc:    Handle<LumiDetails> lumiDetails;
Validation/RecoVertex/src/VertexHistogramMaker.cc:  edm::Handle<LumiDetails> ld;

So even though LumiProducer isn't being maintained, it appears that it is needed.

Dr15Jones commented 7 years ago

My proposed changes are

--- coral/FrontierAccess/src/Connection.h   2017-04-28 09:43:29.945972505 -0500
+++ ../coral/src/FrontierAccess/src/Connection.h    2011-03-22 05:36:50.000000000 -0500
@@ -91,7 +91,7 @@
       /// The type converter
       TypeConverter*                                   m_typeConverter;
       /// The connection lock
-      static boost::mutex s_lock;
+      mutable boost::mutex m_lock;
     };
   } // FrontierAccess namespace
 } // coral namespace
--- coral/FrontierAccess/src/Connection.cpp 2017-04-28 09:45:26.111480886 -0500
+++ ../coral/src/FrontierAccess/src/Connection.cpp  2011-03-22 05:36:50.000000000 -0500
@@ -20,7 +20,6 @@
 #include "ErrorHandler.h"
 #include "Session.h"
 #include "TypeConverter.h"
-boost::mutex coral::FrontierAccess::Connection::s_lock{};

 coral::FrontierAccess::Connection::Connection( const coral::FrontierAccess::DomainProperties& domainProperties, const std::string& connectionString )
   : m_connection(0)
@@ -29,6 +28,7 @@
   , m_connected( false )
   , m_serverVersion( "" )
   , m_typeConverter( new coral::FrontierAccess::TypeConverter( m_domainProperties ) )
+  , m_lock()
 {
   if( this->m_typeConverter )
     this->m_typeConverter->reset( 10 );
@@ -103,7 +103,7 @@
   log << coral::Verbose << "Connecting to Frontier server using URL: " << this->m_connectionString << coral::MessageStream::endmsg;

   {
-    boost::mutex::scoped_lock lock( s_lock );
+    boost::mutex::scoped_lock lock( m_lock );

     // Attaching the server
     m_connection = new frontier::Connection( m_connectionString );
@@ -126,7 +126,7 @@
                                         m_domainProperties,
                                         m_connectionString,
                                         *m_connection,
-                                        s_lock,
+                                        m_lock,
                                         schemaName,
                                         *m_typeConverter );
   return session;

Dr15Jones commented 7 years ago

@davidlange6 @smuzaffar @davidlt Could we use CMSSW_9_1_DEVEL_X to do the following testing?

  1. revert #18504 so the framework uses multiple threads at global begin transitions
  2. apply this patch to the version of coral used in that release
  3. apply #18499 to fix another problem related to the framework change

That should be sufficient to help shake out any other problems related to the framework change.

DrDaveD commented 7 years ago

@Dr15Jones the frontier client fix will not be quick. I was assuming you were going to disable LumiProducer like Zhen said, and in that case would rather not do the coral change. If that's not going to happen, then go ahead and do the coral patch, understanding that it is temporary. I would not try to get it merged upstream.

Dr15Jones commented 7 years ago

@davidlange6 @smuzaffar ping

davidlange6 commented 7 years ago

Sorry, I think we were both away over the weekend.

So, if all of this is just to fix the lumi producer, we should just remove that code instead. It's not supported and not working. @slava77

Dr15Jones commented 7 years ago

Getting rid of the LumiProducer (basically all the modules in that package would have to go) would help. However, the DQM also directly uses coral so we still need the change to coral to protect against that code trying to run at the same time as the PoolDBSource.

davidlange6 commented 7 years ago

I missed the DQM issue - which piece(s) of dqm are doing this?

Dr15Jones commented 7 years ago

I found them by running:

cdj@CDJ-3> git grep 'coral::Conn'
CondCore/CondDB/src/ConnectionPool.cc:      coral::ConnectionService connServ;
CondCore/CondDB/src/ConnectionPool.cc:      coral::ConnectionService connServ;
CondTools/DQM/interface/ReadBase.h:  coral::ConnectionService m_connectionService;
RecoLuminosity/LumiProducer/interface/DBConfig.h:    explicit DBConfig(coral::ConnectionService& svc);
[Lots of other RecoLuminosity/LumiProducer cut]

Dr15Jones commented 7 years ago

I need resolution on this item in order to make further progress on the framework.

davidlange6 commented 7 years ago

Could you make a PR to cmsdist with your coral patch? (We have a double-digit number of patches now :( )

Dr15Jones commented 7 years ago

I'll make the pull request. As an aside, it seems to me like making our own git repository for coral might be easier to manage at this point.

davidlange6 commented 7 years ago

Perhaps. I notice that most of your patch has been in the coral svn repo for some years now... @ggovi, is it hopeless to get back to mainstream coral?

davidlt commented 7 years ago

IIRC, patchsrc9 is the last one. There is no support for double digits.

davidlt commented 7 years ago

https://svnweb.cern.ch/trac/lcgcoral/browser/coral/tags

Looks like the latest one is 3.1.8. We haven't updated CORAL version in CMSSW for the last 5+ years.

ggovi commented 7 years ago

It is surely not trivial, since we have diverged over the last years. It would require checking the CMS coral patches to see whether everything has been implemented in the mainstream, plus validating the rest of the new changes.

davidlange6 commented 7 years ago

that sounds more sustainable than the situation we have today. Could you have a look at the more recent releases of coral?

ggovi commented 7 years ago

ok, i'll have a look.

DrDaveD commented 7 years ago

@davidlange6 Keep in mind that development on coral is essentially halted, and it is not expected to be used at all for run3, so it might not be worth the effort to update to the latest released version of coral. If we're limited on the number of patches, I expect some patches could be combined pretty easily.

davidlange6 commented 7 years ago

Where can we find more information about the plan to replace coral in CMSSW? Indeed, it's hard to infer that development has halted from the rate of CORAL releases (5 in the last year, though perhaps just to support cmake in lcg).

(Part of my interest was prompted by the mistake in reversing the coral patch, but still, it's not fantastic that we've forked off with no real developer.)

DrDaveD commented 7 years ago

I'm not sure the plan is very well documented, but Andrea Formica from ATLAS and Giacomo have been working on a new conditions system for run3 that will not do SQL queries and so will not need CORAL. It still uses the frontier client & server, but between the frontier server and the oracle DB is another server that does the SQL queries. The interface to this other server is higher-level http. The conditions system in the application will be able to contact the other server directly for low volume or pass its requests through frontier for caching.

Dr15Jones commented 7 years ago

@davidlange6 @smuzaffar given that CMSSW_9_2 is now open, could we have #18504 reverted in CMSSW_9_2 in order to find any additional thread-safety problems (as well as allow further development of the framework)? If that is considered too high a risk, could #18504 be reverted in CMSSW_9_2_DEVEL_X?

Dr15Jones commented 7 years ago

#18587 reverts #18504 for CMSSW_9_2_X

Dr15Jones commented 7 years ago

A crash related to the same problem happened again in the IBs https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc7_amd64_gcc630/CMSSW_9_2_X_2017-05-12-1100/pyRelValMatrixLogs/run/7.0_Cosmics+Cosmics+DIGICOS+RECOCOS+ALCACOS+HARVESTCOS/step3_Cosmics+Cosmics+DIGICOS+RECOCOS+ALCACOS+HARVESTCOS.log

The stack trace is

Thread 4 (Thread 0x7f556ecfe700 (LWP 217673)):
#0  0x00007f55af61befd in nanosleep () from /lib64/libc.so.6
#1  0x00007f55af61bd94 in sleep () from /lib64/libc.so.6
#2  0x00007f55a8bde723 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f55b215f542 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
#5  0x00007f55b215fe6f in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#6  0x00007f55b2164776 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#7  0x00007f55b216b260 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#8  0x00007f5592f4f43d in frontier::FrontierException::FrontierException(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#9  0x00007f5592f59159 in frontier::RuntimeError::RuntimeError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#10 0x00007f5592f55e68 in frontier::Request::encodeParam(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#11 0x00007f5592fcdda7 in coral::FrontierAccess::Statement::execute(coral::AttributeList const&, int) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/liblcg_FrontierAccess.so
#12 0x00007f5592fe9a6d in coral::FrontierAccess::Query::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/liblcg_FrontierAccess.so
#13 0x00007f557fd9f404 in LumiProducer::getLumiDataId(coral::ISchema const&, unsigned int) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/lib/slc7_amd64_gcc630/pluginLumiProducer.so
#14 0x00007f557fda94b3 in LumiProducer::beginRun(edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/lib/slc7_amd64_gcc630/pluginLumiProducer.so
#15 0x00007f55b20b965b in edm::one::EDProducerBase::doBeginRun(edm::RunPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#16 0x00007f55b200d6b0 in edm::WorkerT<edm::one::EDProducerBase>::implDoBegin(edm::RunPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#17 0x00007f55b1ff76c6 in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#18 0x00007f55b1ff794c in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#19 0x00007f55b2009c96 in void edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#20 0x00007f55b200a614 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1}>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1} const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#21 0x00007f55b200a6b1 in edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1}>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1} const&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#22 0x00007f55b0c42983 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f55ad60fe00, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:501
#23 0x00007f55b0c3b9d2 in tbb::internal::arena::process (this=0x7f55ad783780, s=...) at ../../src/tbb/arena.cpp:159
#24 0x00007f55b0c3a4cb in tbb::internal::market::process (this=0x7f55ad797900, j=...) at ../../src/tbb/market.cpp:677
#25 0x00007f55b0c367c6 in tbb::internal::rml::private_worker::run (this=0x7f55ad5b5180) at ../../src/tbb/private_server.cpp:271
#26 0x00007f55b0c369f9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:224
#27 0x00007f55af927dc5 in start_thread () from /lib64/libpthread.so.0
#28 0x00007f55af654ced in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f556f6ff700 (LWP 217672)):
#0  0x00007f55af61befd in nanosleep () from /lib64/libc.so.6
#1  0x00007f55af61bd94 in sleep () from /lib64/libc.so.6
#2  0x00007f55a8bde723 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007f55af64f469 in syscall () from /lib64/libc.so.6
#5  0x00007f55b0c369d2 in tbb::internal::futex_wait (comparand=2, futex=0x7f55ad5b512c) at ../../include/tbb/machine/linux_common.h:60
#6  tbb::internal::binary_semaphore::P (this=0x7f55ad5b512c) at ../../src/tbb/semaphore.h:206
#7  rml::internal::thread_monitor::commit_wait (c=<synthetic pointer>..., this=0x7f55ad5b5120) at ../../src/rml/include/../server/thread_monitor.h:259
#8  tbb::internal::rml::private_worker::run (this=0x7f55ad5b5100) at ../../src/tbb/private_server.cpp:278
#9  0x00007f55b0c369f9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:224
#10 0x00007f55af927dc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f55af654ced in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f5597bbf700 (LWP 217563)):
#0  0x00007f55af92eca9 in waitpid () from /lib64/libpthread.so.0
#1  0x00007f55a8bde917 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#2  0x00007f55a8bdeff5 in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#3  0x00007f55aff10c3f in std::execute_native_thread_routine (__p=0x7f5598171740) at ../../../../../libstdc++-v3/src/c++11/thread.cc:83
#4  0x00007f55af927dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f55af654ced in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f55add04c80 (LWP 217511)):
#0  0x00007f55af64a69d in poll () from /lib64/libc.so.6
#1  0x00007f55a8bdee44 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#2  0x00007f55a8bdf61a in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#3  0x00007f55a8be0ba5 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f55b0e6d9cd in je_tcache_dalloc_small (slow_path=false, binind=<optimized out>, ptr=0x7f5592e93c00, tcache=0x7f55adc42000, tsd=<optimized out>) at include/jemalloc/internal/tcache.h:424
#6  je_arena_dalloc (slow_path=false, tcache=0x7f55adc42000, ptr=0x7f5592e93c00, tsdn=<optimized out>) at include/jemalloc/internal/arena.h:1441
#7  je_idalloctm (slow_path=false, is_metadata=false, tcache=0x7f55adc42000, ptr=0x7f5592e93c00, tsdn=<optimized out>) at include/jemalloc/internal/jemalloc_internal.h:1170
#8  je_iqalloc (slow_path=false, tcache=0x7f55adc42000, ptr=0x7f5592e93c00, tsd=<optimized out>) at include/jemalloc/internal/jemalloc_internal.h:1187
#9  ifree (tsd=<optimized out>, slow_path=false, tcache=0x7f55adc42000, ptr=0x7f5592e93c00) at src/jemalloc.c:1896
#10 free (ptr=0x7f5592e93c00) at src/jemalloc.c:2016
#11 0x00007f5592f55393 in fn_zfree () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#12 0x00007f55b0863b1e in deflateEnd () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libz.so.1
#13 0x00007f5592f553b5 in fn_decleanup () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#14 0x00007f5592f555b6 in fn_gzip_str () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#15 0x00007f5592f556bc in fn_gzip_str2urlenc () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#16 0x00007f5592f55db7 in frontier::Request::encodeParam(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#17 0x00007f5592fcdda7 in coral::FrontierAccess::Statement::execute(coral::AttributeList const&, int) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/liblcg_FrontierAccess.so
#18 0x00007f5592fe9a6d in coral::FrontierAccess::Query::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/liblcg_FrontierAccess.so
#19 0x00007f559697410c in cond::persistency::PAYLOAD::Table::select(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, cond::Binary&, cond::Binary&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libCondCoreCondDB.so
#20 0x00007f5592083114 in std::shared_ptr<L1GtPrescaleFactors> cond::persistency::Session::fetchPayload<L1GtPrescaleFactors>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginCondCoreL1TPlugins.so
#21 0x00007f55920834ab in cond::persistency::PayloadProxy<L1GtPrescaleFactors>::loadPayload() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginCondCoreL1TPlugins.so
#22 0x00007f5592080c47 in edm::eventsetup::DataProxyTemplate<L1GtPrescaleFactorsTechTrigRcd, L1GtPrescaleFactors>::getImpl(edm::eventsetup::EventSetupRecord const&, edm::eventsetup::DataKey const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginCondCoreL1TPlugins.so
#23 0x00007f55b204930f in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecord const&, edm::eventsetup::DataKey const&, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#24 0x00007f55b206473b in edm::eventsetup::EventSetupRecord::getFromProxy(edm::eventsetup::DataKey const&, edm::eventsetup::ComponentDescription const*&, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#25 0x00007f558e6cbe4d in L1GtTriggerMenuLiteProducer::retrieveL1EventSetup(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginEventFilterL1GlobalTriggerRawToDigi.so
#26 0x00007f558e6cd491 in L1GtTriggerMenuLiteProducer::beginRunProduce(edm::Run&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginEventFilterL1GlobalTriggerRawToDigi.so
#27 0x00007f55b20b9672 in edm::one::EDProducerBase::doBeginRun(edm::RunPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#28 0x00007f55b200d6b0 in edm::WorkerT<edm::one::EDProducerBase>::implDoBegin(edm::RunPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#29 0x00007f55b1ff76c6 in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#30 0x00007f55b1ff794c in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#31 0x00007f55b2009c96 in void edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#32 0x00007f55b200a614 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1}>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1} const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#33 0x00007f55b200a6b1 in edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1}>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1} const&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#34 0x00007f55b0c42983 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f55ad792600, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:501
#35 0x00007f55b207b230 in edm::EventProcessor::beginRun(statemachine::Run const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#36 0x00007f55b1fc3b89 in statemachine::HandleRuns::beginRun(statemachine::Run const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#37 0x00007f55b1fc3c3c in statemachine::HandleRuns::setupCurrentRun() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#38 0x00007f55b1fc5445 in statemachine::NewRun::NewRun(boost::statechart::state<statemachine::NewRun, statemachine::HandleRuns, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#39 0x00007f55b1fcb10e in boost::statechart::state<statemachine::NewRun, statemachine::HandleRuns, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<statemachine::HandleRuns> const&, boost::statechart::state_machine<statemachine::Machine, statemachine::Starting, std::allocator<void>, boost::statechart::null_exception_translator>&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#40 0x00007f55b1fcdb18 in boost::statechart::state<statemachine::HandleRuns, statemachine::HandleFiles, statemachine::NewRun, (boost::statechart::history_mode)0>::deep_construct(boost::intrusive_ptr<statemachine::HandleFiles> const&, boost::statechart::state_machine<statemachine::Machine, statemachine::Starting, std::allocator<void>, boost::statechart::null_exception_translator>&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#41 0x00007f55b1fcdcda in boost::statechart::simple_state<statemachine::FirstFile, statemachine::HandleFiles, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#42 0x00007f55b2081e74 in boost::statechart::state_machine<statemachine::Machine, statemachine::Starting, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#43 0x00007f55b2075901 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#44 0x000000000040e9d4 in main::{lambda()#1}::operator()() const ()
#45 0x000000000040d2a5 in main ()

@smuzaffar Did the patch to CORAL make it into the CMSSW_9_2 IB?

Dr15Jones commented 7 years ago

@smuzaffar I've looked further into CORAL and I see the original lock doesn't cover enough of the frontier calls.
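
To illustrate what "covering enough" means here, below is a minimal sketch, not the actual CORAL or frontier code. `FakeClient`, `clientMutex`, `guardedQuery` and `runConcurrentQueries` are hypothetical names made up for the example. It assumes a client whose open/fetch/close sequence shares unsynchronized state, so the guard has to span the whole sequence rather than a single call.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a client library that keeps unsynchronized
// internal state. This is NOT the real frontier or CORAL API.
class FakeClient {
public:
  void open() { inUse_ = true; }
  int fetch() { return inUse_ ? ++served_ : -1; }
  void close() { inUse_ = false; }
  int served() const { return served_; }

private:
  bool inUse_ = false;
  int served_ = 0;
};

// One process-wide mutex shared by every caller of the client.
std::mutex& clientMutex() {
  static std::mutex m;
  return m;
}

// The guard spans open, fetch and close, so no other thread can
// interleave between them. Locking only fetch() would leave open()
// and close() racing against other threads.
int guardedQuery(FakeClient& client) {
  std::lock_guard<std::mutex> guard(clientMutex());
  client.open();
  int result = client.fetch();
  client.close();
  return result;
}

// Runs nThreads workers, each issuing perThread guarded queries, and
// returns the number of queries the client actually served.
int runConcurrentQueries(int nThreads, int perThread) {
  FakeClient client;
  std::vector<std::thread> workers;
  for (int t = 0; t < nThreads; ++t) {
    workers.emplace_back([&client, perThread] {
      for (int i = 0; i < perThread; ++i) {
        guardedQuery(client);
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  // All workers have joined, so reading the counter here is race free.
  assert(client.served() == nThreads * perThread);
  return client.served();
}
```

Calling `runConcurrentQueries(4, 1000)` should return 4000 because every call path takes the same mutex. Any path that reaches the client without taking it would reintroduce the race.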

Dr15Jones commented 7 years ago

+1 We patched CORAL and that has fixed the problem.