Closed Dr15Jones closed 7 years ago
A new Issue was created by @Dr15Jones Chris Jones.
@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
If the call to coral::FrontierAccess::Query::execute() were moved out of LumiProducer::beginRun and into an ESProducer, then the EventSetup lock would protect against concurrent access.
assign core
New categories assigned: core
@Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
assign db
New categories assigned: db
@ggovi you have been requested to review this Pull request/Issue and eventually sign? Thanks
As an intermediate step, we could lock access to the lumi::service::DBService by getting access to the mutex used to protect the conditions system.
@DrDaveD this has to do with concurrent access to the frontier client.
I took a look at the coral interface to Frontier. The code uses a per-Connection lock. However, the underlying assumption is that the third-party code being called is at least thread-friendly. This is not the case with the frontier client interface, where multiple connections touch the same memory structures.
I think the best way to fix this problem is to patch coral::FrontierAccess::Connection to use a mutex shared across all Connection instances (rather than having one per instance, as is the case now).
Do others agree?
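For readers following along, the difference between a per-instance lock and a lock shared by all instances can be sketched roughly as below. This is only an illustration under stated assumptions: the Connection class here is a hypothetical stand-in rather than the real coral::FrontierAccess::Connection, it uses std::mutex where the coral code uses boost::mutex, and the lock() accessor and connectionsShareOneLock() helper exist only for the demonstration.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for coral::FrontierAccess::Connection.
class Connection {
public:
  explicit Connection(std::string url) : m_url(std::move(url)) {}

  // Illustration-only accessor: every instance hands out the same mutex.
  std::mutex& lock() const { return s_lock; }

  void connect() {
    // All Connection objects serialize on the single static mutex, so two
    // connections can never be inside the client library at the same time.
    std::lock_guard<std::mutex> guard(s_lock);
    assert(!m_connected);  // this sketch only supports connecting once
    m_connected = true;    // real code would call into the client library here
  }

  bool connected() const { return m_connected; }

private:
  std::string m_url;
  bool m_connected = false;
  static std::mutex s_lock;  // one lock for the whole class, not per instance
};

std::mutex Connection::s_lock;

// Returns true when two independent connections use the very same mutex.
bool connectionsShareOneLock() {
  Connection a("frontier://a");
  Connection b("frontier://b");
  a.connect();
  b.connect();
  return &a.lock() == &b.lock() && a.connected() && b.connected();
}
```

With a non-static member mutex instead, the two addresses compared in connectionsShareOneLock() would differ, which is exactly the situation in which two connections can enter non-thread-safe client code at once.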
@Dr15Jones quoting @xiezhen "the LumiProducer is no more supported. You can just remove all the code from inside, leaving empty structures, so that they can still call."
@Dr15Jones on the coral Frontier fix, I guess it would be good to have it anyway...
Having a global lock in CORAL would prevent ever having concurrent threads using the frontier client. We have a long term plan to put a lock inside of frontier client and release the lock while the client is waiting on I/O. I think that's a better solution because it would allow interleaving I/O.
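The idea of holding a lock only while shared state is touched, and releasing it while waiting on I/O, might look roughly like the sketch below. It is not the frontier client's code: ClientState, runRequest, and the waitForIO callback are hypothetical names made up for this illustration, and the network wait is simulated by a callback.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>

// Hypothetical stand-in for the client's shared, non-thread-safe state.
struct ClientState {
  std::mutex lock;
  int requestsPrepared = 0;
  int responsesDecoded = 0;
};

// Runs one request. The lock is held while shared state is touched and
// released for the duration of the (simulated) network wait.
std::string runRequest(ClientState& state,
                       const std::function<std::string()>& waitForIO) {
  std::unique_lock<std::mutex> guard(state.lock);
  ++state.requestsPrepared;           // protected: encode the request

  guard.unlock();                     // let other threads in while we block
  std::string payload = waitForIO();  // no shared state is touched here
  guard.lock();                       // re-acquire before touching state again

  assert(guard.owns_lock());
  ++state.responsesDecoded;           // protected: decode the response
  return payload;
}

// Returns true if the lock could be taken while the simulated I/O was running.
bool lockIsFreeDuringIO() {
  ClientState state;
  bool wasFree = false;
  runRequest(state, [&] {
    // Probe the lock the way another thread would. try_lock is allowed to
    // fail spuriously, so retry a few times before giving up.
    for (int attempt = 0; attempt < 100 && !wasFree; ++attempt) {
      if (state.lock.try_lock()) {
        wasFree = true;
        state.lock.unlock();
      }
    }
    return std::string("payload");
  });
  return wasFree && state.requestsPrepared == 1 && state.responsesDecoded == 1;
}
```

The trade-off compared with a single global lock around the whole call is that requests from different threads can overlap their waits, at the cost of having to be sure nothing shared is read or written between unlock() and lock().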
@DrDaveD what is the time scale for making the frontier client thread safe? I was hoping to get this patch (which can later be removed) in today since the problem already caused jobs in the IB to crash.
@ggovi Only the LumiProducer and ExpressLumiProducer put LumiSummary and LumiDetails objects into the edm::LuminosityBlock, and we definitely have code in CMSSW which reads those objects:
[cdj@cmslpc26 src]$ git grep 'LumiSummary' | grep Handle
DQM/TrackingMonitor/src/GetLumi.cc: edm::Handle<LumiSummary> lumiSummary_;
DQMServices/Components/plugins/DQMLumiMonitor.cc: edm::Handle<LumiSummary> lumiSummary_;
DataFormats/FWLite/scripts/edmLumisInFiles.py: handle = Handle ('LumiSummary')
HLTrigger/HLTanalyzers/src/EventHeader.cc: edm::Handle<LumiSummary> lumiSummary;
L1Trigger/CSCTrackFinder/test/analysis/LCTOccupancies.cc: edm::Handle<LumiSummary> lumiSummary;
PhysicsTools/FWLite/bin/FWLiteLumiAccess.cc: fwlite::Handle<LumiSummary> summary;
RecoLocalTracker/SiPixelClusterizer/test/TestClusters.cc: edm::Handle<LumiSummary> lumi;
RecoLocalTracker/SiPixelClusterizer/test/TestPixTracks.cc: edm::Handle<LumiSummary> lumi;
RecoLocalTracker/SiPixelClusterizer/test/TestWithTracks.cc: edm::Handle<LumiSummary> lumi;
RecoLuminosity/LumiProducer/plugins/LumiCalculator.cc: edm::Handle<LumiSummary> lumiSummary;
RecoLuminosity/LumiProducer/plugins/LumiCalculator.cc: edm::Handle<LumiSummaryRunHeader> lumiSummaryRH;
RecoLuminosity/LumiProducer/test/TestExpressLumiProducer.cc: Handle<LumiSummary> lumiSummary;
RecoLuminosity/LumiProducer/test/TestLumiCorrectionSource.cc: edm::Handle<LumiSummary> lumisummary;
RecoLuminosity/LumiProducer/test/TestLumiProducer.cc: Handle<LumiSummary> lumiSummary;
[cdj@cmslpc26 src]$ git grep 'LumiDetails' | grep Handle
Calibration/IsolatedParticles/plugins/IsoTrig.cc: edm::Handle<LumiDetails> Lumid;
Calibration/IsolatedParticles/plugins/StudyHLT.cc: edm::Handle<LumiDetails> Lumid;
DPGAnalysis/SiStripTools/plugins/TrackCount.cc: edm::Handle<LumiDetails> ld;
DPGAnalysis/SiStripTools/src/DigiLumiCorrHistogramMaker.cc: edm::Handle<LumiDetails> ld;
DQM/TrackingMonitor/src/GetLumi.cc: edm::Handle<LumiDetails> lumi;
DQMServices/Components/plugins/DQMLumiMonitor.cc: Handle<LumiDetails> lumiDetails;
RecoLocalTracker/SiPixelClusterizer/test/TestClusters.cc: edm::Handle<LumiDetails> ld;
RecoLuminosity/LumiProducer/test/TestExpressLumiProducer.cc: Handle<LumiDetails> lumiDetails;
RecoLuminosity/LumiProducer/test/TestLumiProducer.cc: Handle<LumiDetails> lumiDetails;
Validation/RecoVertex/src/VertexHistogramMaker.cc: edm::Handle<LumiDetails> ld;
So even though LumiProducer isn't being maintained, it appears that it is needed.
My proposed changes are
--- coral/FrontierAccess/src/Connection.h 2017-04-28 09:43:29.945972505 -0500
+++ ../coral/src/FrontierAccess/src/Connection.h 2011-03-22 05:36:50.000000000 -0500
@@ -91,7 +91,7 @@
/// The type converter
TypeConverter* m_typeConverter;
/// The connection lock
- static boost::mutex s_lock;
+ mutable boost::mutex m_lock;
};
} // FrontierAccess namespace
} // coral namespace
--- coral/FrontierAccess/src/Connection.cpp 2017-04-28 09:45:26.111480886 -0500
+++ ../coral/src/FrontierAccess/src/Connection.cpp 2011-03-22 05:36:50.000000000 -0500
@@ -20,7 +20,6 @@
#include "ErrorHandler.h"
#include "Session.h"
#include "TypeConverter.h"
-boost::mutex coral::FrontierAccess::Connection::s_lock{};
coral::FrontierAccess::Connection::Connection( const coral::FrontierAccess::DomainProperties& domainProperties, const std::string& connectionString )
: m_connection(0)
@@ -29,6 +28,7 @@
, m_connected( false )
, m_serverVersion( "" )
, m_typeConverter( new coral::FrontierAccess::TypeConverter( m_domainProperties ) )
+ , m_lock()
{
if( this->m_typeConverter )
this->m_typeConverter->reset( 10 );
@@ -103,7 +103,7 @@
log << coral::Verbose << "Connecting to Frontier server using URL: " << this->m_connectionString << coral::MessageStream::endmsg;
{
- boost::mutex::scoped_lock lock( s_lock );
+ boost::mutex::scoped_lock lock( m_lock );
// Attaching the server
m_connection = new frontier::Connection( m_connectionString );
@@ -126,7 +126,7 @@
m_domainProperties,
m_connectionString,
*m_connection,
- s_lock,
+ m_lock,
schemaName,
*m_typeConverter );
return session;
@davidlange6 @smuzaffar @davidlt Could we use CMSSW_9_1_DEVEL_X to do the following testing?
That should be sufficient to help shake out any other problems related to the framework change.
@Dr15Jones the frontier client fix will not be quick. I was assuming you were going to disable LumiProducer like Zhen said, and in that case would rather not do the coral change. If that's not going to happen, then go ahead and do the coral patch, understanding that it is temporary. I would not try to get it merged upstream.
@davidlange6 @smuzaffar ping
Sorry, I think we were both away over the weekend.
So, if all of this is just to fix the lumi producer, we should just remove that code instead. It's not supported and not working. @slava77
Getting rid of the LumiProducer (basically all the modules in that package would have to go) would help. However, the DQM also directly uses coral so we still need the change to coral to protect against that code trying to run at the same time as the PoolDBSource.
I missed the DQM issue - which piece(s) of dqm are doing this?
I found them by running
cdj@CDJ-3> git grep 'coral::Conn'
CondCore/CondDB/src/ConnectionPool.cc: coral::ConnectionService connServ;
CondCore/CondDB/src/ConnectionPool.cc: coral::ConnectionService connServ;
CondTools/DQM/interface/ReadBase.h: coral::ConnectionService m_connectionService;
RecoLuminosity/LumiProducer/interface/DBConfig.h: explicit DBConfig(coral::ConnectionService& svc);
[Lots of other RecoLuminosity/LumiProducer cut]
I need resolution on this item in order to make further progress on the framework.
Could you make a PR to cmsdist with your coral patch? (We have a double-digit number of patches now :( )
I'll make the pull request. As an aside, it seems to me like making our own git repository for coral might be easier to manage at this point.
Perhaps. I notice that most of your patch has been in the coral svn repo for some years now... @ggovi, is it hopeless to get back to mainstream coral?
IIRC, patchsrc9 is the last one. There is no support for double digits.
https://svnweb.cern.ch/trac/lcgcoral/browser/coral/tags
Looks like the latest one is 3.1.8. We haven't updated the CORAL version in CMSSW for the last 5+ years.
It is surely not trivial, since we have diverged over the last few years. It would require going through the CMS coral patches and checking whether everything has been implemented in the mainstream, plus validating the rest of the new changes.
That sounds more sustainable than the situation we have today. Could you have a look at the more recent releases of coral?
OK, I'll have a look.
@davidlange6 Keep in mind that development on coral is essentially halted, and it is not expected to be used at all for run3, so it might not be worth the effort to update to the latest released version of coral. If we're limited on the number of patches, I expect some patches could be combined pretty easily.
Where can we find more information about the plan to replace coral in CMSSW? Indeed, it's hard to infer halted development from the rate of CORAL releases (5 in the last year, though perhaps just to support cmake in lcg).
(Part of my interest came from the mistake in reversing the coral patch, but still, it's not fantastic that we've forked off with no real developer.)
I'm not sure the plan is very well documented, but Andrea Formica from ATLAS and Giacomo have been working on a new conditions system for run3 that will not do SQL queries and so will not need CORAL. It still uses the frontier client and server, but between the frontier server and the Oracle DB there is another server that does the SQL queries. The interface to this other server is higher-level HTTP. The conditions system in the application will be able to contact the other server directly for low-volume requests, or pass its requests through frontier for caching.
@davidlange6 @smuzaffar given that CMSSW_9_2 is now open, could we have #18504 reverted in CMSSW_9_2 in order to find any additional thread-safety problems (as well as allow further development of the framework)? If that is considered too high a risk, could #18504 be reverted in CMSSW_9_2_DEVEL_X?
A crash related to the same problem happened again in the IBs https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc7_amd64_gcc630/CMSSW_9_2_X_2017-05-12-1100/pyRelValMatrixLogs/run/7.0_Cosmics+Cosmics+DIGICOS+RECOCOS+ALCACOS+HARVESTCOS/step3_Cosmics+Cosmics+DIGICOS+RECOCOS+ALCACOS+HARVESTCOS.log
The stack trace is
Thread 4 (Thread 0x7f556ecfe700 (LWP 217673)):
#0 0x00007f55af61befd in nanosleep () from /lib64/libc.so.6
#1 0x00007f55af61bd94 in sleep () from /lib64/libc.so.6
#2 0x00007f55a8bde723 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007f55b215f542 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
#5 0x00007f55b215fe6f in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#6 0x00007f55b2164776 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#7 0x00007f55b216b260 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#8 0x00007f5592f4f43d in frontier::FrontierException::FrontierException(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#9 0x00007f5592f59159 in frontier::RuntimeError::RuntimeError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#10 0x00007f5592f55e68 in frontier::Request::encodeParam(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#11 0x00007f5592fcdda7 in coral::FrontierAccess::Statement::execute(coral::AttributeList const&, int) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/liblcg_FrontierAccess.so
#12 0x00007f5592fe9a6d in coral::FrontierAccess::Query::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/liblcg_FrontierAccess.so
#13 0x00007f557fd9f404 in LumiProducer::getLumiDataId(coral::ISchema const&, unsigned int) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/lib/slc7_amd64_gcc630/pluginLumiProducer.so
#14 0x00007f557fda94b3 in LumiProducer::beginRun(edm::Run const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/lib/slc7_amd64_gcc630/pluginLumiProducer.so
#15 0x00007f55b20b965b in edm::one::EDProducerBase::doBeginRun(edm::RunPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#16 0x00007f55b200d6b0 in edm::WorkerT<edm::one::EDProducerBase>::implDoBegin(edm::RunPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#17 0x00007f55b1ff76c6 in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#18 0x00007f55b1ff794c in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#19 0x00007f55b2009c96 in void edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#20 0x00007f55b200a614 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1}>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1} const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#21 0x00007f55b200a6b1 in edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1}>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1} const&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#22 0x00007f55b0c42983 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f55ad60fe00, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:501
#23 0x00007f55b0c3b9d2 in tbb::internal::arena::process (this=0x7f55ad783780, s=...) at ../../src/tbb/arena.cpp:159
#24 0x00007f55b0c3a4cb in tbb::internal::market::process (this=0x7f55ad797900, j=...) at ../../src/tbb/market.cpp:677
#25 0x00007f55b0c367c6 in tbb::internal::rml::private_worker::run (this=0x7f55ad5b5180) at ../../src/tbb/private_server.cpp:271
#26 0x00007f55b0c369f9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:224
#27 0x00007f55af927dc5 in start_thread () from /lib64/libpthread.so.0
#28 0x00007f55af654ced in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f556f6ff700 (LWP 217672)):
#0 0x00007f55af61befd in nanosleep () from /lib64/libc.so.6
#1 0x00007f55af61bd94 in sleep () from /lib64/libc.so.6
#2 0x00007f55a8bde723 in sig_pause_for_stacktrace () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#3 <signal handler called>
#4 0x00007f55af64f469 in syscall () from /lib64/libc.so.6
#5 0x00007f55b0c369d2 in tbb::internal::futex_wait (comparand=2, futex=0x7f55ad5b512c) at ../../include/tbb/machine/linux_common.h:60
#6 tbb::internal::binary_semaphore::P (this=0x7f55ad5b512c) at ../../src/tbb/semaphore.h:206
#7 rml::internal::thread_monitor::commit_wait (c=<synthetic pointer>..., this=0x7f55ad5b5120) at ../../src/rml/include/../server/thread_monitor.h:259
#8 tbb::internal::rml::private_worker::run (this=0x7f55ad5b5100) at ../../src/tbb/private_server.cpp:278
#9 0x00007f55b0c369f9 in tbb::internal::rml::private_worker::thread_routine (arg=<optimized out>) at ../../src/tbb/private_server.cpp:224
#10 0x00007f55af927dc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f55af654ced in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f5597bbf700 (LWP 217563)):
#0 0x00007f55af92eca9 in waitpid () from /lib64/libpthread.so.0
#1 0x00007f55a8bde917 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#2 0x00007f55a8bdeff5 in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#3 0x00007f55aff10c3f in std::execute_native_thread_routine (__p=0x7f5598171740) at ../../../../../libstdc++-v3/src/c++11/thread.cc:83
#4 0x00007f55af927dc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f55af654ced in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f55add04c80 (LWP 217511)):
#0 0x00007f55af64a69d in poll () from /lib64/libc.so.6
#1 0x00007f55a8bdee44 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#2 0x00007f55a8bdf61a in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#3 0x00007f55a8be0ba5 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f55b0e6d9cd in je_tcache_dalloc_small (slow_path=false, binind=<optimized out>, ptr=0x7f5592e93c00, tcache=0x7f55adc42000, tsd=<optimized out>) at include/jemalloc/internal/tcache.h:424
#6 je_arena_dalloc (slow_path=false, tcache=0x7f55adc42000, ptr=0x7f5592e93c00, tsdn=<optimized out>) at include/jemalloc/internal/arena.h:1441
#7 je_idalloctm (slow_path=false, is_metadata=false, tcache=0x7f55adc42000, ptr=0x7f5592e93c00, tsdn=<optimized out>) at include/jemalloc/internal/jemalloc_internal.h:1170
#8 je_iqalloc (slow_path=false, tcache=0x7f55adc42000, ptr=0x7f5592e93c00, tsd=<optimized out>) at include/jemalloc/internal/jemalloc_internal.h:1187
#9 ifree (tsd=<optimized out>, slow_path=false, tcache=0x7f55adc42000, ptr=0x7f5592e93c00) at src/jemalloc.c:1896
#10 free (ptr=0x7f5592e93c00) at src/jemalloc.c:2016
#11 0x00007f5592f55393 in fn_zfree () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#12 0x00007f55b0863b1e in deflateEnd () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libz.so.1
#13 0x00007f5592f553b5 in fn_decleanup () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#14 0x00007f5592f555b6 in fn_gzip_str () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#15 0x00007f5592f556bc in fn_gzip_str2urlenc () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#16 0x00007f5592f55db7 in frontier::Request::encodeParam(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/libfrontier_client.so.2
#17 0x00007f5592fcdda7 in coral::FrontierAccess::Statement::execute(coral::AttributeList const&, int) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/liblcg_FrontierAccess.so
#18 0x00007f5592fe9a6d in coral::FrontierAccess::Query::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw-patch/CMSSW_9_2_X_2017-05-12-1100/external/slc7_amd64_gcc630/lib/liblcg_FrontierAccess.so
#19 0x00007f559697410c in cond::persistency::PAYLOAD::Table::select(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, cond::Binary&, cond::Binary&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libCondCoreCondDB.so
#20 0x00007f5592083114 in std::shared_ptr<L1GtPrescaleFactors> cond::persistency::Session::fetchPayload<L1GtPrescaleFactors>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginCondCoreL1TPlugins.so
#21 0x00007f55920834ab in cond::persistency::PayloadProxy<L1GtPrescaleFactors>::loadPayload() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginCondCoreL1TPlugins.so
#22 0x00007f5592080c47 in edm::eventsetup::DataProxyTemplate<L1GtPrescaleFactorsTechTrigRcd, L1GtPrescaleFactors>::getImpl(edm::eventsetup::EventSetupRecord const&, edm::eventsetup::DataKey const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginCondCoreL1TPlugins.so
#23 0x00007f55b204930f in edm::eventsetup::DataProxy::get(edm::eventsetup::EventSetupRecord const&, edm::eventsetup::DataKey const&, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#24 0x00007f55b206473b in edm::eventsetup::EventSetupRecord::getFromProxy(edm::eventsetup::DataKey const&, edm::eventsetup::ComponentDescription const*&, bool) const () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#25 0x00007f558e6cbe4d in L1GtTriggerMenuLiteProducer::retrieveL1EventSetup(edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginEventFilterL1GlobalTriggerRawToDigi.so
#26 0x00007f558e6cd491 in L1GtTriggerMenuLiteProducer::beginRunProduce(edm::Run&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/pluginEventFilterL1GlobalTriggerRawToDigi.so
#27 0x00007f55b20b9672 in edm::one::EDProducerBase::doBeginRun(edm::RunPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#28 0x00007f55b200d6b0 in edm::WorkerT<edm::one::EDProducerBase>::implDoBegin(edm::RunPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#29 0x00007f55b1ff76c6 in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#30 0x00007f55b1ff794c in bool edm::Worker::runModule<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#31 0x00007f55b2009c96 in void edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0>::Context const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#32 0x00007f55b200a614 in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1}>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1} const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#33 0x00007f55b200a6b1 in edm::SerialTaskQueue::QueuedTask<void edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1}>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::RunPrincipal, (edm::BranchActionType)0> >::execute()::{lambda()#1} const&)::{lambda()#1}>::execute() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#34 0x00007f55b0c42983 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::local_wait_for_all (this=0x7f55ad792600, parent=..., child=<optimized out>) at ../../src/tbb/custom_scheduler.h:501
#35 0x00007f55b207b230 in edm::EventProcessor::beginRun(statemachine::Run const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#36 0x00007f55b1fc3b89 in statemachine::HandleRuns::beginRun(statemachine::Run const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#37 0x00007f55b1fc3c3c in statemachine::HandleRuns::setupCurrentRun() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#38 0x00007f55b1fc5445 in statemachine::NewRun::NewRun(boost::statechart::state<statemachine::NewRun, statemachine::HandleRuns, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#39 0x00007f55b1fcb10e in boost::statechart::state<statemachine::NewRun, statemachine::HandleRuns, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::shallow_construct(boost::intrusive_ptr<statemachine::HandleRuns> const&, boost::statechart::state_machine<statemachine::Machine, statemachine::Starting, std::allocator<void>, boost::statechart::null_exception_translator>&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#40 0x00007f55b1fcdb18 in boost::statechart::state<statemachine::HandleRuns, statemachine::HandleFiles, statemachine::NewRun, (boost::statechart::history_mode)0>::deep_construct(boost::intrusive_ptr<statemachine::HandleFiles> const&, boost::statechart::state_machine<statemachine::Machine, statemachine::Starting, std::allocator<void>, boost::statechart::null_exception_translator>&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#41 0x00007f55b1fcdcda in boost::statechart::simple_state<statemachine::FirstFile, statemachine::HandleFiles, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#42 0x00007f55b2081e74 in boost::statechart::state_machine<statemachine::Machine, statemachine::Starting, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&) () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#43 0x00007f55b2075901 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/nweek-02471/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-10-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so
#44 0x000000000040e9d4 in main::{lambda()#1}::operator()() const ()
#45 0x000000000040d2a5 in main ()
@smuzaffar Did the patch to CORAL make it into the CMSSW_9_2 IB?
@smuzaffar I've looked further into CORAL and I see the original lock doesn't cover enough of the frontier calls.
+1 We patched CORAL and that has fixed the problem.
The framework now uses multiple threads to process the global begin Run transition. During that transition LumiProducer::beginRun is called and it directly calls coral::FrontierAccess::Query::execute(). Unfortunately, modules which request data from the conditions database will also trigger calls to frontier, which leads to race conditions. See https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc7_amd64_gcc630/CMSSW_9_1_X_2017-04-26-2300/pyRelValMatrixLogs/run/1000.0_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT/step2_RunMinBias2011A+RunMinBias2011A+TIER0+SKIMD+HARVESTDfst2+ALCASPLIT.log
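For readers who want to see the shape of the remedy discussed in this thread (one mutex shared by every connection, rather than one mutex per connection), here is a minimal sketch. It is not the actual CORAL patch: `fake_client`, `Connection`, `sharedMutex` and `runConcurrentQueries` are hypothetical names made up for illustration, and the "client" is just a global counter standing in for third-party code that touches shared memory without any protection.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a third-party client with unprotected global
// state. This is NOT the real frontier client API.
namespace fake_client {
  long g_requestCount = 0;
  void query() { ++g_requestCount; }  // a data race if called concurrently
}

// Hypothetical connection wrapper. Every instance locks the same mutex,
// so calls into fake_client are serialized across all connections.
class Connection {
public:
  void execute() {
    std::lock_guard<std::mutex> guard(sharedMutex());
    fake_client::query();
  }

private:
  // Function-local static: one mutex for the whole process, initialized
  // in a thread-safe way since C++11.
  static std::mutex& sharedMutex() {
    static std::mutex m;
    return m;
  }
};

// Runs nThreads threads, each with its own Connection, and returns the
// total number of queries recorded by the fake client.
long runConcurrentQueries(int nThreads, int callsPerThread) {
  fake_client::g_requestCount = 0;
  std::vector<std::thread> workers;
  for (int t = 0; t < nThreads; ++t) {
    workers.emplace_back([callsPerThread] {
      Connection c;  // separate connection per thread
      for (int i = 0; i < callsPerThread; ++i) {
        c.execute();
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return fake_client::g_requestCount;
}
```

Calling `runConcurrentQueries(8, 10000)` exercises eight connections at once; with the shared mutex no increment is lost. With a per-instance mutex instead, the threads would not exclude one another and the counter update would be a data race. As noted earlier in the thread, the cost of a process-wide lock is that no two threads can ever be inside the client at the same time, including while one is waiting on I/O.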