Closed mmusich closed 4 days ago
A new Issue was created by @mmusich.
@antoniovilela, @sextonkennedy, @rappoccio, @Dr15Jones, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
Does the same crash happen in plain CMSSW_14_0_6?
yes.
assign hlt
New categories assigned: hlt
@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks
assign daq
New categories assigned: daq
@emeschi,@smorovic you have been requested to review this Pull request/Issue and eventually sign? Thanks
It doesn't crash if this is appended:
process.FastMonitoringService = cms.Service( "FastMonitoringService")
process.EvFDaqDirector = cms.Service( "EvFDaqDirector",
    baseDir = cms.untracked.string( "." ),
    buBaseDir = cms.untracked.string( "." ),
    buBaseDirsAll = cms.untracked.vstring( ),
    buBaseDirsNumStreams = cms.untracked.vint32( ),
    runNumber = cms.untracked.uint32( 380647 ),
    useFileBroker = cms.untracked.bool( False ),
    fileBrokerHostFromCfg = cms.untracked.bool( True ),
    fileBrokerHost = cms.untracked.string( "" ),
    fileBrokerPort = cms.untracked.string( "8080" ),
    fileBrokerKeepAlive = cms.untracked.bool( True ),
    fileBrokerUseLocalLock = cms.untracked.bool( True ),
    fuLockPollInterval = cms.untracked.uint32( 2000 ),
    outputAdler32Recheck = cms.untracked.bool( False ),
    directorIsBU = cms.untracked.bool( False ),
    hltSourceDirectory = cms.untracked.string( "" ),
    mergingPset = cms.untracked.string( "" )
)
along with mkdir run380647.
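The directory step can also be scripted. Below is a minimal sketch, assuming the run directory is expected under baseDir (set to "." above) and is named run<runNumber>, as the mkdir command suggests. ensure_run_dir is a hypothetical helper for illustration, not part of CMSSW.

```python
import os


def ensure_run_dir(base_dir: str, run_number: int) -> str:
    """Create <base_dir>/run<run_number> if it is missing and return its path.

    Hypothetical helper, not part of CMSSW: it only mirrors `mkdir run380647`.
    """
    run_dir = os.path.join(base_dir, f"run{run_number}")
    os.makedirs(run_dir, exist_ok=True)  # no error if the directory already exists
    return run_dir
```

Calling ensure_run_dir(".", 380647) before cmsRun would be equivalent to the manual mkdir, and is safe to repeat.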
From the code it is not clear why it would crash. Maybe it's the cast from MicroStateService to FastMonitoringService pointer (in case dummy MSS is inserted somehow). We are planning to finally remove MicroStateService base class, it will happen in 14_1_X (soon).
It is not that: even after removing the check for the FMS service, there is still a crash.
Indeed, hltGetConfiguration removes these (see https://github.com/cms-sw/cmssw/blob/master/HLTrigger/Configuration/python/Tools/confdb.py#L809)
# remove the DAQ modules and the online definition of the DQMStore and DQMFileSaver
# unless a hilton-like configuration has been requested
if not self.config.hilton:
    self.options['services'].append( "-EvFDaqDirector" )
    self.options['services'].append( "-FastMonitoringService" )
    self.options['services'].append( "-DQMStore" )
    self.options['modules'].append( "-hltDQMFileSaver" )
    self.options['modules'].append( "-hltDQMFileSaverPB" )
It is recommended to use minimal or none output in the hltGetConfiguration command, or at least to explicitly remove the "-RatesMonitoring" path.
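The confdb.py excerpt above appends names prefixed with "-" to the services and modules option lists. Here is a rough sketch of how such a list could be applied, assuming a leading "-" means "drop this name from the configuration". apply_removals and the sample service list are hypothetical and are not taken from confdb.py.

```python
def apply_removals(names, options):
    """Return `names` without the entries listed as "-<name>" in `options`.

    Assumption for illustration only: a leading "-" marks a removal;
    option entries without it are ignored by this sketch.
    """
    removed = {opt[1:] for opt in options if opt.startswith("-")}
    return [name for name in names if name not in removed]


# Hypothetical input: the services of an online menu before the removals.
services = ["EvFDaqDirector", "FastMonitoringService", "DQMStore", "MessageLogger"]
options = ["-EvFDaqDirector", "-FastMonitoringService", "-DQMStore"]
remaining = apply_removals(services, options)
```

Under this assumption, a non-hilton configuration ends up without EvFDaqDirector and FastMonitoringService, which matches the observation that the crash disappears once they are added back.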
I ran this now on an FU machine (with a GPU) and I'm getting a slightly different stack trace with more information:
Thread 1 (Thread 0x7f5639321640 (LWP 2990918) "cmsRun"):
#0 0x00007f5639eeb0e1 in poll () from /lib64/libc.so.6
#1 0x00007f5630bbe6af in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2 0x00007f5630b72dbc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3 0x00007f5630b73720 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f5639e95f7b in __memmove_avx_unaligned () from /lib64/libc.so.6
#6 0x00007f55b54723ba in Json::duplicateAndPrefixStringValue(char const*, unsigned int) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/external/el8_amd64_gcc12/lib/libtensorflow_framework.so.2
#7 0x00007f55b5472582 in Json::Value::Value(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/external/el8_amd64_gcc12/lib/libtensorflow_framework.so.2
#8 0x00007f55874cf94e in HLTriggerJSONMonitoring::globalEndLuminosityBlockSummary(edm::LuminosityBlock const&, edm::EventSetup const&, HLTriggerJSONMonitoringData::lumisection*) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginHLTriggerJSONMonitoringPlugins.so
#9 0x00007f55874d4388 in virtual thunk to edm::global::impl::LuminosityBlockSummaryCacheHolder<edm::global::EDAnalyzerBase, HLTriggerJSONMonitoringData::lumisection>::doEndLuminosityBlockSummary_(edm::LuminosityBlock const&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginHLTriggerJSONMonitoringPlugins.so
Here it seems that it tries to use Json::Value from the tensorflow library, while we have an integrated (older and modified for thread safety) version in EventFilter/Utilities.
https://github.com/cms-sw/cmssw/blob/master/EventFilter/Utilities/interface/json.h
https://github.com/cms-sw/cmssw/blob/master/EventFilter/Utilities/interface/value.h
That version doesn't actually include the Json::duplicateAndPrefixStringValue function anywhere.
I think what happens is that when the EventFilter/Utilities library is loaded because the services are used, the correct version is picked up and there is no crash.
If not, we end up using the tensorflow version and, since the code is compiled with headers from EventFilter/Utilities, this causes memory corruption and a crash either here or later in the module. On lxplus I also noticed that the crash happens on a simple string size() call made after a json::Value is defined in the code, and that removing the json::Value removes the crash.
Sounds like a one-definition rule violation. If the copy in EventFilter/Utilities still needs to be kept, I'd recommend moving all the relevant code into a CMS-specific namespace.
I wouldn't dare to change to a different version in the short term, and in the long term we were already thinking of evaluating different json implementations.
Using a namespace seems fine (it could be "evf", which is used for most of the EventFilter/Utilities code). I'll work on those changes.
Regarding evaluating different json implementations in the long term: in other CMSSW packages we've been using https://github.com/nlohmann/json, which is available as an external via <use name="json"/>.
Updated in https://github.com/cms-sw/cmssw/pull/44989. I used the jsoncollector namespace.
The crash is gone in 14_0_6 with the backport. I'll open a backport PR as well.
Proposed fixes are merged:
CMSSW_14_1_X (master)
CMSSW_14_0_X
@cms-sw/daq-l2 this issue could be closed, right?
+1 yes
This issue is fully signed and ready to be closed.
@cmsbuild, please close
@silviodonato reported a crash in CMSSW_14_0_6_MULTIARCHS when running:
concerning:
Trying to reproduce with a slightly different setup (e.g. the script below), I get a different crash (also on CPU-only) involving
As additional information, it looks like it depends on the output configuration. Setting:
--output full ([*] caveat at https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideGlobalHLT#General_Usage)
--output minimal
--output none
it runs without problems, whereas setting:
--output all
it crashes as reported above.
FYI @missirol @fwyzard @cms-sw/hlt-l2