cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Offline crashes in `{HLT,L1}TriggerJSONMonitoring` in `CMSSW_14_0_6_MULTIARCHS` #44975

Closed mmusich closed 4 days ago

mmusich commented 2 weeks ago

@silviodonato reported a crash in CMSSW_14_0_6_MULTIARCHS when running:

ssh lxplus8.cern.ch
export SCRAM_ARCH=el8_amd64_gcc12
cmsrel CMSSW_14_0_6_MULTIARCHS
cd CMSSW_14_0_6_MULTIARCHS/src
cmsenv
hltGetConfiguration run:380647 --globaltag  140X_dataRun3_HLT_v3  --input file:/eos/cms/tier0/store/data/Run2024D/EphemeralHLTPhysics0/RAW/v1/000/380/647/00000/a8bb2f4f-008c-454b-8a8c-f77ff51e8fcf.root

with the following stack trace:

Thread 1 (Thread 0x7fe7ae29d640 (LWP 1513991) "cmsRun"):
#0  0x00007fe7aee6a301 in poll () from /lib64/libc.so.6
#1  0x00007fe7a26f62ff in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fe7a26a9afc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  0x00007fe7a26aa460 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fe7aee14e41 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#6  0x00007fe7af8117ab in std::char_traits<char>::copy (__n=49, __s2=<optimized out>, __s1=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/char_traits.h:435
#7  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy (__n=49, __s=<optimized out>, __d=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/basic_string.h:431
#8  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy (__n=49, __s=<optimized out>, __d=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/basic_string.h:426
#9  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign (this=0x7fffc5118a40, __str=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:291
#10 0x00007fe711d1caf6 in L1TriggerJSONMonitoring::globalEndLuminosityBlockSummary(edm::LuminosityBlock const&, edm::EventSetup const&, L1TriggerJSONMonitoringData::lumisection*) const () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginHLTriggerJSONMonitoringPlugins.so
#11 0x00007fe711d1d8c8 in virtual thunk to edm::global::impl::LuminosityBlockSummaryCacheHolder<edm::global::EDAnalyzerBase, L1TriggerJSONMonitoringData::lumisection>::doEndLuminosityBlockSummary_(edm::LuminosityBlock const&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginHLTriggerJSONMonitoringPlugins.so
#12 0x00007fe7b18c1ff5 in edm::global::EDAnalyzerBase::doEndLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007fe7b18b9da0 in edm::WorkerT<edm::global::EDAnalyzerBase>::implDoEnd(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#14 0x00007fe7b1807a7f in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007fe7b17f5ef8 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#16 0x00007fe7b17b8bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fe7afff3281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe7acc99380) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#18 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe7acc99380) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#19 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#20 0x00007fe7b17c941b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#21 0x00007fe7b17d324d in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#22 0x00007fe7b17d37b1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#23 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#24 0x00007fe7affdf9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#25 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#26 0x000000000040517c in main ()

Current Modules:

Module: L1TriggerJSONMonitoring:hltL1TriggerJSONMonitoring (crashed)
Segmentation fault (core dumped)

Trying to reproduce with a slightly different setup (e.g. the script below)

#!/bin/bash -ex

# CMSSW_14_0_6_MULTIARCHS

hltGetConfiguration run:380647 \
            --globaltag 140X_dataRun3_HLT_v3 \
            --input file:/eos/cms/tier0/store/data/Run2024D/EphemeralHLTPhysics0/RAW/v1/000/380/647/00000/a8bb2f4f-008c-454b-8a8c-f77ff51e8fcf.root > hlt_run380647.py

cat <<@EOF >> hlt_run380647.py
process.options.wantSummary = False
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt_run380647.py &> hlt.log

I get a different crash (also on a CPU-only machine), involving:

Thread 1 (Thread 0x7fed272ac640 (LWP 2328682) "cmsRun"):
#0  0x00007fed27e79301 in poll () from /lib64/libc.so.6
#1  0x00007fed1b72f2ff in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fed1b6e2afc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  0x00007fed1b6e3460 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fed27e23e37 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#6  0x00007fed27de7009 in __GI__IO_file_xsputn () from /lib64/libc.so.6
#7  0x00007fed27ddc19c in fwrite () from /lib64/libc.so.6
#8  0x00007fed2881127d in std::basic_streambuf<char, std::char_traits<char> >::sputn (__n=50, __s=0x0, this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/streambuf:455
#9  std::__ostream_write<char, std::char_traits<char> > (__n=50, __s=0x0, __out=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/ostream_insert.h:51
#10 std::__ostream_insert<char, std::char_traits<char> > (__out=..., __s=0x0, __n=50) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_13_2_0_pre2-el8_amd64_gcc12/build/CMSSW_13_2_0_pre2-build/BUILD/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/gcc-12.3.1/obj/x86_64-redhat-linux-gnu/libstdc++-v3/include/bits/ostream_insert.h:102
#11 0x00007fec9955a16b in HLTriggerJSONMonitoring::globalEndLuminosityBlockSummary(edm::LuminosityBlock const&, edm::EventSetup const&, HLTriggerJSONMonitoringData::lumisection*) const () from /tmp/musich/hltL1TriggerJSONMonitoring/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginHLTriggerJSONMonitoringPlugins.so
#12 0x00007fec9955f0c8 in virtual thunk to edm::global::impl::LuminosityBlockSummaryCacheHolder<edm::global::EDAnalyzerBase, HLTriggerJSONMonitoringData::lumisection>::doEndLuminosityBlockSummary_(edm::LuminosityBlock const&, edm::EventSetup const&) () from /tmp/musich/hltL1TriggerJSONMonitoring/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginHLTriggerJSONMonitoringPlugins.so
#13 0x00007fed2a8d0ff5 in edm::global::EDAnalyzerBase::doEndLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#14 0x00007fed2a8c8da0 in edm::WorkerT<edm::global::EDAnalyzerBase>::implDoEnd(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007fed2a816a7f in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#16 0x00007fed2a804ef8 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)3> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fed2a7c7bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fed29002281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fed25c83e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#19 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fed25c83e00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#20 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#21 0x00007fed2a7d841b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#22 0x00007fed2a7e224d in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#23 0x00007fed2a7e27b1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#24 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#25 0x00007fed28fee9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#26 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#27 0x000000000040517c in main ()

Current Modules:

Module: HLTriggerJSONMonitoring:hltHLTriggerJSONMonitoring (crashed)
Module: none

A fatal system signal has occurred: segmentation violation

As additional information, it looks like it depends on the output configuration. Setting:

it runs without problems, whereas setting:

it crashes as reported above.

FYI @missirol @fwyzard @cms-sw/hlt-l2

cmsbuild commented 2 weeks ago

cms-bot internal usage

cmsbuild commented 2 weeks ago

A new Issue was created by @mmusich.

@antoniovilela, @sextonkennedy, @rappoccio, @Dr15Jones, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fwyzard commented 2 weeks ago

Does the same crash happen in plain CMSSW_14_0_6?

mmusich commented 2 weeks ago

> Does the same crash happen in plain CMSSW_14_0_6?

yes.

makortel commented 2 weeks ago

assign hlt

cmsbuild commented 2 weeks ago

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

Martin-Grunewald commented 2 weeks ago

assign daq

cmsbuild commented 2 weeks ago

New categories assigned: daq

@emeschi,@smorovic you have been requested to review this Pull request/Issue and eventually sign? Thanks

smorovic commented 2 weeks ago

It doesn't crash if this is appended:

process.FastMonitoringService = cms.Service( "FastMonitoringService")

process.EvFDaqDirector = cms.Service( "EvFDaqDirector",
    baseDir = cms.untracked.string( "." ),
    buBaseDir = cms.untracked.string( "." ),
    buBaseDirsAll = cms.untracked.vstring(  ),
    buBaseDirsNumStreams = cms.untracked.vint32(  ),
    runNumber = cms.untracked.uint32( 380647 ),
    useFileBroker = cms.untracked.bool( False ),
    fileBrokerHostFromCfg = cms.untracked.bool( True ),
    fileBrokerHost = cms.untracked.string( "" ),
    fileBrokerPort = cms.untracked.string( "8080" ),
    fileBrokerKeepAlive = cms.untracked.bool( True ),
    fileBrokerUseLocalLock = cms.untracked.bool( True ),
    fuLockPollInterval = cms.untracked.uint32( 2000 ),
    outputAdler32Recheck = cms.untracked.bool( False ),
    directorIsBU = cms.untracked.bool( False ),
    hltSourceDirectory = cms.untracked.string( "" ),
    mergingPset = cms.untracked.string( "" )
)

along with mkdir run380647.

From the code it is not clear why it would crash. Maybe it is the cast from a MicroStateService pointer to a FastMonitoringService pointer (in case a dummy MSS is inserted somehow). We are planning to finally remove the MicroStateService base class; this will happen in 14_1_X (soon).

smorovic commented 2 weeks ago

> From the code it is not clear why it would crash. Maybe it is the cast from a MicroStateService pointer to a FastMonitoringService pointer (in case a dummy MSS is inserted somehow). We are planning to finally remove the MicroStateService base class; this will happen in 14_1_X (soon).

It is not that: even after removing the check for the FMS service, there is still a crash.

Martin-Grunewald commented 2 weeks ago

Indeed, hltGetConfiguration removes these (see https://github.com/cms-sw/cmssw/blob/master/HLTrigger/Configuration/python/Tools/confdb.py#L809)

    # remove the DAQ modules and the online definition of the DQMStore and DQMFileSaver
    # unless a hilton-like configuration has been requested
    if not self.config.hilton:
      self.options['services'].append( "-EvFDaqDirector" )
      self.options['services'].append( "-FastMonitoringService" )
      self.options['services'].append( "-DQMStore" )
      self.options['modules'].append( "-hltDQMFileSaver" )
      self.options['modules'].append( "-hltDQMFileSaverPB" )

It is recommended to use "minimal" or "none" as the output setting in the hltGetConfiguration command, or at least to explicitly remove the "-RatesMonitoring" path.

smorovic commented 2 weeks ago

I have now run this on an FU machine (with a GPU) and I get a slightly different stack trace with more information:

Thread 1 (Thread 0x7f5639321640 (LWP 2990918) "cmsRun"):
#0  0x00007f5639eeb0e1 in poll () from /lib64/libc.so.6
#1  0x00007f5630bbe6af in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f5630b72dbc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f5630b73720 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f5639e95f7b in __memmove_avx_unaligned () from /lib64/libc.so.6
#6  0x00007f55b54723ba in Json::duplicateAndPrefixStringValue(char const*, unsigned int) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/external/el8_amd64_gcc12/lib/libtensorflow_framework.so.2
#7  0x00007f55b5472582 in Json::Value::Value(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/external/el8_amd64_gcc12/lib/libtensorflow_framework.so.2
#8  0x00007f55874cf94e in HLTriggerJSONMonitoring::globalEndLuminosityBlockSummary(edm::LuminosityBlock const&, edm::EventSetup const&, HLTriggerJSONMonitoringData::lumisection*) const () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginHLTriggerJSONMonitoringPlugins.so
#9  0x00007f55874d4388 in virtual thunk to edm::global::impl::LuminosityBlockSummaryCacheHolder<edm::global::EDAnalyzerBase, HLTriggerJSONMonitoringData::lumisection>::doEndLuminosityBlockSummary_(edm::LuminosityBlock const&, edm::EventSetup const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6/lib/el8_amd64_gcc12/pluginHLTriggerJSONMonitoringPlugins.so

Here it seems to be using Json::Value from the tensorflow library, whereas we have our own integrated version (older, and modified for thread safety) in EventFilter/Utilities:

https://github.com/cms-sw/cmssw/blob/master/EventFilter/Utilities/interface/json.h
https://github.com/cms-sw/cmssw/blob/master/EventFilter/Utilities/interface/value.h

That version does not include a Json::duplicateAndPrefixStringValue function anywhere.

I think what happens is this: when the EventFilter/Utilities library is loaded because the services are in use, the correct version is picked up and there is no crash. Otherwise we end up using the tensorflow version and, since the code is compiled against the headers from EventFilter/Utilities, this causes memory corruption and a crash, either here or later in the module. On lxplus I also noticed that the crash happens on a simple string size() call made after a Json::Value is defined in the code, and that removing the Json::Value removes the crash.
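To illustrate the mechanism described above, here is a minimal sketch. It is not taken from CMSSW or from the tensorflow-bundled JSON code: the namespaces `legacy_json` and `bundled_json`, the member names, and the helper `layoutsDiffer` are all hypothetical. The two `Value` types stand in for two JSON implementations that would share the name `Json::Value` but not a memory layout. If both really were called `Json::Value`, the caller would reserve and access memory according to the header it was compiled against, while the functions actually executed could come from whichever library the dynamic linker resolved first.

```cpp
#include <string>

// Hypothetical stand-in for the older copy that the client code is compiled against.
namespace legacy_json {
  struct Value {
    int type;          // kind of value stored
    std::string text;  // string payload held by value
  };
}  // namespace legacy_json

// Hypothetical stand-in for a different copy bundled inside another shared library.
namespace bundled_json {
  struct Value {
    char const* text;      // string payload held as a raw buffer
    unsigned short type;   // kind of value stored
    unsigned short flags;  // extra bookkeeping absent from the other copy
  };
}  // namespace bundled_json

// True when the two layouts disagree in size. With identical type names, a
// constructor from one copy writing into storage sized for the other copy is
// the kind of mismatch that could corrupt memory.
constexpr bool layoutsDiffer() {
  return sizeof(legacy_json::Value) != sizeof(bundled_json::Value);
}
```

In this sketch the types are kept in separate namespaces on purpose, so the example itself stays well defined; the failure mode only arises when both definitions carry the same fully qualified name.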

makortel commented 2 weeks ago

Sounds like a one-definition rule violation. If the copy in EventFilter/Utilities still needs to be kept, I'd recommend moving all the relevant code into a CMS-specific namespace.

smorovic commented 2 weeks ago

> Sounds like a one-definition rule violation. If the copy in EventFilter/Utilities still needs to be kept, I'd recommend moving all the relevant code into a CMS-specific namespace.

I wouldn't dare to change to a different version in the short term, and in the long term we were already thinking of evaluating different json implementations. Using a namespace seems fine (it could be "evf", which is used for most of the EventFilter/Utilities code). I'll work on those changes.

fwyzard commented 2 weeks ago

> in the long term we were already thinking of evaluating different json implementations.

In other CMSSW packages we've been using https://github.com/nlohmann/json, which is available as an external via <use name="json"/>.

smorovic commented 2 weeks ago

Updated in: https://github.com/cms-sw/cmssw/pull/44989 (I used the jsoncollector namespace.)
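As a rough sketch of what wrapping the in-house copy in a dedicated namespace achieves (this is a simplified illustration, not the code from the PR): only the namespace name `jsoncollector` comes from this thread. The nesting `jsoncollector::Json`, the member layouts, and the helper `typesAreDistinct` are assumptions made for the example.

```cpp
#include <string>
#include <type_traits>

// Stand-in for a third-party JSON type that lives in the top-level Json namespace.
namespace Json {
  struct Value {
    char const* text;  // hypothetical member, for illustration only
  };
}  // namespace Json

// Hypothetical, simplified in-house copy. The extra enclosing namespace gives
// it a different fully qualified name, hence different mangled symbols.
namespace jsoncollector {
  namespace Json {
    struct Value {
      std::string text;  // hypothetical member, for illustration only
    };
  }  // namespace Json
}  // namespace jsoncollector

// The two types are now unrelated as far as the compiler and linker are
// concerned, so a definition from one library cannot be substituted for the other.
constexpr bool typesAreDistinct() {
  return !std::is_same< ::Json::Value, jsoncollector::Json::Value>::value;
}
```

Client code would then spell out `jsoncollector::Json::Value` (or pull it in with a using-declaration), which keeps it bound to the in-house implementation regardless of which other libraries happen to be loaded.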

smorovic commented 2 weeks ago

The crash is gone in 14_0_6 with the backport applied. I'll open a backport PR as well.

mmusich commented 1 week ago

proposed fixes are merged:

mmusich commented 1 week ago

+hlt

mmusich commented 1 week ago

@cms-sw/daq-l2 this issue could be closed, right?

smorovic commented 1 week ago

+1 yes

cmsbuild commented 1 week ago

This issue is fully signed and ready to be closed.

mmusich commented 4 days ago

@cmsbuild, please close