cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

HLT crashes in Run 380399 #44923

Closed trtomei closed 2 months ago

trtomei commented 6 months ago

Crashes observed in collisions Run 380399. Stack traces:

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Mon May 6 09:33:54 CEST 2024
Thread 1 (Thread 0x7f9b37c4a640 (LWP 552137) "cmsRun"):
#0 0x00007f9b388160e1 in poll () from /lib64/libc.so.6
#1 0x00007f9b2eb792ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007f9b2eb2cafc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007f9b2eb2d460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f9b38ac882b in __lll_lock_wait () from /lib64/libpthread.so.0
#6 0x00007f9b38ac1ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#7 0x00007f9ad36b5266 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt::tryReuseCachedBlock(cms::alpakatools::CachingAllocator >::BlockDescriptor&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#8 0x00007f9ad36b641f in alpaka_cuda_async::EcalRawToDigiPortable::produce(alpaka_cuda_async::device::Event&, alpaka_cuda_async::device::EventSetup const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#9 0x00007f9ad36b1f73 in alpaka_cuda_async::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#10 0x00007f9b3b27a47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#11 0x00007f9b3b25ec2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007f9b3b1e6f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007f9b3b1e74c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#14 0x00007f9b3b157bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007f9b39992281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f9b34120480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f9b34120480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#17 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#18 0x00007f9b3b16841b in edm::FinalWaitingTask::wait() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007f9b3b17224d in edm::EventProcessor::processRuns() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#20 0x00007f9b3b1727b1 in edm::EventProcessor::runToCompletion() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#21 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00007f9b3997e9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#23 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040517c in main ()
[ message truncated - showing only crashed thread ] 

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Mon May 6 09:25:57 CEST 2024
Thread 1 (Thread 0x7fcd588eb640 (LWP 2367851) "cmsRun"):
#0 0x00007fcd594b70e1 in poll () from /lib64/libc.so.6
#1 0x00007fcd4f8092ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007fcd4f7bcafc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007fcd4f7bd460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fcd59765020 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#6 0x00007fcd3ae57e86 in ?? () from /lib64/libcuda.so.1
#7 0x00007fcd3ab754d7 in ?? () from /lib64/libcuda.so.1
#8 0x00007fcd3ac46009 in ?? () from /lib64/libcuda.so.1
#9 0x00007fcd3c7b41b7 in ?? () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/libcudart.so.12
#10 0x00007fcd3c7f0490 in cudaEventQuery () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/libcudart.so.12
#11 0x00007fcd464a00a5 in cms::alpakatools::EventCache<alpaka::EventUniformCudaHipRt::get(alpaka::DevUniformCudaHipRt) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/libHeterogeneousCoreAlpakaCoreCudaAsync.so
#12 0x00007fcd464a436b in alpaka_cuda_async::detail::EDMetadataSentry::EDMetadataSentry(edm::StreamID) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/libHeterogeneousCoreAlpakaCoreCudaAsync.so
#13 0x00007fccf4392ed5 in alpaka_cuda_async::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#14 0x00007fcd5bf1b47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007fcd5beffc2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#16 0x00007fcd5be87f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fcd5be884c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fcd5bdf8bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007fcd5a633281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fcd54db0480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#20 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fcd54db0480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#21 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#22 0x00007fcd5be0941b in edm::FinalWaitingTask::wait() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#23 0x00007fcd5be1324d in edm::EventProcessor::processRuns() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#24 0x00007fcd5be137b1 in edm::EventProcessor::runToCompletion() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#25 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#26 0x00007fcd5a61f9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#27 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#28 0x000000000040517c in main ()
[ message truncated - showing only crashed thread ] 

We tried to reproduce it with the following recipe, but the crash did not reproduce.

#!/bin/bash -ex

# CMSSW_14_0_6_MULTIARCHS

hltGetConfiguration run:380399 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000130_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000145_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000225_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000211_fu-c2b01-28-01_pid552137.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000280_fu-c2b01-28-01_pid552137.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000281_fu-c2b01-28-01_pid552137.root'\
> hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Frame #8 in the first stack trace seems to point to the alpaka_cuda_async::EcalRawToDigiPortable::produce() method.

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

Best regards, Thiago (for FOG)

cmsbuild commented 6 months ago

cms-bot internal usage

cmsbuild commented 6 months ago

A new Issue was created by @trtomei.

@makortel, @smuzaffar, @sextonkennedy, @rappoccio, @Dr15Jones, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

trtomei commented 6 months ago

Nota bene: the EcalRawToDigi may be a red herring; the original ELOG here mentions something completely different: http://cmsonline.cern.ch/cms-elog/1213820

The stack traces that I posted above are the ones I got from F3mon, though.

missirol commented 6 months ago

For reference, here are the log files of the 2 jobs: old_hlt_run380399_pid552137.log, old_hlt_run380399_pid2367851.log

As reported in the elog above, they both contain the following.

src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77:
void alpaka_cuda_async::vertexFinder::fitVertices(const TAcc &, reco::ZVertexLayout<128UL,
false>::ViewTemplateFreeParams<128UL, false, true, false> &, vertexFinder::PixelVertexWSSoALayout<128UL,
false>::ViewTemplateFreeParams<128UL, false, true, false> &, float) [with TAcc =
alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>]:
block: [0,0,0], thread: [24,0,0] Assertion `wv[i] > 0.f` failed.

makortel commented 6 months ago

assign heterogeneous, reconstruction

cmsbuild commented 6 months ago

New categories assigned: heterogeneous,reconstruction

@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 6 months ago

So in both cases the assertion in https://github.com/cms-sw/cmssw/blob/8391245a8a5febd6fdaec1e66395b67f9d2dd3db/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h#L76-L80 fails, after which the processes get stuck until they get killed 6-7 hours later (and, because of the external termination, the stack traces of only "the crashed thread" are not meaningful on their own).

makortel commented 6 months ago

The state of cmsRun seems similar to https://github.com/cms-sw/cmssw/issues/44769#issuecomment-2066613569 : all the TBB threads are stuck in acquiring locks (old_hlt_run380399_pid2367851.log has two TBB threads that don't have any work to do), and the CUDA threads are doing something and presumably holding the lock.

makortel commented 6 months ago

FYI @AdrianoDee

mmusich commented 6 months ago

Nota bene: the EcalRawToDigi may be a red herring; the original ELOG here mentions something completely different:

@trtomei, all circumstantial evidence points to vertex reconstruction. Can you edit the title to remove the EcalRawToDigi mention?

dan131riley commented 6 months ago

For emphasis: external termination request means it timed out, likely due to a deadlock. The timeout signal will be delivered to thread 1, but to diagnose the deadlock we need to see the stack traces of all the threads.

missirol commented 6 months ago

Looks like this happened again. One LS (run 380624, LS 411) could not be "closed" because one process was stuck and had to be terminated manually hours later. The full log (see below) shows again

src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77: [..]
 block: [0,0,0], thread: [21,0,0] Assertion `wv[i] > 0.f` failed.

similarly to https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2098979035.

http://cmsonline.cern.ch/cms-elog/1214996

old_hlt_run380624_pid2375140.log

VinInn commented 6 months ago

There should be no asserts on GPU in the first place. In my opinion this is something that was seen already in the past. In standard CUDA in CMSSW we never had the asserts enabled.

mmusich commented 6 months ago

Maybe I'm missing something:

https://github.com/cms-sw/cmssw/blob/22f751f0592542ae94a797e3fc22294ddc4626cf/RecoTracker/PixelVertexFinding/plugins/gpuFitVertices.h#L70

VinInn commented 6 months ago

The asserts in CUDA were compiled out because they are slow. There is another thread on the subject. I understood that (by mistake) they are back since ALPAKA.

Of course the assert is still there on CPU, but it would not print block: [0,0,0], thread: [21,0,0], so this one is coming from CUDA, and in the past it would not have triggered!

fwyzard commented 6 months ago

Currently we do have asserts in Alpaka code, because the code is new and IMHO it's better to get meaningful errors with a well defined location than random failures.

Within alpaka code:

I understood that (by mistake) they are back since ALPAKA.

The plan is to change all assert(...) in alpaka code to ALPAKA_ASSERT_ACC(...), then measure the impact of enabling/disabling them. I'd probably keep them enabled for a 1%-2% slowdown, disable them for a 5% slowdown.
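
For illustration, a minimal hypothetical sketch of that kind of change (an invented helper, not the actual fitVertices code), assuming the ALPAKA_ASSERT_ACC macro mentioned above:

// Hypothetical sketch, not the actual CMSSW patch: a plain assert() in alpaka
// device code is replaced by ALPAKA_ASSERT_ACC, which can be enabled or disabled
// per accelerator backend at build time.
#include <alpaka/alpaka.hpp>

template <typename TAcc>
ALPAKA_FN_ACC void checkVertexWeight(TAcc const& acc, float const* wv, unsigned int i) {
  // before: assert(wv[i] > 0.f);
  ALPAKA_ASSERT_ACC(wv[i] > 0.f);
}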

The fact that asserts may be causing deadlocks is a different matter: in principle, a failed assert in gpu code should cause an asynchronous error, and the next CUDA runtime function should detect it and fail.
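
For illustration, a minimal sketch of that expectation (hypothetical helper, plain CUDA runtime API):

// After a device-side assert fires, the error is expected to surface on the next
// CUDA runtime call (typically as cudaErrorAssert) rather than the process hanging.
#include <cuda_runtime.h>
#include <cstdio>

bool checkStreamAfterKernel(cudaStream_t stream) {
  cudaError_t err = cudaStreamSynchronize(stream);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error after kernel: %s\n", cudaGetErrorString(err));
    return false;
  }
  return true;
}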

Do we have a way to reproduce the problem offline?

VinInn commented 6 months ago

Here is the old issue on CPU: #37820

VinInn commented 6 months ago

Did anybody reproduce the assert offline?

gparida commented 5 months ago

This crash seems to appear again in collision run 381067. Attaching the error log to this comment: old_hlt_run381067_pid1000282.log

We were unable to reproduce the error after running the HLT config again on the error stream files.

VinInn commented 5 months ago

I do not see any sign of the assert in the log from run 381067: can somebody post a script to reproduce it offline?

mmusich commented 5 months ago

I do not see any sign of the assert in the log from run 381067:

it's at line 36196 in the log attached:

src/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h:77: void alpaka_cuda_async::vertexFinder::fitVertices(const TAcc &, reco::ZVertexLayout<128UL, false>::ViewTemplateFreeParams<128UL, false, true, false> &, vertexFinder::PixelVertexWSSoALayout<128UL, false>::ViewTemplateFreeParams<128UL, false, true, false> &, float) [with TAcc = alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>]: block: [0,0,0], thread: [24,0,0] Assertion `wv[i] > 0.f` failed.

mmusich commented 5 months ago

can somebody post a script to reproduce it offline?

@gparida if you tried to reproduce offline, can you post the script you used here?

fwyzard commented 5 months ago

Note that (also) in this case, the job was stuck not processing data for a while, and was eventually killed.

gparida commented 5 months ago

I tried reproducing it (CMSSW_14_0_7_MULTIARCHS) offline, unsuccessfully. I used the following:

#!/bin/bash -ex

# CMSSW_14_0_7_MULTIARCHS

hltGetConfiguration run:381067 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'/store/group/tsg/FOG/error_stream_root/run381067/run381067_ls0164_index000095_fu-c2b04-28-01_pid1000282.root,'\
'/store/group/tsg/FOG/error_stream_root/run381067/run381067_ls0164_index000109_fu-c2b04-28-01_pid1000282.root'&> hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Also, tried to rerun the job with

process.options.numberOfThreads = 32
process.options.numberOfStreams = 32

makortel commented 5 months ago

Attaching the error log to this comment old_hlt_run381067_pid1000282.log

Umm, the log indicates the job had 992 threads spawned by OpenMP, all in this state:

Thread 1036 (Thread 0x7fb6e37e2700 (LWP 1012080) "cmsRun"):
#0  futex_wait (val=11872, addr=0x7fbae8882b44) at ../../../libgomp/config/linux/x86/futex.h:97
#1  do_wait (val=11872, addr=0x7fbae8882b44) at ../../../libgomp/config/linux/wait.h:67
#2  gomp_barrier_wait_end (bar=0x7fbae8882b40, state=11872) at ../../../libgomp/config/linux/bar.c:48
#3  0x00007fbb369263e8 in gomp_simple_barrier_wait (bar=0x7fbae8882b40) at ../../../libgomp/config/posix/simple-bar.h:60
#4  gomp_thread_start (xdata=<optimized out>) at ../../../libgomp/team.c:133
#5  0x00007fbc4c0191ca in start_thread () from /lib64/libpthread.so.0
#6  0x00007fbc4bc84e73 in clone () from /lib64/libc.so.6

What is bringing in the OpenMP threading?

VinInn commented 5 months ago

First library in the list!

ldd RivetGeneratorInterfaceRivetInterface.so | grep gomp
libgomp.so.1 => /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02838/slc7_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib64/libgomp.so.1 (0x00007f91271d0000)

VinInn commented 5 months ago

More:
libPhysicsToolsPatAlgos.so
libRecoEgammaPhotonIdentification.so
pluginGeneratorInterfaceRivetInterface_plugins.so
pluginPhysicsToolsNanoAODPlugins.so
pluginPhysicsToolsNanoAODRNuplePlugins.so
pluginPhysicsToolsPatAlgos_plugins.so
pluginPhysicsToolsPatExamples_plugins.so
pluginPhysicsToolsScoutingPlugins.so

and a few more

VinInn commented 5 months ago

maybe we should set export OMP_NUM_THREADS=1

VinInn commented 5 months ago

so I used gdb and info threads, and it looks to me that each CMSSW thread spawns ncpu-1 OMP threads, all sitting in

 103  Thread 0x7fd468fb2700 (LWP 107236) "cmsRun"          futex_wait (val=32, addr=0x7fd5051e5dc4) at ../../../libgomp/config/linux/x86/futex.h:97

One can break on gomp_thread_start and see each of them starting.

VinInn commented 5 months ago

export OMP_NUM_THREADS=1 means gomp_thread_start is never called, as gomp uses the main thread as thread 0...

dan131riley commented 5 months ago

I thought we already had calls in cmsRun to limit OMP threads. If we don't, we should. I do not like relying on environment variables for this.
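
For illustration, a minimal sketch of what such a call could look like (hypothetical function name, not an existing cmsRun hook):

// Limit OpenMP inside the process itself, instead of relying on the
// OMP_NUM_THREADS environment variable; to be called once at startup,
// before any library opens an OpenMP parallel region.
#include <omp.h>

void limitOpenMPToOneThread() {
  omp_set_num_threads(1);  // subsequent parallel regions run with a single thread
}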

makortel commented 5 months ago

The culprit seems to be XGBoost

#0  0x00007ffff51a8e65 in clone () from /lib64/libc.so.6
#1  0x00007ffff553b08f in create_thread () from /lib64/libpthread.so.0
#2  0x00007ffff553ca4b in pthread_create@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00007fff724f9a10 in gomp_team_start (
    fn=fn@entry=0x7fff72922630 <void xgboost::common::ParallelFor<unsigned long, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}>(unsigned long, int, xgboost::common::Sched, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}) [clone ._omp_fn.3]>, data=data@entry=0x7fffffff28a0,
    nthreads=nthreads@entry=16, flags=flags@entry=0, team=0x7fff4d1ac400, taskgroup=taskgroup@entry=0x0) at ../../../libgomp/team.c:858
#4  0x00007fff724f06e1 in GOMP_parallel (
    fn=0x7fff72922630 <void xgboost::common::ParallelFor<unsigned long, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}>(unsigned long, int, xgboost::common::Sched, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}) [clone ._omp_fn.3]>, data=0x7fffffff28a0, num_threads=16, flags=0)
    at ../../../libgomp/parallel.c:176
#5  0x00007fff7291dd7b in void xgboost::common::ParallelFor<unsigned long, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}>(unsigned long, int, xgboost::common::Sched, xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int)::{lambda(unsigned int)#1}) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#6  0x00007fff7291e051 in void xgboost::predictor::PredictBatchByBlockOfRowsKernel<xgboost::predictor::SparsePageView, 64ul>(xgboost::predictor::SparsePageView, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int, std::vector<xgboost::RegTree::FVec, std::allocator<xgboost::RegTree::FVec> >*, int) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#7  0x00007fff729315e6 in xgboost::predictor::CPUPredictor::PredictDMatrix(xgboost::DMatrix*, std::vector<float, std::allocator<float> >*, xgboost::gbm::GBTreeModel const&, int, int) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#8  0x00007fff727b92e8 in xgboost::gbm::GBTree::PredictBatch(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, unsigned int, unsigned int) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#9  0x00007fff727fe745 in xgboost::LearnerImpl::Predict(std::shared_ptr<xgboost::DMatrix>, bool, xgboost::HostDeviceVector<float>*, unsigned int, unsigned int, bool, bool, bool, bool, bool) ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#10 0x00007fff7263e843 in XGBoosterPredictFromDMatrix () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libxgboost.so
#11 0x00007fff72b8a5ea in PhotonXGBoostEstimator::computeMva(float, float, float, float, float, float, float, float, float) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libRecoEgammaPhotonIdentification.so
#12 0x00007fff72bf5a6f in PhotonXGBoostProducer::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const ()
   from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginRecoEgammaPhotonIdentificationPlugins.so

mmusich commented 5 months ago

@smorovic FYI

VinInn commented 5 months ago

the number of threads can be set in xgboost https://xgboost.readthedocs.io/en/stable/parameter.html

Still, I would prefer to limit the number of OMP threads (to 1!) globally.

btw apparently xgboost can run on gpu as well....

WAIT: did we already have this discussion once?

fwyzard commented 5 months ago

The culprit seems to be XGBoost

which was added as a dependency of RecoEgamma/PhotonIdentification in CMSSW_14_0_4 (#44500).

makortel commented 5 months ago

The culprit seems to be XGBoost

which was added as a dependency of RecoEgamma/PhotonIdentification in CMSSW_14_0_4 (#44500).

Let's tag @cms-sw/ml-l2 too

VinInn commented 5 months ago

with the question: why XGBoost and not cmssw GBTree? For training anything can be used; for inference it is better to stick to the preferred (and optimized) implementation.

mmusich commented 5 months ago

with the question: why XGBoost and not cmssw GBTree?

Was discussed in https://github.com/cms-sw/cmssw/issues/44542#issuecomment-2068155042 and following

VinInn commented 5 months ago

I suggest, as an immediate action at least at HLT, to limit the number of OMP threads to 1!

fwyzard commented 5 months ago

I suggest as immediate action that we remove this crap from CMSSW.

mmusich commented 5 months ago

the number of threads can be set in xgboost https://xgboost.readthedocs.io/en/stable/parameter.html

@smorovic is it something we do directly in the plugin added in https://github.com/cms-sw/cmssw/pull/44500 ? Can you have a look?

fwyzard commented 5 months ago

Note that if XGBoost uses multiple threads, the timing measurements will fail to account for the time it uses.

mmusich commented 5 months ago

would this

diff --git a/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc b/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
index 8b5cae21c48..4ebb8b6efd3 100644
--- a/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
+++ b/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
@@ -3,6 +3,9 @@

 PhotonXGBoostEstimator::PhotonXGBoostEstimator(const edm::FileInPath& weightsFile, int best_ntree_limit) {
   XGBoosterCreate(NULL, 0, &booster_);
+  // set number of threads to 1, see https://github.com/cms-sw/cmssw/issues/44923
+  XGBoosterSetParam(booster_, "nthread", "1");
+
   XGBoosterLoadModel(booster_, weightsFile.fullPath().c_str());
   best_ntree_limit_ = best_ntree_limit;

be enough? I tried this

so I used gdb and info threads

and I see:

(gdb) info threads
  Id   Target Id                                             Frame 
  1    Thread 0x7ffff469b640 (LWP 1842094) "cmsRun"          __GI__dl_debug_state () at dl-debug.c:74
* 2    Thread 0x7fffb8e9c700 (LWP 1843076) "cmsRun"          0x00007ffff551aab4 in read () from /lib64/libpthread.so.0
  3    Thread 0x7fffb85f6700 (LWP 1843083) "cuda0000340000e" 0x00007ffff5268301 in poll () from /lib64/libc.so.6
  4    Thread 0x7fff9ffff700 (LWP 1843086) "cuda-EvtHandlr"  0x00007ffff5268301 in poll () from /lib64/libc.so.6
  5    Thread 0x7fff5f65b700 (LWP 1843284) "cmsRun"          0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
  6    Thread 0x7fff5ee5a700 (LWP 1843285) "cmsRun"          0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
  7    Thread 0x7fff5e659700 (LWP 1843286) "cmsRun"          0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
  8    Thread 0x7fff5de58700 (LWP 1843287) "cmsRun"          0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
  9    Thread 0x7fff5d657700 (LWP 1843288) "cmsRun"          0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
  10   Thread 0x7fff5ce56700 (LWP 1843289) "cmsRun"          0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
  11   Thread 0x7fff5c655700 (LWP 1843290) "cmsRun"          0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
  12   Thread 0x7fff5be54700 (LWP 1843291) "cmsRun"          0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
  13   Thread 0x7fff5b653700 (LWP 1843292) "cmsRun"          0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
  14   Thread 0x7fff5ae52700 (LWP 1843293) "cmsRun"          0x00007ffff5273247 in epoll_wait () from /lib64/libc.so.6
  15   Thread 0x7fff5a651700 (LWP 1843294) "cmsRun"          0x00007ffff551b180 in nanosleep () from /lib64/libpthread.so.0
  16   Thread 0x7fff59e50700 (LWP 1843295) "cmsRun"          0x00007ffff5519da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  17   Thread 0x7fff5964f700 (LWP 1843296) "cmsRun"          0x00007ffff5519da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  18   Thread 0x7fff58e4e700 (LWP 1843297) "cmsRun"          0x00007ffff5519da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  19   Thread 0x7fff4581a700 (LWP 1843407) "cmsRun"          0x00007ffff551745c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  20   Thread 0x7fff45019700 (LWP 1843408) "cmsRun"          0x00007ffff551745c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  21   Thread 0x7fff43fff700 (LWP 1843409) "cmsRun"          0x00007ffff551745c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  23   Thread 0x7fff298e1700 (LWP 1844089) "cmsRun"          0x00007ffff5519da6 in do_futex_wait.constprop () from /lib64/libpthread.so.0
  24   Thread 0x7fff06a87700 (LWP 1844090) "cmsRun"          0x00007ffff517d9bd in syscall () from /lib64/libc.so.6
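
For reference, a self-contained sketch of the same idea against the XGBoost C API (hypothetical helper, not the actual CMSSW code):

// Create a booster and force single-threaded prediction before loading the model,
// so that XGBoost does not spawn OpenMP worker threads at inference time.
#include <xgboost/c_api.h>

BoosterHandle makeSingleThreadedBooster(const char* modelPath) {
  BoosterHandle booster = nullptr;
  XGBoosterCreate(nullptr, 0, &booster);
  XGBoosterSetParam(booster, "nthread", "1");
  XGBoosterLoadModel(booster, modelPath);
  return booster;
}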

smorovic commented 5 months ago

Hello, this doesn't look good, as apparently jobs use over 1000 threads. So either we should find a fix quickly or disable those paths (maybe by prescaling them to 0 or removing them from the menu).

A quick search suggests setting this environment variable: os.environ['OMP_NUM_THREADS'] = "1"

XGBoosterSetParam(booster_, "nthread", "1"); as suggested by @mmusich could also work; I have to try it.

I can patch the cmsRun startup scripts of hltd to insert the environment variable, and I should find a machine to try this on. In general, this makes moving to GBTree (which I still haven't started working on) a priority.

smorovic commented 5 months ago

I added this environment variable on fu-c2b05-42-01, which belongs to the ZDC partition and is thus out of the run, and manually attached it to one RUBU (c2a11-07-01); it will stay there until the run ends.

ps huH p 3206777 | wc -l

44

This was 1036 on other machines, so it seems to work in limiting the thread count. I will check again in a while to make sure it doesn't grow.

I would suggest deploying this script to all FUs for the time being, maybe after this fill (or, if the dump is late, then in the morning). @fwyzard, do you agree? Later I will try with XGBoosterSetParam.

smorovic commented 5 months ago

would this


diff --git a/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc b/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
index 8b5cae21c48..4ebb8b6efd3 100644
--- a/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
+++ b/RecoEgamma/PhotonIdentification/src/PhotonXGBoostEstimator.cc
@@ -3,6 +3,9 @@

 PhotonXGBoostEstimator::PhotonXGBoostEstimator(const edm::FileInPath& weightsFile, int best_ntree_limit) {
   XGBoosterCreate(NULL, 0, &booster_);
+  // set number of threads to 1, see https://github.com/cms-sw/cmssw/issues/44923
+  XGBoosterSetParam(booster_, "nthread", "1");
+
   XGBoosterLoadModel(booster_, weightsFile.fullPath().c_str());
   best_ntree_limit_ = best_ntree_limit;

I also tried this, with fileListMode on an unused HLT node, and it has an equivalent effect to export OMP_NUM_THREADS=1. In both cases there are no "omp" threads running (otherwise there are many).

mmusich commented 5 months ago

I also tried this, with fileListMode on an unused HLT node, and it has an equivalent effect to export OMP_NUM_THREADS=1. In both cases there are no "omp" threads running (otherwise there are many).

Thanks for the test, I opened https://github.com/cms-sw/cmssw/pull/45042. If it's viable we can backport it and create an urgent patch (tagging also @francescobrivio as ORM).

francescobrivio commented 5 months ago

I would suggest deploying this script to all FUs for the time being, maybe after this fill (or, if the dump is late, then in the morning). @fwyzard, do you agree?

@smorovic @fwyzard given that building a patch release (even if urgent) might take a few days, especially considering it's the weekend, will you apply the patch on hltd today/tomorrow? (Even though, IIUC, there is no extreme urgency from the operational point of view.)

smorovic commented 5 months ago

I would suggest deploying this script to all FUs for the time being, maybe after this fill (or, if the dump is late, then in the morning). @fwyzard, do you agree?

@smorovic @fwyzard given that building a patch release (even if urgent) might take a few days, especially considering it's the weekend, will you apply the patch on hltd today/tomorrow? (Even though, IIUC, there is no extreme urgency from the operational point of view.)

We deployed the environment variable in the startup script a few minutes ago (from run 381188); we can keep it until the patch release.

francescobrivio commented 5 months ago

Thanks a lot @smorovic!