cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

HLT crashes in Run 380399 #44923

Closed trtomei closed 4 weeks ago

trtomei commented 4 months ago

Crashes observed in collisions Run 380399. Stack traces:

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Mon May 6 09:33:54 CEST 2024
Thread 1 (Thread 0x7f9b37c4a640 (LWP 552137) "cmsRun"):
#0 0x00007f9b388160e1 in poll () from /lib64/libc.so.6
#1 0x00007f9b2eb792ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007f9b2eb2cafc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007f9b2eb2d460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f9b38ac882b in __lll_lock_wait () from /lib64/libpthread.so.0
#6 0x00007f9b38ac1ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#7 0x00007f9ad36b5266 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt::tryReuseCachedBlock(cms::alpakatools::CachingAllocator >::BlockDescriptor&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#8 0x00007f9ad36b641f in alpaka_cuda_async::EcalRawToDigiPortable::produce(alpaka_cuda_async::device::Event&, alpaka_cuda_async::device::EventSetup const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#9 0x00007f9ad36b1f73 in alpaka_cuda_async::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#10 0x00007f9b3b27a47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#11 0x00007f9b3b25ec2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007f9b3b1e6f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007f9b3b1e74c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#14 0x00007f9b3b157bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007f9b39992281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f9b34120480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f9b34120480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#17 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#18 0x00007f9b3b16841b in edm::FinalWaitingTask::wait() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007f9b3b17224d in edm::EventProcessor::processRuns() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#20 0x00007f9b3b1727b1 in edm::EventProcessor::runToCompletion() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#21 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00007f9b3997e9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#23 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040517c in main ()
[ message truncated - showing only crashed thread ] 

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Mon May 6 09:25:57 CEST 2024
Thread 1 (Thread 0x7fcd588eb640 (LWP 2367851) "cmsRun"):
#0 0x00007fcd594b70e1 in poll () from /lib64/libc.so.6
#1 0x00007fcd4f8092ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007fcd4f7bcafc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007fcd4f7bd460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fcd59765020 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#6 0x00007fcd3ae57e86 in ?? () from /lib64/libcuda.so.1
#7 0x00007fcd3ab754d7 in ?? () from /lib64/libcuda.so.1
#8 0x00007fcd3ac46009 in ?? () from /lib64/libcuda.so.1
#9 0x00007fcd3c7b41b7 in ?? () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/libcudart.so.12
#10 0x00007fcd3c7f0490 in cudaEventQuery () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/libcudart.so.12
#11 0x00007fcd464a00a5 in cms::alpakatools::EventCache<alpaka::EventUniformCudaHipRt::get(alpaka::DevUniformCudaHipRt) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/libHeterogeneousCoreAlpakaCoreCudaAsync.so
#12 0x00007fcd464a436b in alpaka_cuda_async::detail::EDMetadataSentry::EDMetadataSentry(edm::StreamID) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/libHeterogeneousCoreAlpakaCoreCudaAsync.so
#13 0x00007fccf4392ed5 in alpaka_cuda_async::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#14 0x00007fcd5bf1b47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007fcd5beffc2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#16 0x00007fcd5be87f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fcd5be884c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fcd5bdf8bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007fcd5a633281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fcd54db0480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#20 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fcd54db0480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#21 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#22 0x00007fcd5be0941b in edm::FinalWaitingTask::wait() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#23 0x00007fcd5be1324d in edm::EventProcessor::processRuns() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#24 0x00007fcd5be137b1 in edm::EventProcessor::runToCompletion() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#25 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#26 0x00007fcd5a61f9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#27 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#28 0x000000000040517c in main ()
[ message truncated - showing only crashed thread ] 

We tried to reproduce the crashes with the following recipe, but they did not reproduce.

#!/bin/bash -ex

# CMSSW_14_0_6_MULTIARCHS

hltGetConfiguration run:380399 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000130_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000145_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000225_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000211_fu-c2b01-28-01_pid552137.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000280_fu-c2b01-28-01_pid552137.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000281_fu-c2b01-28-01_pid552137.root'\
> hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Frame #8 in the first stack trace seems to point to the alpaka_cuda_async::EcalRawToDigiPortable::produce() method.

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

Best regards, Thiago (for FOG)

fwyzard commented 3 months ago

Is it possible that alpaka::memset(event.queue(), hostProduct.buffer(), 0xFF); is NOT synchronous? (which does not make sense for the host buffer)

Very good point.

Yes, if the buffer is in pinned host memory in preparation for a GPU copy, the queue will be the GPU one, and the operation is potentially asynchronous.
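
To illustrate the point, a minimal sketch (not code from CMSSW; queue and hostBuffer are placeholder names, assuming an alpaka host buffer associated with a non-blocking device queue): the enqueued memset is only safe to rely on after waiting on the queue.

#include <cstdint>
#include <alpaka/alpaka.hpp>

// Sketch: resetting a host buffer through a device queue. On a non-blocking
// queue the memset is only enqueued, so the host must wait before touching
// the buffer, otherwise the pending memset can overwrite data written later.
template <typename TQueue, typename THostBuf>
void resetHostBuffer(TQueue& queue, THostBuf& hostBuffer) {
  alpaka::memset(queue, hostBuffer, 0xFF);  // potentially asynchronous
  alpaka::wait(queue);                      // block until the memset has completed
}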

fwyzard commented 3 months ago

But this should not change the behaviour of the ESProducer apart from the DEBUG statements.

fwyzard commented 3 months ago

I mean - maybe it does change the behaviour, but then it's a bug and we should fix it.

jsamudio commented 3 months ago

I have been investigating with the new Alpaka HCAL local reco.

Start with CMSSW_14_0_9_patch2_MULTIARCHS + #45277 + #45278 + #45324 + #45342 + #45210.

Use the same script as in https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2199709930, but also append the following to the configuration:

from HLTrigger.Configuration.customizeHLTforAlpaka import customizeHLTforAlpakaPFSoA
from HLTrigger.Configuration.customizeHLTforAlpaka import customizeHLTforAlpakaHcalLocalReco

process = customizeHLTforAlpakaHcalLocalReco(process)
process = customizeHLTforAlpakaPFSoA(process)

Now crashing with:

----- Begin Fatal Exception 03-Jul-2024 10:29:09 CEST-----------------------
An exception of category 'StdException' occurred while
  [0] Processing  Event run: 380399 lumi: 123 event: 121628668 stream: 0
  [1] Running path 'HLT_Mu12_DoublePFJets54MaxDeta1p6_PNet2BTag_0p11_v2'
  [2] Calling method for module HBHERecHitProducerPortable@alpaka/'hltHbheRecoSoA'
  Exception Message:
  A std::exception was thrown.
  /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_9_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_9_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(160) 'TApi::eventRecord(event.getNativeHandle(), queue.getNativeHandle())' returned error : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
----- End Fatal Exception -------------------------------------------------

Here is the log for one of the runs: hlt_run380399.log

And the corresponding compute-sanitizer --tool memcheck log: memcheck.log

mmusich commented 3 months ago

Start with CMSSW_14_0_9_patch2_MULTIARCHS + https://github.com/cms-sw/cmssw/pull/45277 + https://github.com/cms-sw/cmssw/pull/45278 + https://github.com/cms-sw/cmssw/pull/45324 + https://github.com/cms-sw/cmssw/pull/45342 + https://github.com/cms-sw/cmssw/pull/45210.

Just for the record, a good part of this (except for the alpaka HCAL local reco PR) will be included in CMSSW_14_0_10_MULTIARCHS, which is built but not yet uploaded.

VinInn commented 3 months ago

I do not understand why a memset of the host buffer should be asynchronous, given that filling it (as in the loop in the producers) is by definition synchronous. (I mean, one could make things complicated and schedule it as a CPU function on the GPU queue, but 1) that is not what we do, and 2) what would be the gain?)

Anyhow, at the moment the memset in the CachingAllocator is buggy, as it may overwrite the memory after it has been correctly filled by the producer.

fwyzard commented 3 months ago

I do not understand why a memset of the host buffer should be asynchronous, given that filling it (as in the loop in the producers) is by definition synchronous.

Technically: because, for pinned host memory, the allocation itself is potentially asynchronous.

In practice: asynchronous allocations are actually disabled for a host-side allocator that uses a device queue, so there is indeed a bug in how the memset is implemented for this specific case.

(I mean, one could make things complicated and schedule it as a CPU function on the GPU queue, but 1) that is not what we do, and 2) what would be the gain?)

Correct.

Anyhow, at the moment the memset in the CachingAllocator is buggy, as it may overwrite the memory after it has been correctly filled by the producer.

I agree, and will prepare a fix.

fwyzard commented 3 months ago

@VinInn do you think https://github.com/cms-sw/cmssw/pull/45368 fixes this problem?

A more efficient solution would be to make the memset itself non-asynchronous; I'm not sure if that can be done easily when the queue associated with the allocation is asynchronous, though.
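
For illustration only (this is not the change made in #45368, and the helper below is hypothetical): for a buffer that lives in host memory, the fill can be made synchronous regardless of the associated queue by writing through the buffer's native pointer on the host, assuming no device operation is still using that memory.

#include <cstddef>
#include <cstring>
#include <alpaka/alpaka.hpp>

// Hypothetical sketch: fill a host-memory buffer with a junk pattern synchronously,
// bypassing the (possibly non-blocking) device queue entirely.
template <typename THostBuf>
void fillHostBufferSynchronously(THostBuf& hostBuffer, std::size_t sizeBytes) {
  // write through the raw host pointer; std::memset completes before returning
  std::memset(alpaka::getPtrNative(hostBuffer), 0xFF, sizeBytes);
}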

VinInn commented 3 months ago

With the fix in #45368 no more crashes are observed. I ran the full script. This also implies that the crashes at the HLT farm are not easy to reproduce or emulate.

mmusich commented 2 months ago

Dear all, I am a bit lost about what the current (as of CMSSW_14_0_12) expectation is in terms of crashes when running with the options to fill the host and device allocators with junk memory (cf. https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2199627075). Is the error at https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2199586926 expected to be cured in CMSSW_14_0_12? @jsamudio

jsamudio commented 2 months ago

@mmusich I am also not sure where things stand currently; I was under the impression that #45368 was going to be the answer to the crashes. I guess this is not the case?

fwyzard commented 2 months ago

The various PRs implement and fix the possibility of filling memory with zero or junk values, which may be helpful for debugging.

Regular workflows are not affected either way.

mmusich commented 2 months ago

The various PRs implement and fix the possibility of filling memory with zero or junk values, which may be helpful for debugging.

Right. But do you expect that, when explicitly filling memory with zero or junk values in CMSSW_14_0_12 and running the current HLT menu (V1.3) over recent data, there would still be crashes or not?

fwyzard commented 2 months ago

I don't know.

According to the test that Vincenzo did, I don't expect it to fix the crashes.

mmusich commented 2 months ago

I don't know. According to the test that Vincenzo did, I don't expect it to fix the crashes.

I see. The reason why I ask is basically https://github.com/cms-sw/cmssw/issues/45555#issuecomment-2250084953.

missirol commented 2 months ago

maybe we should change the assert to some form of LogWarning with more details (full dump of vectors?) (just a printf, but be careful with the size of the output)

Coming back to this suggestion: since this issue is still unsolved and still causing crashes (approximately a few per week), would it make sense to integrate the following change, to have a bit more info in the log files when there is a crash?

(if we just demote the assert to a warning, I fear the warning might be missed, since I don't think we systematically check the logs of HLT jobs that don't crash)

diff --git a/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h b/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h
index a8c428e2f5a..2f78723e61d 100644
--- a/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h
+++ b/RecoTracker/PixelVertexFinding/plugins/alpaka/fitVertices.h
@@ -74,7 +74,19 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE::vertexFinder {
     alpaka::syncBlockThreads(acc);
     // reuse nn
     for (auto i : cms::alpakatools::uniform_elements(acc, foundClusters)) {
-      ALPAKA_ASSERT_ACC(wv[i] > 0.f);
+      bool const wv_cond = (wv[i] > 0.f);
+      if (not wv_cond) {
+        printf("ERROR: wv[%d] (%f) > 0.f failed\n", i, wv[i]);
+        // printing info on tracks associated to this vertex
+        for (auto trk_i = 0u; trk_i < nt; ++trk_i) {
+          if (iv[trk_i] != int(i)) {
+            continue;
+          }
+          printf("   iv[%d]=%d zt[%d]=%f ezt2[%d]=%f\n", trk_i, iv[trk_i], trk_i, zt[trk_i], trk_i, ezt2[trk_i]);
+        }
+        ALPAKA_ASSERT_ACC(false);
+      }
+
       zv[i] /= wv[i];
       nn[i] = -1;  // ndof
     }
mmusich commented 2 months ago

Coming back to this suggestion: since this issue is still unsolved and still causing crashes (approximately a few per week), would it make sense to integrate the following change, to have a bit more info in the log files when there is a crash?

as discussed at the last TSG meeting, I think that's a possible way forward. Opened:

missirol commented 1 month ago

Thanks @mmusich !

fwyzard commented 1 month ago

compute-sanitizer --tool=racecheck --racecheck-report=all does report potential "RAW" (read-after-write) and "WAR" (write-after-read) hazards in RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h:

========= Error: Potential WAR hazard detected at __shared__ 0xafac in block (0,0,0) :
=========     Read Thread (479,0,0) at 0x6d30 in /data/user/fwyzard/issue44923/CMSSW_14_0_13_patch1_MULTIARCHS/src/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h:148:void alpaka_cuda_async::vertexFinder::splitVertices<alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, (unsigned long)1>, unsigned int>>(const T1 &, reco::ZVertexLayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, vertexFinder::PixelVertexWSSoALayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, float)
=========     Write Thread (352,0,0) at 0x5db0 in /data/user/fwyzard/issue44923/CMSSW_14_0_13_patch1_MULTIARCHS/src/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h:69:void alpaka_cuda_async::vertexFinder::splitVertices<alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, (unsigned long)1>, unsigned int>>(const T1 &, reco::ZVertexLayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, vertexFinder::PixelVertexWSSoALayout<(unsigned long)128, (bool)0>::ViewTemplateFreeParams<(unsigned long)128, (bool)0, (bool)1, (bool)1> &, float)
=========     Current Value : 0, Incoming Value : 0
=========     ...
========= 

Looking at the two lines of RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h (line 69 and line 148), I think the problem may happen when this loop rolls over to the next iteration without any synchronisation, so that one thread sets nq = 0 on line 69 while another is still looping over it on line 148.

Adding

diff --git a/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h b/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
index e2ba0b46b8be..be3b20563663 100644
--- a/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
+++ b/RecoTracker/PixelVertexFinding/plugins/alpaka/splitVertices.h
@@ -150,6 +150,8 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE::vertexFinder {
           iv[it[k]] = igv;
       }

+      // synchronise the threads before starting the next iteration of the loop of the groups
+      alpaka::syncBlockThreads(acc);
     }  // loop on vertices
   }

before the end of the loop seems to make racecheck happy (well, it now complains about a different piece of code, in alpaka_cuda_async::pixelClustering::FindClus).

mmusich commented 1 month ago

assign hlt

cmsbuild commented 1 month ago

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented 1 month ago

Proposed fixes:

#45655 was included in CMSSW_14_0_14_MULTIARCHS, which was deployed on Aug 12, 2024 (see e-log: http://cmsonline.cern.ch/cms-elog/1230042) during run 384365.

No crashes of this type have been observed (so far) in the subsequent physics fill 9996.

mmusich commented 1 month ago

+hlt

makortel commented 1 month ago

+heterogeneous

mmusich commented 1 month ago

@cms-sw/reconstruction-l2 please consider signing this if there is no other follow up from your area, such that we could close this issue.

jfernan2 commented 4 weeks ago

+1

cmsbuild commented 4 weeks ago

This issue is fully signed and ready to be closed.

makortel commented 4 weeks ago

@cmsbuild, please close