cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

HLT crashes in Run 380399 #44923

Closed trtomei closed 4 weeks ago

trtomei commented 4 months ago

Crashes observed during collisions in run 380399. Stack traces:

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Mon May 6 09:33:54 CEST 2024
Thread 1 (Thread 0x7f9b37c4a640 (LWP 552137) "cmsRun"):
#0 0x00007f9b388160e1 in poll () from /lib64/libc.so.6
#1 0x00007f9b2eb792ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007f9b2eb2cafc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007f9b2eb2d460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f9b38ac882b in __lll_lock_wait () from /lib64/libpthread.so.0
#6 0x00007f9b38ac1ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#7 0x00007f9ad36b5266 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt::tryReuseCachedBlock(cms::alpakatools::CachingAllocator >::BlockDescriptor&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#8 0x00007f9ad36b641f in alpaka_cuda_async::EcalRawToDigiPortable::produce(alpaka_cuda_async::device::Event&, alpaka_cuda_async::device::EventSetup const&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#9 0x00007f9ad36b1f73 in alpaka_cuda_async::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#10 0x00007f9b3b27a47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#11 0x00007f9b3b25ec2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007f9b3b1e6f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007f9b3b1e74c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#14 0x00007f9b3b157bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007f9b39992281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f9b34120480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#16 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7f9b34120480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#17 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#18 0x00007f9b3b16841b in edm::FinalWaitingTask::wait() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007f9b3b17224d in edm::EventProcessor::processRuns() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#20 0x00007f9b3b1727b1 in edm::EventProcessor::runToCompletion() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#21 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#22 0x00007f9b3997e9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#23 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#24 0x000000000040517c in main ()
[ message truncated - showing only crashed thread ] 

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Mon May 6 09:25:57 CEST 2024
Thread 1 (Thread 0x7fcd588eb640 (LWP 2367851) "cmsRun"):
#0 0x00007fcd594b70e1 in poll () from /lib64/libc.so.6
#1 0x00007fcd4f8092ff in full_read.constprop () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2 0x00007fcd4f7bcafc in edm::service::InitRootHandlers::stacktraceFromThread() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3 0x00007fcd4f7bd460 in sig_dostack_then_abort () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fcd59765020 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#6 0x00007fcd3ae57e86 in ?? () from /lib64/libcuda.so.1
#7 0x00007fcd3ab754d7 in ?? () from /lib64/libcuda.so.1
#8 0x00007fcd3ac46009 in ?? () from /lib64/libcuda.so.1
#9 0x00007fcd3c7b41b7 in ?? () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/libcudart.so.12
#10 0x00007fcd3c7f0490 in cudaEventQuery () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/libcudart.so.12
#11 0x00007fcd464a00a5 in cms::alpakatools::EventCache<alpaka::EventUniformCudaHipRt::get(alpaka::DevUniformCudaHipRt) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/libHeterogeneousCoreAlpakaCoreCudaAsync.so
#12 0x00007fcd464a436b in alpaka_cuda_async::detail::EDMetadataSentry::EDMetadataSentry(edm::StreamID) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/libHeterogeneousCoreAlpakaCoreCudaAsync.so
#13 0x00007fccf4392ed5 in alpaka_cuda_async::stream::EDProducer<>::produce(edm::Event&, edm::EventSetup const&) [clone .lto_priv.0] () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/pluginEventFilterEcalRawToDigiPluginsPortableCudaAsync.so
#14 0x00007fcd5bf1b47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#15 0x00007fcd5beffc2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#16 0x00007fcd5be87f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits::Context const*) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fcd5be884c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits::execute() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fcd5bdf8bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00007fcd5a633281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fcd54db0480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#20 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=, this=0x7fcd54db0480) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#21 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#22 0x00007fcd5be0941b in edm::FinalWaitingTask::wait() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#23 0x00007fcd5be1324d in edm::EventProcessor::processRuns() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#24 0x00007fcd5be137b1 in edm::EventProcessor::runToCompletion() () from /opt/offline/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#25 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#26 0x00007fcd5a61f9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#27 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#28 0x000000000040517c in main ()
[ message truncated - showing only crashed thread ] 

We tried to reproduce the crash with the following recipe, but it did not reproduce.

#!/bin/bash -ex

# CMSSW_14_0_6_MULTIARCHS

hltGetConfiguration run:380399 \
--globaltag 140X_dataRun3_HLT_v3 \
--data \
--no-prescale \
--no-output \
--max-events -1 \
--input \
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000130_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000145_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0123_index000225_fu-c2b03-08-01_pid2367851.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000211_fu-c2b01-28-01_pid552137.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000280_fu-c2b01-28-01_pid552137.root,'\
'/store/group/tsg/FOG/debug/240507_run380399/run380399_ls0323_index000281_fu-c2b01-28-01_pid552137.root'\
> hlt.py

cat <<@EOF >> hlt.py
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

cmsRun hlt.py &> hlt.log

Frame #8 in the first stack trace seems to point to the alpaka_cuda_async::EcalRawToDigiPortable::produce() method.

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

Best regards, Thiago (for FOG)

trocino commented 4 months ago

There was another occurrence in run 381190. I attach the full log file here: old_hlt_run381190_pid3537775.log

We will request the error stream files for testing, unless you don't think they are needed anymore.

smorovic commented 4 months ago

I killed that process after seeing it was stuck in CUDA code.

> There was another occurrence in run 381190. I attach the full log file here: old_hlt_run381190_pid3537775.log
>
> We will request the error stream files for testing, unless you don't think they are needed anymore.

This one also had to be killed manually.

missirol commented 4 months ago

Just reporting another instance of the same issue in run-381286 (assert, job stuck, then killed manually). old_hlt_run381286_pid3340182.log

missirol commented 4 months ago

Just reporting (what I think are) two more instances of this issue.

old_hlt_run381443_pid245753.log old_hlt_run381479_pid1861586.log

missirol commented 4 months ago

Just reporting another instance of the same issue in run-381417 (assert, job stuck, then killed manually). old_hlt_run381417_pid191991.log

VinInn commented 4 months ago

Maybe we should change the assert into some form of LogWarning with more details (full dump of vectors?) (just a printf, but be careful with the size of the output).
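As a rough sketch of that suggestion (the function and variable names below are hypothetical, not the actual CMSSW code), the device-side assert could be replaced by a bounded printf so the job keeps running while still reporting details:

#include <alpaka/alpaka.hpp>
#include <cstdint>
#include <cstdio>

// Hypothetical sketch: instead of assert(detId != 0), print a capped diagnostic
// and continue, so the job survives and we still learn which element was bad.
ALPAKA_FN_ACC void reportInvalidHit(std::uint32_t idx, std::uint32_t detId, std::uint32_t nHits) {
  if (detId == 0 && idx < 16) {  // cap the output to keep the log small
    printf("invalid detId at index %u of %u hits\n", idx, nHits);
  }
}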

trocino commented 4 months ago

Another instance in run 381543. old_hlt_run381543_pid3462225.log

trocino commented 3 months ago

Another one from run 381544. old_hlt_run381544_pid668710.log

VinInn commented 3 months ago

I do not see how we can progress on this without some sort of instrumentation of the production code. If at least #44956 were solved, we could have tried to set the memory to junk...

mmusich commented 3 months ago

As the thread has become a bit chaotic with several off-topic discussions, and just to keep a record of all the instances of this type of crash so far, here is a complete list, with annotations on whether the job crashed or was killed manually after the assertion and getting stuck (when this is known):

@trtomei as you own this issue, you might want to update the issue description.

mmusich commented 3 months ago

> If at least https://github.com/cms-sw/cmssw/issues/44956 were solved, we could have tried to set the memory to junk...

https://github.com/cms-sw/cmssw/issues/44956 will be solved by https://github.com/cms-sw/cmssw/pull/45210. I have now tried to set the memory to junk for all the known cases we have a record of (see https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2165787809), after applying the changes from https://github.com/cms-sw/cmssw/pull/45210 on top of CMSSW_14_0_9_MULTIARCHS, using this script:

#!/bin/bash -ex

# List of run numbers
runs=(
  380399
  380624
  381067
  381190
  381286
  381443
  381479
  381417
  381543
  381544
)

# Base directory for input files on EOS
base_dir="/store/group/tsg/FOG/error_stream_root/run"

# Global tag for the HLT configuration
global_tag="140X_dataRun3_HLT_v3"

# EOS command (adjust this if necessary for your environment)
eos_cmd="eos"

# Loop over each run number
for run in "${runs[@]}"; do
  # Set the MALLOC_CONF environment variable
  export MALLOC_CONF=junk:true

  # Construct the input directory path
  input_dir="${base_dir}${run}"

  # Find all root files in the input directory on EOS
  root_files=$(${eos_cmd} find -f "/eos/cms${input_dir}" -name "*.root" | awk '{print "root://eoscms.cern.ch/" $0}' | paste -sd, -)

  # Check if there are any root files found
  if [ -z "${root_files}" ]; then
    echo "No root files found for run ${run} in directory ${input_dir}."
    continue
  fi

  # Create filenames for the HLT configuration and log file
  hlt_config_file="hlt_run${run}.py"
  hlt_log_file="hlt_run${run}.log"

  # Generate the HLT configuration file
  hltGetConfiguration run:${run} \
    --globaltag ${global_tag} \
    --data \
    --no-prescale \
    --no-output \
    --max-events -1 \
    --input ${root_files} > ${hlt_config_file}

  # Append additional options to the configuration file
  cat <<@EOF >> ${hlt_config_file}
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')  
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

  # Run the HLT configuration with cmsRun and redirect output to log file
  cmsRun ${hlt_config_file} &> ${hlt_log_file}

done

Unfortunately, I still don't get any hints from that.

missirol commented 3 months ago

Just reporting three more instances of the same issue in run-382250 (1 crash) and run-382258 (2 crashes). old_hlt_run382250_pid2754657.log old_hlt_run382258_pid3077523.log old_hlt_run382258_pid3186816.log

VinInn commented 3 months ago

We need a mechanism to set the GPU memory to junk each time the "allocator" returns a block.

fwyzard commented 3 months ago

That is technically easy to do. Do you expect it would help with the crashes? Or with the debugging?

For debugging you could try

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 2560361e796..1f6fab26e44 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -186,6 +186,9 @@ namespace cms::alpakatools {
         allocateNewBlock(block);
       }

+      // fill the re-used or newly allocated memory block
+      alpaka::memset(block.queue, block.buffer, 0xa5);
+
       return block.buffer->data();
     }
mmusich commented 3 months ago

> For debugging you could try

Is this supposed to compile out of the box? I get:

/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/view/Traits.hpp:247:77: error: no matching function for call to 'getExtents'
        enqueue(queue, createTaskMemset(std::forward<TViewFwd>(view), byte, getExtents(view)));
                                                                            ^~~~~~~~~~
src/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h:190:15: note: in instantiation of function template specialization 'alpaka::memset<std::optional<alpaka::BufCpu<std::byte, std::integral_constant<unsigned long, 1>, unsigned long>> &, std::optional<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiHipRt, false>>>' requested here
      alpaka::memset(block.queue, block.buffer, 0xa5);
              ^
fwyzard commented 3 months ago

It was supposed to, but it looks like I missed a step:

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 2560361e796..1f6fab26e44 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -186,6 +186,9 @@ namespace cms::alpakatools {
         allocateNewBlock(block);
       }

+      // fill the re-used or newly allocated memory block
+      alpaka::memset(block.queue, block.buffer.value(), 0xa5);
+
       return block.buffer->data();
     }
fwyzard commented 3 months ago

OK, this time I actually compiled it...

      // fill the re-used or newly allocated memory block
      alpaka::memset(*block.queue, *block.buffer, 0xa5);

I'm working on a PR to enable it as an option in the AlpakaService

missirol commented 3 months ago

Just reporting the logs of 6 more HLT jobs that got stuck, and were then killed, a few days ago.

[1] does not mention the assert discussed in this issue; not sure whether or not it got stuck for a different reason.

Unfortunately, this time we do not have the error files for these 6 crashes.

[1] old_hlt_run382299_pid1842918.log [2] old_hlt_run382300_pid1034238.log [3] old_hlt_run382300_pid3213702.log [4] old_hlt_run382314_pid3350560.log [5] old_hlt_run382344_pid2465490.log [6] old_hlt_run382344_pid2582026.log


Edit (Jun-30): just for the record, 4 more occurrences in run-382580, and 1 more in run-382594 (this time, all 5 show the usual assert message).

old_hlt_run382580_pid703431.log old_hlt_run382580_pid703928.log old_hlt_run382580_pid1080104.log old_hlt_run382580_pid3400326.log

old_hlt_run382594_pid3864027.log

fwyzard commented 3 months ago

> OK, this time I actually compiled it...
>
>       // fill the re-used or newly allocated memory block
>       alpaka::memset(*block.queue, *block.buffer, 0xa5);
>
> I'm working on a PR to enable it as an option in the AlpakaService

And finally, here they are: https://github.com/cms-sw/cmssw/pull/45341 (14.1.x) / https://github.com/cms-sw/cmssw/pull/45342 (14.0.x) .

To fill the NVIDIA GPU memory before every allocation or reuse with 0xA5, you can now use

process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocations = True

To fill the NVIDIA GPU memory before every deallocation or caching with 0x5A, you can now use

process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocations = True

To use different values and combinations for allocations, deallocations, caching, and reuse, the full options are

process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillAllocationValue = 0xA5
process.AlpakaServiceCudaAsync.deviceAllocator.fillReallocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillReallocationValue = 0x69
process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocations = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillDeallocationValue = 0x5A
process.AlpakaServiceCudaAsync.deviceAllocator.fillCaches = True
process.AlpakaServiceCudaAsync.deviceAllocator.fillCacheValue = 0x96

To do the same for the pinned host memory used in the GPU transfers, process.AlpakaServiceCudaAsync.hostAllocator accepts the same options.

To do the same for the CPU memory used by the alpaka modules running on the host, replace AlpakaServiceCudaAsync with AlpakaServiceSerialSync.

VinInn commented 3 months ago

I got:

AttributeError: 'Process' object has no attribute 'AlpakaServiceCudaAsync'
fwyzard commented 3 months ago

Eh, I guess it depends on what configuration you are running?

See the tests for an example.

VinInn commented 3 months ago

I used the script https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2166038461 posted by @mmusich above. Most probably we are still running the CUDA config. No idea how to convert it to ALPAKA.

mmusich commented 3 months ago

> Most probably we are still running the CUDA config. No idea how to convert it to ALPAKA.

Nope. Only Hcal is still CUDA. The rest is alpaka.

VinInn commented 3 months ago

I added the service as in the test; now I get

[innocent@lxplus800 innocent]$ cat  hlt_run380399.log
%MSG-i CUDAService:  (NoModuleName) 01-Jul-2024 10:34:16 CEST pre-events
CUDA runtime version 12.2, driver version 12.4, NVIDIA driver version 550.90.07
CUDA device 0: NVIDIA A100-PCIE-40GB (sm_80)
%MSG
%MSG-i AlpakaService:  (NoModuleName) 01-Jul-2024 10:34:16 CEST pre-events
AlpakaServiceCudaAsync succesfully initialised.
Found 1 device:
  - NVIDIA A100-PCIE-40GB
%MSG
CachingAllocator settings
  bin growth 2
  min bin    8
  max bin    30
  resulting bins:
         256   B
         512   B
           1 KiB
           2 KiB
           4 KiB
           8 KiB
          16 KiB
          32 KiB
          64 KiB
         128 KiB
         256 KiB
         512 KiB
           1 MiB
           2 MiB
           4 MiB
           8 MiB
          16 MiB
          32 MiB
          64 MiB
         128 MiB
         256 MiB
         512 MiB
           1 GiB
  maximum amount of cached memory: 64 GiB
CachingAllocator settings
  bin growth 2
  min bin    8
  max bin    30
  resulting bins:
         256   B
         512   B
           1 KiB
           2 KiB
           4 KiB
           8 KiB
          16 KiB
          32 KiB
          64 KiB
         128 KiB
         256 KiB
         512 KiB
           1 MiB
           2 MiB
           4 MiB
           8 MiB
          16 MiB
          32 MiB
          64 MiB
         128 MiB
         256 MiB
         512 MiB
           1 GiB
  maximum amount of cached memory: 8 GiB
%MSG-i AlpakaService:  (NoModuleName) 01-Jul-2024 10:34:16 CEST pre-events
AlpakaServiceSerialSync succesfully initialised.
Found 1 device:
  - AMD EPYC 7313 16-Core Processor
%MSG
01-Jul-2024 10:34:24 CEST  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run380399/run380399_ls0123_index000130_fu-c2b03-08-01_pid2367851.root
01-Jul-2024 10:34:27 CEST  Successfully opened file root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run380399/run380399_ls0123_index000130_fu-c2b03-08-01_pid2367851.root
Begin processing the 1st record. Run 380399, Event 121628654, LumiSection 123 on stream 0 at 01-Jul-2024 10:35:18.294 CEST
    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18ba000000 (16777216 bytes associated with queue 0x7f18df6ed010, event 0x7f18c3762790.

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor allocated new block at 0x7f194b2a9200 (131072 bytes associated with queue 0x7f18df6ed010, event 0x7f18c3762700.

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor allocated new block at 0x7f194b2c9200 (256 bytes associated with queue 0x7f18df6ed010, event 0x7f18c3761ad0.

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor allocated new block at 0x7f194b2c9400 (256 bytes associated with queue 0x7f18df6ed010, event 0x7f18c3761aa0.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18bb000000 (2097152 bytes associated with queue 0x7f18df6ed010, event 0x7f18c3761a40.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18bb200000 (524288 bytes associated with queue 0x7f18df6ed010, event 0x7f18c3760de0.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18bb280000 (131072 bytes associated with queue 0x7f18df6ed010, event 0x7f18c3760db0.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18bb2a0000 (256 bytes associated with queue 0x7f18df6ed010, event 0x7f18c3760d80.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18bb2a0200 (256 bytes associated with queue 0x7f18df6ed010, event 0x7f18c3760150.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB returned 256 bytes at 0x7f18bb2a0200 from associated queue 0x7f18df6ed010 , event 0x7f18c3760150 .
         1 available blocks cached (256 bytes), 5 live blocks (19529984 bytes) outstanding.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB returned 256 bytes at 0x7f18bb2a0000 from associated queue 0x7f18df6ed010 , event 0x7f18c3760d80 .
         2 available blocks cached (512 bytes), 4 live blocks (19529728 bytes) outstanding.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB returned 131072 bytes at 0x7f18bb280000 from associated queue 0x7f18df6ed010 , event 0x7f18c3760db0 .
         3 available blocks cached (131584 bytes), 3 live blocks (19398656 bytes) outstanding.

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor returned 256 bytes at 0x7f194b2c9400 from associated queue 0x7f18df6ed010 , event 0x7f18c3761aa0 .
         1 available blocks cached (256 bytes), 2 live blocks (131328 bytes) outstanding.

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor returned 256 bytes at 0x7f194b2c9200 from associated queue 0x7f18df6ed010 , event 0x7f18c3761ad0 .
         2 available blocks cached (512 bytes), 1 live blocks (131072 bytes) outstanding.

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor returned 131072 bytes at 0x7f194b2a9200 from associated queue 0x7f18df6ed010 , event 0x7f18c3762700 .
         3 available blocks cached (131584 bytes), 0 live blocks (0 bytes) outstanding.

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor allocated new block at 0x7f18b8c00000 (2097152 bytes associated with queue 0x7f18df6ed010, event 0x7f18c375e7d0.

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor allocated new block at 0x7f194b2c9600 (524288 bytes associated with queue 0x7f18c3361c10, event 0x7f18c375db10.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18bb2a0400 (512 bytes associated with queue 0x7f18c3361b10, event 0x7f18c375dab0.

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor reused cached block at 0x7f194b2c9400 (256 bytes) for queue 0x7f18c3361b10, event 0x7f18c3761aa0 (previously associated with queue 0x7f18df6ed010 , event 0x7f18c3761aa0).

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB reused cached block at 0x7f18bb2a0200 (256 bytes) for queue 0x7f18c3361b10, event 0x7f18c3760150 (previously associated with queue 0x7f18df6ed010 , event 0x7f18c3760150).

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor returned 256 bytes at 0x7f194b2c9400 from associated queue 0x7f18c3361b10 , event 0x7f18c3761aa0 .
         3 available blocks cached (131584 bytes), 2 live blocks (2621440 bytes) outstanding.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB reused cached block at 0x7f18bb2a0000 (256 bytes) for queue 0x7f18c3361c10, event 0x7f18c3760d80 (previously associated with queue 0x7f18df6ed010 , event 0x7f18c3760d80).

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18bba00000 (2097152 bytes associated with queue 0x7f18c3361c10, event 0x7f189ee6d950.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18bb3f9000 (8192 bytes associated with queue 0x7f18c3361c10, event 0x7f189ee6d980.

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor reused cached block at 0x7f194b2a9200 (131072 bytes) for queue 0x7f18c3361c10, event 0x7f18c3762700 (previously associated with queue 0x7f18df6ed010 , event 0x7f18c3762700).

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB reused cached block at 0x7f18bb280000 (131072 bytes) for queue 0x7f18c3361c10, event 0x7f18c3760db0 (previously associated with queue 0x7f18df6ed010 , event 0x7f18c3760db0).

    alpaka::DevCpu AMD EPYC 7313 16-Core Processor returned 131072 bytes at 0x7f194b2a9200 from associated queue 0x7f18c3361c10 , event 0x7f18c3762700 .
         3 available blocks cached (131584 bytes), 2 live blocks (2621440 bytes) outstanding.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18b9cc0000 (524288 bytes associated with queue 0x7f18c3361c10, event 0x7f189ee6fee0.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18b9d40000 (131072 bytes associated with queue 0x7f18c3361c10, event 0x7f189ee6feb0.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB allocated new block at 0x7f18bb3fb000 (256 bytes associated with queue 0x7f18c3361c10, event 0x7f189ee6fe80.

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB returned 256 bytes at 0x7f18bb3fb000 from associated queue 0x7f18c3361c10 , event 0x7f189ee6fe80 .
         1 available blocks cached (256 bytes), 11 live blocks (22291456 bytes) outstanding.

CachingAllocator::free() caught an alpaka error: /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/uniformCudaHip/Set.hpp(110) 'TApi::memsetAsync( getPtrNative(view), static_cast<int>(this->m_byte), extentWidthBytes, queue.getNativeHandle())' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB freed 131072 bytes at 0x7f18b9d40000 from associated queue 0x7f18c3361c10, event 0x7f189ee6feb0 .
         1 available blocks cached (256 bytes), 10 live blocks (22160384 bytes) outstanding.

/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(266) 'TApi::free(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB failed to allocate 8388608 bytes for queue 0x7f189ee1d310, retrying after freeing cached allocations

CachingAllocator::free() caught an alpaka error: /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/uniformCudaHip/Set.hpp(110) 'TApi::memsetAsync( getPtrNative(view), static_cast<int>(this->m_byte), extentWidthBytes, queue.getNativeHandle())' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB freed 524288 bytes at 0x7f18b9cc0000 from associated queue 0x7f18c3361c10, event 0x7f189ee6fee0 .
         1 available blocks cached (256 bytes), 9 live blocks (21636096 bytes) outstanding.

/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(266) 'TApi::free(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB freed 256 bytes.
          0 available blocks cached (0 bytes), 9 live blocks (21636096 bytes) outstanding.

/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(266) 'TApi::free(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB failed to allocate 2097152 bytes for queue 0x7f189ee1d310, retrying after freeing cached allocations

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB failed to allocate 524288 bytes for queue 0x7f189ee1d310, retrying after freeing cached allocations

    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB failed to allocate 67108864 bytes for queue 0x7f189ee1d310, retrying after freeing cached allocations

----- Begin Fatal Exception 01-Jul-2024 10:35:19 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 380399 lumi: 123 event: 121628654 stream: 0
   [1] Running path 'DQM_HcalReconstruction_v8'
   [2] Calling method for module PFRecHitSoAProducerHCAL@alpaka/'hltParticleFlowRecHitHBHESoA'
Exception Message:
A std::exception was thrown.
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_9_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_9_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(160) 'TApi::eventRecord(event.getNativeHandle(), queue.getNativeHandle())' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
----- End Fatal Exception -------------------------------------------------
01-Jul-2024 10:35:19 CEST  Closed file root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/error_stream_root/run380399/run380399_ls0123_index000130_fu-c2b03-08-01_pid2367851.root
CachingAllocator::free() caught an alpaka error: /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/uniformCudaHip/Set.hpp(110) 'TApi::memsetAsync( getPtrNative(view), static_cast<int>(this->m_byte), extentWidthBytes, queue.getNativeHandle())' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
    alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB freed 2097152 bytes at 0x7f18bba00000 from associated queue 0x7f18c3361c10, event 0x7f189ee6d950 .
         0 available blocks cached (0 bytes), 8 live blocks (19538944 bytes) outstanding.

/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(266) 'TApi::free(ptr)' returned error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
terminate called after throwing an instance of 'std::runtime_error'
  what():
src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorIllegalAddress: an illegal memory access was encountered
mmusich commented 3 months ago

@cms-sw/pf-l2 FYI (@waredjeb @jsamudio )

VinInn commented 3 months ago

Confirmed that adding

process.MessageLogger.CUDAService = {}
process.MessageLogger.AlpakaService = {}

process.load('HeterogeneousCore.CUDAServices.CUDAService_cfi')

from HeterogeneousCore.AlpakaServices.AlpakaServiceCudaAsync_cfi import AlpakaServiceCudaAsync as _AlpakaServiceCudaAsync
process.AlpakaServiceCudaAsync = _AlpakaServiceCudaAsync.clone(
    verbose = True,
    hostAllocator = dict(
      binGrowth = 2,
      minBin = 8,                           # 256 bytes
      maxBin = 30,                          #   1 GB
      maxCachedBytes = 64*1024*1024*1024,   #  64 GB
      maxCachedFraction = 0.8,              # or 80%, whatever is less
      fillAllocations = True,
      fillAllocationValue = 0xA5,
      fillReallocations = True,
      fillReallocationValue = 0x69,
      fillDeallocations = True,
      fillDeallocationValue = 0x5A,
      fillCaches = True,
      fillCacheValue = 0x96
    ),
    deviceAllocator = dict(
      binGrowth = 2,
      minBin = 8,                           # 256 bytes
      maxBin = 30,                          #   1 GB
      maxCachedBytes = 8*1024*1024*1024,    #   8 GB
      maxCachedFraction = 0.8,              # or 80%, whatever is less
      fillAllocations = True,
      fillAllocationValue = 0xA5,
      fillReallocations = True,
      fillReallocationValue = 0x69,
      fillDeallocations = True,
      fillDeallocationValue = 0x5A,
      fillCaches = True,
      fillCacheValue = 0x96
    )
)

makes the crash above occur.

I removed export MALLOC_CONF=junk:true just to be sure

mmusich commented 3 months ago

Just sharing the full recipe below:

Then use:

#!/bin/bash -ex

# List of run numbers
runs=(
  380399
  380624
  381067
  381190
  381286
  381443
  381479
  381417
  381543
  381544
)

# Base directory for input files on EOS
base_dir="/store/group/tsg/FOG/error_stream_root/run"

# Global tag for the HLT configuration
global_tag="140X_dataRun3_HLT_v3"

# EOS command (adjust this if necessary for your environment)
eos_cmd="eos"

# Loop over each run number
for run in "${runs[@]}"; do
  # Set the MALLOC_CONF environment variable
  # export MALLOC_CONF=junk:true

  # Construct the input directory path
  input_dir="${base_dir}${run}"

  # Find all root files in the input directory on EOS
  root_files=$(${eos_cmd} find -f "/eos/cms${input_dir}" -name "*.root" | awk '{print "root://eoscms.cern.ch/" $0}' | paste -sd, -)

  # Check if there are any root files found
  if [ -z "${root_files}" ]; then
    echo "No root files found for run ${run} in directory ${input_dir}."
    continue
  fi

  # Create filenames for the HLT configuration and log file
  hlt_config_file="hlt_run${run}.py"
  hlt_log_file="hlt_run${run}.log"

  # Generate the HLT configuration file
  hltGetConfiguration run:${run} \
    --globaltag ${global_tag} \
    --data \
    --no-prescale \
    --no-output \
    --max-events -1 \
    --input ${root_files} > ${hlt_config_file}

  # Append additional options to the configuration file
  cat <<@EOF >> ${hlt_config_file}
del process.MessageLogger
process.load('FWCore.MessageService.MessageLogger_cfi')  
process.MessageLogger.CUDAService = {}
process.MessageLogger.AlpakaService = {}
process.load('HeterogeneousCore.CUDAServices.CUDAService_cfi')
from HeterogeneousCore.AlpakaServices.AlpakaServiceCudaAsync_cfi import AlpakaServiceCudaAsync as _AlpakaServiceCudaAsync
process.AlpakaServiceCudaAsync = _AlpakaServiceCudaAsync.clone(
    verbose = True,
    hostAllocator = dict(
    binGrowth = 2,
    minBin = 8,                           # 256 bytes
    maxBin = 30,                          #   1 GB
    maxCachedBytes = 64*1024*1024*1024,   #  64 GB
    maxCachedFraction = 0.8,              # or 80%, whatever is less
    fillAllocations = True,
    fillAllocationValue = 0xA5,
    fillReallocations = True,
    fillReallocationValue = 0x69,
    fillDeallocations = True,
    fillDeallocationValue = 0x5A,
    fillCaches = True,
    fillCacheValue = 0x96
    ),
    deviceAllocator = dict(
    binGrowth = 2,
    minBin = 8,                           # 256 bytes
    maxBin = 30,                          #   1 GB
    maxCachedBytes = 8*1024*1024*1024,    #   8 GB
    maxCachedFraction = 0.8,              # or 80%, whatever is less
    fillAllocations = True,
    fillAllocationValue = 0xA5,
    fillReallocations = True,
    fillReallocationValue = 0x69,
    fillDeallocations = True,
    fillDeallocationValue = 0x5A,
    fillCaches = True,
    fillCacheValue = 0x96
    )
)
process.options.wantSummary = True
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF

# Run the HLT configuration with cmsRun and redirect output to log file
cmsRun ${hlt_config_file} &> ${hlt_log_file}

done
jsamudio commented 3 months ago

Investigating from the PF side.

trocino commented 3 months ago

Just reporting another instance in run 382654 (full log file attached). old_hlt_run382654_pid1746184.log

VinInn commented 3 months ago

The crash is actually due to the hostAllocator. I was missing the fix from #45210, and it clearly seems to be due to the same cause. Trying again using patch2.

VinInn commented 3 months ago

Starting from CMSSW_14_0_9_patch2_MULTIARCHS it crashes as well. Setting the fillXYZ options to False in the hostAllocator, it goes through.

BUT setting MALLOC_CONF=junk:true crashes, so patch2 DOES NOT contain #45210. Given that @fwyzard's branch has been deleted: how can I merge it?

mmusich commented 3 months ago

> Given that @fwyzard's branch has been deleted: how can I merge it?

use the last 14.0.X IB?

mmusich commented 3 months ago

otherwise you can manually merge a patch with git apply.

VinInn commented 3 months ago

Found CMSSW_14_0_MULTIARCHS_X_2024-07-02-1100. Trying it.

VinInn commented 3 months ago

So, starting from CMSSW_14_0_MULTIARCHS_X_2024-07-02-1100, MALLOC_CONF=junk:true does not make cmsRun crash.

Setting the hostAllocator fill options to "True" crashes as above; setting them to "False" (so no memory filling), it goes through. The device allocator is set to True, so it is filled with the non-zero pattern.

VinInn commented 3 months ago

By the way, the final crash is somewhere in the HCAL HcalSiPMCharacteristicsGPU code:

Thread 1 (Thread 0x7fac8f5b6640 (LWP 3388387) "cmsRun"):
#0  0x00007fac9018fac1 in poll () from /usr/lib64/libc.so.6
#1  0x00007fac8735a0cf in full_read.constprop () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02844/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MULTIARCHS_X_2024-06-30-0000/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fac8730e1ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02844/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MULTIARCHS_X_2024-06-30-0000/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  0x00007fac8730e370 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02844/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MULTIARCHS_X_2024-06-30-0000/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fac900ab52f in raise () from /usr/lib64/libc.so.6
#6  0x00007fac9007ee65 in abort () from /usr/lib64/libc.so.6
#7  0x00007fac90a98a49 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#8  0x00007fac90aa406a in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#9  0x00007fac90aa30d9 in __cxa_call_terminate (ue_header=0x7fabb02de380) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#10 0x00007fac90aa37f6 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=<optimized out>, context=0x7ffdcd6d51b0) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:688
#11 0x00007fac9066b864 in _Unwind_RaiseException_Phase2 (exc=0x7fabb02de380, context=0x7ffdcd6d51b0, frames_p=0x7ffdcd6d50b8) at ../../../libgcc/unwind.inc:64
#12 0x00007fac9066c2bd in _Unwind_Resume (exc=0x7fabb02de380) at ../../../libgcc/unwind.inc:242
#13 0x00007fac8950501d in cms::cuda::free_device(int, void*) [clone .cold] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02844/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MULTIARCHS_X_2024-06-30-0000/lib/el8_amd64_gcc12/scram_x86-64-v3/libHeterogeneousCoreCUDAUtilities.so
#14 0x00007fac131a446d in edm::eventsetup::CallbackProductResolver<edm::eventsetup::Callback<edm::ESProducer, edm::ESProducer::setWhatProduced<ConvertingESProducerT<HcalSiPMCharacteristicsRcd, HcalSiPMCharacteristicsGPU, HcalSiPMCharacteristics>, std::unique_ptr<HcalSiPMCharacteristicsGPU, std::default_delete<HcalSiPMCharacteristicsGPU> >, HcalSiPMCharacteristicsRcd, edm::eventsetup::CallbackSimpleDecorator<HcalSiPMCharacteristicsRcd> >(ConvertingESProducerT<HcalSiPMCharacteristicsRcd, HcalSiPMCharacteristicsGPU, HcalSiPMCharacteristics>*, std::unique_ptr<HcalSiPMCharacteristicsGPU, std::default_delete<HcalSiPMCharacteristicsGPU> > (ConvertingESProducerT<HcalSiPMCharacteristicsRcd, HcalSiPMCharacteristicsGPU, HcalSiPMCharacteristics>::*)(HcalSiPMCharacteristicsRcd const&), edm::eventsetup::CallbackSimpleDecorator<HcalSiPMCharacteristicsRcd> const&, edm::es::Label const&)::{lambda(HcalSiPMCharacteristicsRcd const&)#1}, std::unique_ptr<HcalSiPMCharacteristicsGPU, std::default_delete<HcalSiPMCharacteristicsGPU> >, HcalSiPMCharacteristicsRcd, edm::eventsetup::CallbackSimpleDecorator<HcalSiPMCharacteristicsRcd> >, HcalSiPMCharacteristicsRcd, std::unique_ptr<HcalSiPMCharacteristicsGPU, std::default_delete<HcalSiPMCharacteristicsGPU> > >::invalidateCache() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02844/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MULTIARCHS_X_2024-06-30-0000/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoLocalCaloHcalRecProducers.so
#15 0x00007fac92b2f561 in edm::eventsetup::EventSetupRecordImpl::invalidateProxies() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02844/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MULTIARCHS_X_2024-06-30-0000/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#16 0x00007fac92b2f5bf in edm::FunctorWaitingTask<edm::eventsetup::EventSetupRecordIOVQueue::startNewIOVAsync(edm::WaitingTaskHolder const&, edm::WaitingTaskList&)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda(std::__exception_ptr::exception_ptr const*)#1}>::execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02844/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MULTIARCHS_X_2024-06-30-0000/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fac92ac8bbe in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02844/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MULTIARCHS_X_2024-06-30-0000/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fac91301281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fac8dedbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#19 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fac8dedbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#20 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#21 0x00007fac92adbb76 in edm::EventProcessor::taskCleanup() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02844/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_MULTIARCHS_X_2024-06-30-0000/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#22 0x0000000000404896 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const [clone .cold] ()
#23 0x00007fac912ed9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=warning: RTTI symbol not found for class 'tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>'
#24 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#25 0x000000000040517c in main ()
jsamudio commented 3 months ago

Using this recipe: https://github.com/cms-sw/cmssw/issues/44923#issuecomment-2199709930 + #45210 on top, I can replicate the crashes. I note that I see an invalid HCAL detId being fed into the PF RecHit sequence that is common across these crashing runs. detId = 1768515945. I am trying to sort out whether we handle invalid detIds in a good way for PF.

I also tried setting all the PF Alpaka modules manually to the serial_sync backend and I get the following log file from run 380624: serialPF.log

I tried using gdb on the same run again with PF on serial_sync, and had some inconsistent results. One instance of the cmsRun job completed successfully, and the next gave a crash with:

CachingAllocator::free() caught an alpaka error: /cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/uniformCudaHip/Set.hpp(110) 'TApi::memsetAsync( getPtrNative(view), static_cast<int>(this->m_byte), extentWidthBytes, queue.getNativeHandle())' returned error  : 'cudaErrorMisalignedAddress': 'misaligned address'!
        alpaka::DevUniformCudaHipRt<alpaka::ApiCudaRt> NVIDIA A100-PCIE-40GB freed 256 bytes at 0x7fff053ffe00 from associated queue 0x7ffee8cb7110, event 0x7ffeda7d3820 .
                 82 available blocks cached (86221056 bytes), 12 live blocks (100827648 bytes) outstanding.

/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorMisalignedAddress': 'misaligned address'!
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(266) 'TApi::free(ptr)' returned error  : 'cudaErrorMisalignedAddress': 'misaligned address'!

I am not sure if any of this is necessarily helpful, but I am trying to collect as much information as I can.

VinInn commented 3 months ago

These are indeed the same symptoms as the ECAL issue, and the solution is most probably similar to the one in #45210, to be implemented wherever some ID mapping is made for HCAL. (A better solution would be to initialize to a non-zero value, say InvalidId = 0xffffffff, and then test against InvalidId instead of 0.)

I would say this is for HCAL-DPG.
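A minimal sketch of that sentinel approach (illustrative names, not the actual CMSSW interfaces) could look like:

#include <algorithm>
#include <cstdint>
#include <vector>

// Initialise the detId column to an explicit invalid marker instead of relying
// on it being zero, and test against the marker downstream.
constexpr std::uint32_t kInvalidDetId = 0xffffffff;

void resetDetIds(std::vector<std::uint32_t>& detIds) {
  std::fill(detIds.begin(), detIds.end(), kInvalidDetId);
}

bool isValidDetId(std::uint32_t id) { return id != kInvalidDetId; }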

VinInn commented 3 months ago

By the way, 1768515945 is 0x69696969, which corresponds to the "fillReallocationValue".

fwyzard commented 3 months ago

> I note that I see an invalid HCAL detId being fed into the PF RecHit sequence that is common across these crashing runs. detId = 1768515945.

1768515945 is 0x69696969, which is what the CachingAllocator uses to fill a re-used memory buffer.

These values are likely coming from memory that is accessed without being initialised; normally that does not cause issues because the memory happens to be zero-initialised "by chance".
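For reference, a quick standalone check of that correspondence (not CMSSW code):

#include <cstdint>
#include <cstdio>

int main() {
  // four bytes of the 0x69 reallocation fill pattern, read back as a 32-bit detId
  std::uint32_t fill = 0x69696969u;
  std::printf("%u\n", fill);  // prints 1768515945
}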

fwyzard commented 3 months ago

> I would say this is for HCAL-DPG.

In the current menu this may also be for PF, since they wrote the SoA converter.

I wonder if #45324 and customizeHLTforAlpakaPFSoA() fix this, or not.

fwyzard commented 3 months ago

By the way, in case this helps debugging: these invalid DetIds come from memory that has been initialised on the host, not on the device.

VinInn commented 3 months ago

Very subtle bug: initializing to zero DOES NOT avoid the crash. It seems to rely on the previous value being valid (which, in a SoA, may happen to work).

fwyzard commented 3 months ago

Interesting !

fwyzard commented 3 months ago

The invalid HCAL DetIds come from uninitialised memory in RecoParticleFlow/PFRecHitProducer/plugins/alpaka/CaloRecHitSoAProducer.cc.

If I add

--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/CaloRecHitSoAProducer.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/CaloRecHitSoAProducer.cc
@@ -34,6 +34,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
         printf("Found %d recHits\n", num_recHits);

       reco::CaloRecHitHostCollection hostProduct{num_recHits, event.queue()};
+      alpaka::memset(event.queue(), hostProduct.buffer(), 0xFF);
       auto& view = hostProduct.view();

       for (int i = 0; i < num_recHits; i++) {

the value reported for the invalid HCAL DetIds changes to 0xFFFFFFFF.

VinInn commented 3 months ago

Is this correct: convertRecHit(hcal::RecHitHostCollection::View::element to, ...) or should it be convertRecHit(hcal::RecHitHostCollection::View::element & to, ...)?

It is correct, sorry for the noise.

fwyzard commented 3 months ago

I think it's correct: the View::element is a lightweight proxy that holds the pointers to the actual values in the SoA columns, so passing it by value should be fine.

In fact, I'm not sure we can pass it by reference, because it's a temporary. Maybe we could pass it by &&? But I'm never sure what the semantics are with &&...
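A minimal sketch of that idea (hypothetical types, not the actual generated SoA classes): the element proxy only stores pointers into the SoA columns, so a by-value copy still writes to the same underlying storage.

#include <cstdint>

struct RecHitElementProxy {  // hypothetical stand-in for View::element
  std::uint32_t* detId;
  float* energy;
};

inline void convertRecHit(RecHitElementProxy to, std::uint32_t id, float e) {
  *to.detId = id;    // reaches the SoA column through the stored pointer
  *to.energy = e;
}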

VinInn commented 3 months ago

convertRecHit(hcal::RecHitHostCollection::View::element & to, ...) does not compile.

VinInn commented 3 months ago

There is something very fishy going on. So I added

diff --git a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/CaloRecHitSoAProducer.cc b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/CaloRecHitSoAProducer.cc
index 9a912d65e99..fb812618d22 100644
--- a/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/CaloRecHitSoAProducer.cc
+++ b/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/CaloRecHitSoAProducer.cc
@@ -37,13 +37,16 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
         printf("Found %d recHits\n", num_recHits);

       hcal::RecHitHostCollection hostProduct{num_recHits, event.queue()};
+      alpaka::memset(event.queue(), hostProduct.buffer(), 0xFF);
       auto& view = hostProduct.view();

       for (int i = 0; i < num_recHits; i++) {
         convertRecHit(view[i], recHits[i]);

-        if (DEBUG && i < 10)
-          printf("recHit %4d %u %f %f\n", i, view.detId(i), view.energy(i), view.timeM0(i));
+        if (0xffffffff==view.detId(i)) {
+          printf("view %4d %u %f %f\n", i, view.detId(i), view.energy(i), view.timeM0(i));
+          printf("recHit %4d %u %f %f\n", i, recHits[i].id().rawId(), recHits[i].energy(), recHits[i].time());
+        }
       }

       hcal::RecHitDeviceCollection deviceProduct{num_recHits, event.queue()};

and I see plenty of invalid detIds (4294967295), but NOT my printout...

VinInn commented 3 months ago

If I change

       hcal::RecHitHostCollection hostProduct{num_recHits, event.queue()};
+      alpaka::memset(event.queue(), hostProduct.buffer(), 0xFF);
+      alpaka::wait(event.queue());
       auto& view = hostProduct.view();

It does not crash anymore (and no invalid DetId is reported). If I switch on filling in the config, it may crash in other locations (which we have seen already: it did not always report an invalid detId).

Is it possible that alpaka::memset(event.queue(), hostProduct.buffer(), 0xFF); is NOT synchronous? (That would not make sense for a host buffer.)
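For illustration, the pattern under discussion as a minimal sketch, assuming only the alpaka calls already used above (this is not the CMSSW code): if the memset is enqueued on a non-blocking queue, the host must synchronise before touching the buffer.

#include <alpaka/alpaka.hpp>

template <typename TQueue, typename TBuf>
void fillThenUseOnHost(TQueue& queue, TBuf& hostBuffer) {
  alpaka::memset(queue, hostBuffer, 0xFF);  // may run asynchronously with respect to the host
  alpaka::wait(queue);                      // without this, the host may read stale bytes
  // ... only now is it safe to read or overwrite hostBuffer from host code ...
}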