cms-bot internal usage
A new Issue was created by @missirol.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign heterogeneous, reconstruction, hlt
New categories assigned: heterogeneous,reconstruction,hlt
@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
Let's tag @cms-sw/tracking-pog-l2
> The test modifies a recent HLT pp menu by setting the backend of the Alpaka pixel-tracks and pixel-vertices SoA producers to `"serial_sync"` (in other words, offloading the pixel local reconstruction to GPUs, then forcing track and vertex reconstruction to run on CPU). Is [1] supposed to work?
Theoretically I'd expect it to work, at least from the framework point of view.
type tracking
@AdrianoDee FYI
Compiling with debug symbols points to the crash occurring in https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/RecoTracker/PixelSeeding/plugins/alpaka/CAPixelDoubletsAlgos.h#L270
Some additional information from a debugger session:

```
pIndex = 0
kl = 31
kk = 31
khh = 17
hoff = 256
phiBinner.off.m_v[hoff+kk] = 6504
phiBinner.content.m_capacity = 29601
```

so theoretically `p[0]` should be valid (`p = &(phiBinner.content.m_v[phiBinner.off.m_v[hoff+kk]])`), assuming `phiBinner.content.m_v` gets set properly.
Looking then at the `HitsConstView<TrackerTraits> hh` from which the `phiBinner` is obtained:

```
hh.elements_ = 29601
# consistent with phiBinner.content.m_capacity
hh.phiBinnerStorageParameters_.addr_ = 0x7fff5393e580
phiBinner.content.m_v = 0x7fff5373e580
# phiBinner.content.m_v is exactly 2 MiB smaller than phiBinnerStorageParameters_.addr_ !
# ok, the "exactly 2 MiB" could be a coincidence
```
`phiBinner.content.m_v` is set here
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/interface/FlexiStorage.h#L27-L30
called from `OneToManyAssocBase<...>::initStorage()`
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h#L49

AFAICT `initStorage()` is called only in the `zeroAndInit` kernel
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h#L104
and the `launchZero` kernel
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h#L137
In particular, I see that the device-to-host copy of `TrackingRecHitsSoACollection<TrackerTraits>`
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h#L35-L45
does not call `initStorage()`, nor set `phiBinner.content.m_v` in any other way.
I see the `HistoContainer` unit test does call `initStorage()` after the device-to-host copy
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/test/alpaka/testHistoContainer.dev.cc#L201-L208
before inspecting the host-side data.
I think the device-to-host copy of `TrackingRecHitsSoACollection<TrackerTraits>` is missing the call to `initStorage()`; as a result `phiBinner.begin()` returns a pointer to device memory, and `p[pIndex]` then segfaults.
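As a self-contained illustration of this failure mode (plain C++, not the actual CMSSW classes; `FlexiStorageLike` and `PhiBinnerLike` below are made-up stand-ins), the embedded raw pointer keeps referring to the old buffer after the storage itself has been copied, unless it is explicitly re-initialized:

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-ins for the FlexiStorage / OneToManyAssoc pattern
struct FlexiStorageLike {
  void init(uint32_t* v, int capacity) {
    m_v = v;
    m_capacity = capacity;
  }
  uint32_t* m_v = nullptr;  // raw pointer into externally owned storage
  int m_capacity = 0;
};

struct PhiBinnerLike {
  FlexiStorageLike content;
  // returns whatever content.m_v currently points to; stale if init() was not re-run
  uint32_t const* begin() const { return content.m_v; }
};

int main() {
  std::vector<uint32_t> deviceLike(29601, 0);  // stands in for the device-side buffer
  PhiBinnerLike binner;
  binner.content.init(deviceLike.data(), static_cast<int>(deviceLike.size()));

  // "device-to-host copy": the storage is copied, but the embedded pointer is not updated
  std::vector<uint32_t> hostLike = deviceLike;
  // binner.begin() still points into deviceLike; with a real GPU backend that is device
  // memory, and dereferencing it on the host (p[pIndex]) segfaults

  // the missing step, analogous to calling initStorage() after the copy:
  binner.content.init(hostLike.data(), static_cast<int>(hostLike.size()));
  return 0;
}
```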
Given the comment
https://github.com/cms-sw/cmssw/blob/96d37fb42f09d54ffbd623e1754dd5301deda924/HeterogeneousCore/AlpakaInterface/test/alpaka/testHistoContainer.dev.cc#L201
it seems the `copyAsync()` function must synchronize with `alpaka::wait()` before calling `initStorage()`. This might be sufficient, at least for subsequent testing.
For the longer term, assuming we'd want to remove this `alpaka::wait()` call, it could be fairly straightforward to extend the `CopyToHost` and `CopyToDevice` class templates to allow a post-copy modification operation (in a way the present `CopyToHost::copyAsync()` resembles the `acquire()` method in `ExternalWork`/`SynchronizingEDProducer`, and the new function would correspond to the `produce()` method).
> ... it seems the `copyAsync()` function must synchronize with `alpaka::wait()` before calling `initStorage()`. This might be sufficient, at least for subsequent testing.
Shall we have a PR for this, while a more thorough fix is developed concerning https://github.com/cms-sw/framework-team/issues/989? I see that Marino has a commit https://github.com/missirol/cmssw/commit/62620da84d02d0749868ae6281490b461905c8d2 about it (I didn't test). Let me remind that this is in the critical path for the building of the 2024 HIon menu. @cms-sw/core-l2
https://github.com/missirol/cmssw/commit/62620da84d02d0749868ae6281490b461905c8d2 is my best-guess of a patch based on the explanations in https://github.com/cms-sw/cmssw/issues/45708#issuecomment-2294166204 (thanks @makortel for debugging the problem), but I don't know if it's correct.
I checked that it avoids the crash, and the trigger results are the same (modulo what I think are the usual small GPU-vs-CPU discrepancies) when running pixel tracking+vertexing on CPU (as in the reproducer in the description) vs running all Alpaka modules on GPU, but so far I only tested on O(10) events.
> Shall we have a PR for this, while a more thorough fix is developed concerning cms-sw/framework-team#989? I see that Marino has a commit missirol@62620da about it (I didn't test).
A fix along the lines of https://github.com/missirol/cmssw/commit/62620da84d02d0749868ae6281490b461905c8d2 is needed in any case. https://github.com/cms-sw/framework-team/issues/989 will only help to remove the `alpaka::wait()` call in https://github.com/missirol/cmssw/commit/62620da84d02d0749868ae6281490b461905c8d2.
> Let me remind that this is in the critical path for the building of the 2024 HIon menu.
Could you point me to a timeline?
Also, will the HLT use 14_0_X or 14_1_X for the HI data taking? (@missirol's test used 14_0_14, but my understanding is that 14_1_X would be the HI data-taking release cycle.) I'm asking early, because whether or not the outcome of https://github.com/cms-sw/framework-team/issues/989 needs to be backported impacts how it will be done (because a 14_1_X-only solution could use C++20 features).
> missirol@62620da is my best-guess of a patch based on the explanations in #45708 (comment)
I'd believe the lines related to `pbv` (i.e. 46-51) would not be needed, but it would be good if e.g. @AdrianoDee could confirm.
> I checked that it avoids the crash, and the trigger results are the same (modulo what I think are the usual small GPU-vs-CPU discrepancies) when running pixel tracking+vertexing on CPU (as in the reproducer in the description) vs running all Alpaka modules on GPU, but so far I only tested on O(10) events.
:+1: A performance test (to see the cost of the `alpaka::wait()`) would also be interesting.
@makortel
> Could you point me to a timeline?
Please refer to this; notice that any further tracking update hinges on this ticket entering first.
> Also, will the HLT use 14_0_X or 14_1_X for the HI data taking?
HLT will use 14_1_X for actual data-taking, but we're still integrating updates in 14_0_X (and will continue doing so until we have `CMSSW_14_1_0` out, when we'll move the confDB template for HLT menu development). Thus we'll need both a master PR and a backport of at least something along the lines of https://github.com/missirol/cmssw/commit/62620da84d02d0749868ae6281490b461905c8d2 in order to keep moving.
> I'd believe the lines related to `pbv` (i.e. 46-51) would not be needed, but it would be good if e.g. @AdrianoDee could confirm.
I can confirm it (see https://github.com/cms-sw/cmssw/pull/45743#discussion_r1722143309).
Sorry in advance for my ignorance...
> I'd believe the lines related to `pbv` (i.e. 46-51) would not be needed, but it would be good if e.g. @AdrianoDee could confirm.
I don't know how to remove L46; I thought the `initStorage` method required a `PhiBinnerView` as function argument.
If I remove L47-51, the reproducer crashes as follows.

```
cmsRun: /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/src/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h:45: void cms::alpakatools::OneToManyAssocBase<I, ONES, SIZE>::initStorage(View) [with I = unsigned int; int ONES = 2561; int SIZE = -1]: Assertion `view.assoc == this' failed.
```
If I remove L48-51, the reproducer crashes as follows.

```
cmsRun: /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/src/HeterogeneousCore/AlpakaInterface/interface/OneToManyAssoc.h:47: void cms::alpakatools::OneToManyAssocBase<I, ONES, SIZE>::initStorage(View) [with I = unsigned int; int ONES = 2561; int SIZE = -1]: Assertion `view.contentStorage' failed.
```
Seems like they are necessary after all, thanks for trying it out.
For the longer term one could ask whether `TrackingRecHitsSoA` is really the right place for the `PhiBinner` etc. https://github.com/cms-sw/cmssw/issues/44700 has some related discussion, although this question might really belong to https://github.com/cms-sw/cmssw/issues/43796.
It's not your ignorance, it's my hastiness. I had completely missed the context here of this mixed GPU+CPU reconstruction, sorry. They are 100% needed.
@AdrianoDee , no worries, thanks for having a look. :)
> A performance test (to see the cost of the `alpaka::wait()`) would also be interesting.
The check is running..
@makortel @missirol I'm wondering if, instead of blocking with an `alpaka::wait(queue)` between the copy and the "fix-after-copy" part, it might work better to enqueue the "fix-after-copy" part in a host task on the same device queue:
```cpp
// Update the contents address of the phiBinner histo container after the copy from device happened
alpaka::enqueue(queue, [hits = hostData.nHits(), cview = hostData.view()]() {
  typename TrackingRecHitSoA<TrackerTraits>::PhiBinnerView pbv;
  auto view = cview;
  pbv.assoc = &(view.phiBinner());
  pbv.offSize = -1;
  pbv.offStorage = nullptr;
  pbv.contentSize = hits;
  pbv.contentStorage = view.phiBinnerStorage();
  view.phiBinner().initStorage(pbv);
});
```
(the extra copy of the `view` is needed because the variables captured by the lambda cannot be modified, unless one uses a `const_cast` or declares the lambda as `mutable`)
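For completeness, the `mutable` variant mentioned above would look roughly like this (same names and assumptions as in the snippet above, just without the extra local copy of the view):

```cpp
// Same host task as above, but with a mutable lambda so the captured view can be used directly
alpaka::enqueue(queue, [hits = hostData.nHits(), view = hostData.view()]() mutable {
  typename TrackingRecHitSoA<TrackerTraits>::PhiBinnerView pbv;
  pbv.assoc = &(view.phiBinner());
  pbv.offSize = -1;
  pbv.offStorage = nullptr;
  pbv.contentSize = hits;
  pbv.contentStorage = view.phiBinnerStorage();
  view.phiBinner().initStorage(pbv);
});
```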
For the record, the check below considers a recent pp menu, and shows a very small impact on HLT throughput (within uncertainties).
Reference ([json]()): `CMSSW_14_0_14_MULTIARCHS`.

```
Running 4 times over 40100 events with 8 jobs, each with 32 threads, 24 streams and 1 GPUs
 628.0 ± 0.0 ev/s (39800 events, 99.1% overlap)
 622.8 ± 0.1 ev/s (39800 events, 99.1% overlap)
 623.0 ± 0.1 ev/s (39800 events, 97.7% overlap)
 624.8 ± 0.0 ev/s (39800 events, 98.3% overlap)
 --------------------
 624.7 ± 2.4 ev/s
```
Target ([json]()): `CMSSW_14_0_14_MULTIARCHS` + https://github.com/cms-sw/cmssw/pull/45744.

```
Running 4 times over 40100 events with 8 jobs, each with 32 threads, 24 streams and 1 GPUs
 625.2 ± 0.0 ev/s (39800 events, 99.0% overlap)
 623.3 ± 0.0 ev/s (39800 events, 98.7% overlap)
 622.7 ± 0.0 ev/s (39800 events, 98.9% overlap)
 622.6 ± 0.0 ev/s (39800 events, 99.3% overlap)
 --------------------
 623.4 ± 1.2 ev/s
```
[*] `/cdaq/physics/Run2024/2e34/v1.4.3/HLT/V2` (recent pp menu), on `hilton-c2b02-44-01` (same hardware as a 2022/23 HLT node) (CPUs: 2 AMD EPYC 7763 64-Core; GPUs: 2 NVIDIA Tesla T4).

@fwyzard Could you remind me who controls the CPU thread(s) the Alpaka host task gets run in (in the case of the CUDA backend), and how those threads are managed?
The implementation has changed a few times since we started using alpaka, so I'm not 100% sure.
From a quick look at the code, it seems that each CUDA queue has a lazy-initialised host thread associated to it; this thread is initialised the first time a host task is submitted.
The thread has a queue of tasks to execute. Each task is enqueued in a blocked state, waiting on a condition variable. A CUDA callback is used to notify the condition variable, unblock the thread, and execute the task.
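For illustration, here is a very rough, self-contained sketch of that mechanism in plain C++ (no CUDA or alpaka; `HostTaskQueue`, `enqueue()` and `release()` are made-up names standing in for the real implementation, with `release()` playing the role of the CUDA callback):

```cpp
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// One worker thread per "queue"; tasks are enqueued in a blocked state and are only
// executed after release() is called (standing in for the CUDA callback that fires
// once the work preceding the host task on the stream has completed).
class HostTaskQueue {
public:
  HostTaskQueue() : worker_([this] { run(); }) {}
  ~HostTaskQueue() {
    {
      std::lock_guard<std::mutex> lk(mtx_);
      done_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }

  void enqueue(std::function<void()> task) {
    std::lock_guard<std::mutex> lk(mtx_);
    tasks_.push(std::move(task));
  }

  void release() {  // the "CUDA callback": unblock the next pending task
    {
      std::lock_guard<std::mutex> lk(mtx_);
      ++released_;
    }
    cv_.notify_all();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lk(mtx_);
    while (true) {
      cv_.wait(lk, [this] { return done_ || (released_ > 0 && !tasks_.empty()); });
      if (released_ > 0 && !tasks_.empty()) {
        auto task = std::move(tasks_.front());
        tasks_.pop();
        --released_;
        lk.unlock();
        task();  // run the host task outside the lock
        lk.lock();
      } else if (done_) {
        break;
      }
    }
  }

  std::mutex mtx_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  int released_ = 0;
  bool done_ = false;
  std::thread worker_;
};

int main() {
  HostTaskQueue queue;
  queue.enqueue([] { std::cout << "fix-after-copy task running on the host thread\n"; });
  queue.release();  // pretend the device-to-host copy finished and the callback fired
}
```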
Thanks @fwyzard. My feeling is that the Alpaka host task might help if this `alpaka::wait()` call had a visible impact (which doesn't seem to be the case in @missirol's test). On the other hand, its implementation sounds complex enough that for the longer term I'd expect the extension of `CopyTo{Host,Device}` to be worth it (e.g. it won't add any new synchronization calls to the system).
Solutions proposed in the short term:
- a master PR and its backport (included in `CMSSW_14_0_15`);
- an update of the HIon menu (see this comment in CMSHLT-3824).

EDIT: it looks like these PRs generated the issue https://github.com/cms-sw/cmssw/issues/45834, thus removing the `hlt` signature.
> For the longer term, assuming we'd want to remove this `alpaka::wait()` call, it could be fairly straightforward to extend the `CopyToHost` and `CopyToDevice` class templates to allow a post-copy modification operation (in a way the present `CopyToHost::copyAsync()` resembles the `acquire()` method in `ExternalWork`/`SynchronizingEDProducer`, and the new function would correspond to the `produce()` method).
A possibility for `CopyToHost<T>::postCopy()` is added in https://github.com/cms-sw/cmssw/pull/45801. (Such a facility is not needed for `CopyToDevice`, which can do a similar operation by enqueuing a kernel call to the queue.)
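As a self-contained sketch of the idea in plain C++ (not the actual interface added in #45801; `CopyToHostLike` and `HostHitsLike` are made-up stand-ins): `copyAsync()` only performs the copy, while `postCopy()`, invoked once the copy is known to be complete, does the `initStorage()`-like pointer fix-up, so no blocking wait is needed inside `copyAsync()`:

```cpp
#include <cstdint>
#include <vector>

// Host-side product: owns the copied storage plus a pointer that must be fixed up
// before the product is usable on the host.
struct HostHitsLike {
  std::vector<uint32_t> storage;
  uint32_t const* binnerBegin = nullptr;  // must point into 'storage' once the copy is done
};

// Simplified CopyToHost-like trait: copyAsync() only performs/enqueues the copy,
// postCopy() is called by the framework once the copy is known to be complete.
template <typename TProduct>
struct CopyToHostLike;

template <>
struct CopyToHostLike<std::vector<uint32_t>> {
  static HostHitsLike copyAsync(std::vector<uint32_t> const& deviceLike) {
    HostHitsLike host;
    host.storage = deviceLike;  // on a real backend this would be an asynchronous memcpy
    return host;
  }
  static void postCopy(HostHitsLike& host) {
    host.binnerBegin = host.storage.data();  // the initStorage()-like fix-up, no wait needed here
  }
};

int main() {
  std::vector<uint32_t> deviceLike(29601, 0);
  auto host = CopyToHostLike<std::vector<uint32_t>>::copyAsync(deviceLike);
  // ... the framework synchronizes the queue before the consumer sees the product ...
  CopyToHostLike<std::vector<uint32_t>>::postCopy(host);
  return 0;
}
```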
> A possibility for `CopyToHost<T>::postCopy()` is added in #45801. (Such a facility is not needed for `CopyToDevice`, which can do a similar operation by enqueuing a kernel call to the queue.)
Thanks @makortel. I would suggest to try and adopt it for CMSSW 14.2.x, and stick to the simpler bugfix for 14.0.x/14.1.x.
> > A possibility for `CopyToHost<T>::postCopy()` is added in #45801. (Such a facility is not needed for `CopyToDevice`, which can do a similar operation by enqueuing a kernel call to the queue.)
>
> Thanks @makortel. I would suggest to try and adopt it for CMSSW 14.2.x, and stick to the simpler bugfix for 14.0.x/14.1.x.
Ok.
+1
+heterogeneous
@cmsbuild, please close
This issue is fully signed and ready to be closed.
The test in [1] crashes at runtime in `CMSSW_14_0_14` when running on a machine with a GPU (I did not try on a machine without one). The test modifies a recent HLT pp menu by setting the backend of the Alpaka pixel-tracks and pixel-vertices SoA producers to `"serial_sync"` (in other words, offloading the pixel local reconstruction to GPUs, then forcing track and vertex reconstruction to run on CPU). This mimics the setup that the HIon group plans to implement in the lead-lead trigger menu of 2024 (see CMSHLT-3284) [*].

The stack trace is in [2]. The crash does not happen if one uses `options.accelerators = ['cpu']`.

Is [1] supposed to work? If so, what's going wrong?
[*] This 'mixed' approach (pixel local reconstruction on GPU, tracking and vertexing on CPU) has already been used in the 2023 HIon run, back then using the CUDA-based implementation of the pixel reconstruction. Pixel tracking is currently not offloaded to GPUs in the HIon menu because this leads to excessive GPU memory consumption (and, in turn, runtime crashes) in lead-lead events (at least with current data-taking conditions and current HLT hardware).
[1]
[2]