cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

`maxIterGPU` assert failure in `alpaka_serial_sync::pixelClustering::FindClus` #44077

Closed missirol closed 7 months ago

missirol commented 7 months ago

Running a recent HLT menu with customizeHLTforAlpaka in CMSSW_14_0_0 as in [1] leads to a runtime error.

cmsRun: src/RecoLocalTracker/SiPixelClusterizer/plugins/alpaka/PixelClustering.h:302:
void alpaka_serial_sync::pixelClustering::FindClus<TrackerTraits>::operator()
(const TAcc&, SiPixelDigisSoAView, SiPixelClustersSoAView, unsigned int) const
[with TAcc = alpaka::AccCpuSerial<std::integral_constant<long unsigned int, 1>, unsigned int>;
TrackerTraits = pixelTopology::Phase1;
SiPixelDigisSoAView = SiPixelDigisLayout<>::ViewTemplateFreeParams<128, false, true, false>;
SiPixelClustersSoAView = SiPixelClustersLayout<>::ViewTemplateFreeParams<128, false, true, false>]:
Assertion `(hist.size() / blockDimension) < maxIterGPU' failed.

The full stack trace from running [1] can be found in pixel_findclus_cpu.log. Note that [1] forces the job to run on CPU only.

A similar crash also occurs on GPU (stack trace in pixel_findclus_gpu.log), but a GPU is not needed to reproduce the issue.

There is no runtime error if the Alpaka customisation is not used.

Could experts please have a look?

FYI: @AdrianoDee @borzari @fwyzard @cms-sw/hlt-l2

[1]

#!/bin/bash

# CMSSW_14_0_0

hltGetConfiguration \
  /dev/CMSSW_14_0_0/GRun/V33 \
  --globaltag 140X_dataRun3_HLT_v1 \
  --max-events -1 \
  --input foo.root \
  --no-output \
  --no-prescale \
  --customise HLTrigger/Configuration/customizeHLTforAlpaka.customizeHLTforAlpaka \
  > hlt.py

cat <<@EOF >> hlt.py

del process.MessageLogger
process.load('FWCore.MessageLogger.MessageLogger_cfi')

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

process.source.fileNames = [
    '/store/data/Run2023D/EphemeralHLTPhysics0/RAW/v1/000/370/293/00000/303b0104-65f3-4518-8b32-8de062eb9713.root',
]

process.options.accelerators = ['cpu']

process.source.skipEvents = cms.untracked.uint32(109)
@EOF

edmConfigDump hlt.py > hlt_dump.py

cmsRun hlt.py &> hlt.log

PS. Just for my own reference, I encountered this crash while testing a recent HLT menu in 14_0_X on one of the HiLTON nodes as described here.

cmsbuild commented 7 months ago

cms-bot internal usage

cmsbuild commented 7 months ago

A new Issue was created by @missirol.

@rappoccio, @makortel, @antoniovilela, @Dr15Jones, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mmusich commented 7 months ago

assign heterogeneous, reconstruction, hlt

@cms-sw/trk-dpg-l2 FYI

cmsbuild commented 7 months ago

New categories assigned: heterogeneous,reconstruction,hlt

@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

fwyzard commented 7 months ago

The assertion fails because in RecoLocalTracker/SiPixelClusterizer/plugins/alpaka/PixelClustering.h we end up with

hist.size(): 4529
block elements: 256

which requires 17 iterations (4529 / 256 in integer division), while maxIterGPU is 16, so the asserted condition 17 < 16 fails.
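For reference, the arithmetic behind the failing check can be reproduced with a minimal sketch. The function name `iterations` and the constant `MAX_ITER_GPU` are hypothetical stand-ins for the quantities quoted above, not the actual code in PixelClustering.h; the only assumption is that the asserted expression uses integer division.

```python
# Hypothetical stand-in for the asserted expression
#   (hist.size() / blockDimension) < maxIterGPU
# using the values reported in this comment.

MAX_ITER_GPU = 16  # limit quoted above


def iterations(hist_size: int, block_elements: int) -> int:
    """Number of iterations as computed by unsigned integer division."""
    return hist_size // block_elements


n_iter = iterations(4529, 256)
print(n_iter)                 # quotient of the integer division
print(n_iter < MAX_ITER_GPU)  # whether the assertion would hold
```

With 256 elements per block the quotient reaches the limit and the condition is false, which matches the assertion failure in the log.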

In the CUDA version the block size is 384: TrackerTraits::maxPixInModule, which is 6000, is divided by 16 (giving 375) and rounded up to a multiple of 128. In the alpaka version the block size is 256 (which seems arbitrary).
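The block-size derivation described for the CUDA version can be sketched as follows. This is only an illustration of the rounding rule stated above; `round_up` and `block_size` are hypothetical helper names, and the sketch makes no claim about how the CUDA or alpaka code actually computes the value.

```python
# Illustrative only: reproduces the rule "maxPixInModule divided by 16,
# rounded up to a multiple of 128" with hypothetical helper names.

def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def block_size(max_pix_in_module: int, max_iter: int = 16, granularity: int = 128) -> int:
    """Smallest multiple of `granularity` covering max_pix_in_module in max_iter steps."""
    per_iteration = -(-max_pix_in_module // max_iter)  # ceiling division
    return round_up(per_iteration, granularity)


cuda_like = block_size(6000)
print(cuda_like)          # block size following the CUDA rule
print(4529 // cuda_like)  # iterations for the histogram size seen in this event
print(4529 // 256)        # iterations with the alpaka block size of 256
```

Under this rule a module with up to 6000 pixels stays below 16 iterations, whereas a block size of 256 exceeds the limit as soon as the histogram holds 4096 or more entries.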

fwyzard commented 7 months ago

Should be fixed by #44081 and #44082.

mmusich commented 7 months ago

+hlt

explicitly tested with:

cmsrel CMSSW_14_0_0
cd CMSSW_14_0_0/src/
cmsenv
git cms-merge-topic 44082
scram b -j 20

and then following the recipe at https://github.com/cms-sw/cmssw/issues/44077#issue-2152647187

fwyzard commented 7 months ago

+heterogeneous