cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

HLT Crashes in HCAL PFClustering in Cosmics Run 383219 #45477

Closed: Sam-Harper closed this issue 3 weeks ago

Sam-Harper commented 3 months ago

There were widespread crashes (~900) in cosmics run 383219.

No changes to the HLT menu or the release had been made around this time, and no other runs immediately before or after had this issue. It should be noted that HCAL had just come back into global running after doing tests, so it seems plausible that HCAL came back in a weird state and that this is the cause of the crashes. I therefore think HCAL experts should review this event (and this run) to ensure they were sending good data to us.

The crash is fully reproducible on the hilton and also on my local CPU-only machine. The crash happens only when the PFClustering is run; if it is not run, there is no crash.

An example event which crashes is at /eos/cms/store/group/tsg/FOG/debug/240715_run383219/run383219_ls013_29703372.root

The cosmics menu run is at /eos/cms/store/group/tsg/FOG/debug/240715_run38219/hlt.py

A minimal menu with just the HCAL reco is /eos/cms/store/group/tsg/FOG/debug/240715_run383219/hltMinimal.py

The release was CMSSW_14_0_11_MULTIARCHS, but the crash is also reproduced in CMSSW_14_0_11.

The error on CPU is

At the end of topoClusterContraction, found large *pcrhFracSize = 2428285
----- Begin Fatal Exception 17-Jul-2024 08:23:05 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 383219 lumi: 13 event: 29703372 stream: 0
   [1] Running path 'DQM_HcalReconstruction_CPU_v8'
   [2] Calling method for module PFClusterSoAProducer@alpaka/'hltParticleFlowClusterHBHESoA'
Exception Message:
A std::exception was thrown.
Out of range index in ViewTemplateFreeParams::operator[]
----- End Fatal Exception -----------------------------------

The error on GPU is

At the end of topoClusterContraction, found large *pcrhFracSize = 2428285
At the end of topoClusterContraction, found large *pcrhFracSize = 2428285
Out of range index in ViewTemplateFreeParams::operator[]
(the above line is repeated a further 2253 times, 2255 occurrences in total)
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/event/EventUniformCudaHipRt.hpp(66) 'TApi::eventDestroy(m_UniformCudaHipEvent)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_11_MULTIARCHS-el8_amd64_gcc12/build/CMSSW_14_0_11_MULTIARCHS-build/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp(356) 'TApi::hostFree(ptr)' returned error  : 'cudaErrorLaunchFailure': 'unspecified launch failure'!
----- Begin Fatal Exception 17-Jul-2024 06:26:11 -----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 383219 lumi: 13 event: 29703372 stream: 0
   [1] Running path 'DQM_HcalReconstruction_v8'
   [2] Calling method for module alpaka_serial_sync::PFClusterSoAProducer/'hltParticleFlowClusterHBHESoASerialSync'
Exception Message:
A std::exception was thrown.
Out of range index in ViewTemplateFreeParams::operator[]
----- End Fatal Exception -------------------------------------------------

gpuCrash.log

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI @cms-sw/hcal-dpg-l2 FYI

cmsbuild commented 3 months ago

cms-bot internal usage

cmsbuild commented 3 months ago

A new Issue was created by @Sam-Harper.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fwyzard commented 3 months ago

@cms-sw/pf-l2 FYI

fwyzard commented 3 months ago

@waredjeb @jsamudio FYI

swagata87 commented 3 months ago

type pf

abdoulline commented 3 months ago

(in the meantime)

"HCAL had just come back into global after doing tests" -

according to the HCAL OPS conveners, the HCAL test in local was a standard sanity check (deployment) of the new L1 TriggerKey (TP LUT), with a single channel response correction update, and it then went on to be used in Global. No other special tests or changes (e.g. configuration changes) were done.

HCAL DQM conveners have been asked to carefully review the plots of the Cosmics run in question.

jsamudio commented 3 months ago

Allocation of the PF rechit fraction SoA is currently the number of rechits nRH * 250. In this particular event, nRH = 9577, so the rechit fraction SoA holds 2394250 entries while 2428285 are needed. The number of seeds (1899) and the number of topological clusters (257) both seem reasonable. In my mind this is just the same situation as #44634. Dynamic allocation of the rechit fraction SoA would probably alleviate this in a way that does not abuse the GPU memory.
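For reference, the arithmetic above can be checked with a standalone back-of-the-envelope snippet (the constants are taken from the log and the configuration quoted above, not from the actual kernel code):

#include <cstdint>
#include <iostream>

int main() {
  const uint32_t nRH = 9577;                  // rechits in the problematic event
  const uint32_t fracPerRH = 250;             // fixed per-rechit allocation factor
  const uint32_t capacity = nRH * fracPerRH;  // 2394250 rechit fraction slots
  const uint32_t needed = 2428285;            // *pcrhFracSize reported in the log
  std::cout << "capacity " << capacity << ", needed " << needed
            << ", overflow " << (needed - capacity) << '\n';  // overflow of 34035 entries
}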

I remember that during the Alpaka clustering development the recommendation was to allocate as much memory as needed via the configuration, since dynamic allocations had a notable performance cost. Has anything changed in this regard? Otherwise the "safest" allocation would be nRH*nRH, which is unrealistic.

fwyzard commented 3 months ago

I remember that during the Alpaka clustering development the recommendation was to allocate as much memory as needed via the configuration, since dynamic allocations had a notable performance cost. Has anything changed in this regard?

It depends on what is needed for the dynamic allocation.

If the only requirement is to change a configuration value with a runtime value, I don't expect any impact.

If it also requires splitting a kernel in two, it may add some overhead.

jsamudio commented 3 months ago

I remember that during the Alpaka clustering development the recommendation was to allocate as much memory as needed via the configuration, since dynamic allocations had a notable performance cost. Has anything changed in this regard?

It depends on what is needed for the dynamic allocation.

If the only requirement is to change a configuration value with a runtime value, I don't expect any impact.

If it also requires splitting a kernel in two, it may add some overhead.

In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number. These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka. Is such a thing possible in Alpaka or would we need to split things in the .cc EDProducer?
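For illustration, here is a minimal host-plus-kernel sketch of that pattern; the names are invented and the real code used cms::cuda::make_device_unique rather than the raw CUDA runtime calls used here to keep the sketch self-contained. The fraction count is copied device-to-host between the two kernel launches, the stream is synchronised on that copy, and the second buffer is then sized from the runtime value.

// Sketch of the CUDA-era pattern described above; kernel and variable names are
// illustrative, not the actual PF clustering code.
#include <cstdint>
#include <cuda_runtime.h>

__global__ void countFractions(uint32_t* nFrac) {
  // placeholder: the real kernel accumulates the number of rechit fractions
  *nFrac = 1024;
}

__global__ void fillFractions(float* frac, uint32_t n) {
  uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    frac[i] = 0.f;  // placeholder for the real fraction computation
}

void runClustering(cudaStream_t stream) {
  uint32_t* d_nFrac = nullptr;
  cudaMallocAsync(&d_nFrac, sizeof(uint32_t), stream);
  countFractions<<<1, 1, 0, stream>>>(d_nFrac);

  // device-to-host copy of the required size, then synchronise on it
  uint32_t nFrac = 0;
  cudaMemcpyAsync(&nFrac, d_nFrac, sizeof(uint32_t), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  // allocate the fraction buffer with the runtime size and launch the second kernel
  float* d_frac = nullptr;
  cudaMallocAsync(&d_frac, nFrac * sizeof(float), stream);
  fillFractions<<<(nFrac + 255) / 256, 256, 0, stream>>>(d_frac, nFrac);

  cudaFreeAsync(d_frac, stream);
  cudaFreeAsync(d_nFrac, stream);
}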

Sam-Harper commented 3 months ago

(in the meantime)

"HCAL had just come back into global after doing tests" -

according to the HCAL OPS conveners, the HCAL test in local was a standard sanity check (deployment) of the new L1 TriggerKey (TP LUT), with a single channel response correction update, and it then went on to be used in Global. No other special tests or changes (e.g. configuration changes) were done.

HCAL DQM conveners have been asked to carefully review the plots of the Cosmics run in question.

Thank you Salavat for the correction. Indeed there was a small game of telephone which led to my misunderstanding that the laser alignment tests were ongoing, when in fact they started just after this run.

abdoulline commented 3 months ago

@Sam-Harper my apologies, Sam... In fact HCAL OPS has already realized/admitted:

Update: DQM colleagues did confirm that the HCAL barrel occupancy in the problematic event pointed out by Sam in the intro is ~90% (~8k hits above ZeroSuppression), while in pp collisions it is kept at < 30% (and is naturally lower in regular Cosmics).

fwyzard commented 3 months ago

In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number.

These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka.

Is such a thing possible in Alpaka or would we need to split things in the .cc EDProducer?

It's possible to implement the same logic in Alpaka, but (like for CUDA) you also need to split the EDProducer in two, to introduce the synchronisation after the memcpy.

makortel commented 3 months ago

assign hlt, reconstruction

cmsbuild commented 3 months ago

New categories assigned: hlt,reconstruction

@Martin-Grunewald,@mmusich,@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 3 months ago

In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number.

These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka.

Is such a thing possible in Alpaka or would we need to split things in the .cc EDProducer?

It's possible to implement the same logic in Alpaka, but (like for CUDA) you also need to split the EDProducer in two, to introduce the synchronisation after the memcpy.

Just to clarify, given that PFClusterSoAProducer (this is the module in question, no?) is a stream::EDProducer<>, the "splitting in two" would amount to changing the base class to stream::SynchronizingEDProducer, moving the first part of the code, up to the device-to-host memcpy(), into the acquire() member function, and leaving the rest in the produce() member function.
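For concreteness, a skeletal sketch of what that restructuring could look like; the class, member, and buffer names are illustrative, the exact helper signatures should be checked against the CMSSW Alpaka interface headers, and this is not the actual PFClusterSoAProducer code:

// Sketch only: assuming the acquire()/produce() interface of
// ALPAKA_ACCELERATOR_NAMESPACE::stream::SynchronizingEDProducer in CMSSW.
#include <optional>
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/stream/SynchronizingEDProducer.h"
#include "HeterogeneousCore/AlpakaInterface/interface/memory.h"

namespace ALPAKA_ACCELERATOR_NAMESPACE {

  class PFClusterSoAProducerSketch : public stream::SynchronizingEDProducer<> {
  public:
    void acquire(device::Event const& event, device::EventSetup const&) override {
      // first half of the clustering: run the kernels that determine how many
      // rechit fractions this event needs, and copy that single counter to the host
      auto nFracDevice = cms::alpakatools::make_device_buffer<uint32_t>(event.queue());
      nFracHost_ = cms::alpakatools::make_host_buffer<uint32_t>(event.queue());
      // ... alpaka::exec<...>(event.queue(), ...) topo-clustering kernels here ...
      alpaka::memcpy(event.queue(), *nFracHost_, nFracDevice);
      // the framework synchronises the queue between acquire() and produce()
    }

    void produce(device::Event& event, device::EventSetup const&) override {
      // second half: size the rechit fraction SoA from the runtime value,
      // launch the remaining kernels, and put the products into the event
      uint32_t const nFrac = *nFracHost_->data();
      // ... allocate the fraction SoA with nFrac entries and finish the clustering ...
    }

  private:
    // host copy of the fraction counter, filled in acquire(), read in produce()
    std::optional<cms::alpakatools::host_buffer<uint32_t>> nFracHost_;
  };

}  // namespace ALPAKA_ACCELERATOR_NAMESPACE
// (constructor, fillDescriptions() and the module definition macro are omitted)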

jfernan2 commented 1 month ago

+1 Solved by https://github.com/cms-sw/cmssw/pull/46135

mmusich commented 4 weeks ago

proposed solutions:

mmusich commented 4 weeks ago

+hlt

cmsbuild commented 4 weeks ago

This issue is fully signed and ready to be closed.

makortel commented 3 weeks ago

@cmsbuild, please close