Closed: Sam-Harper closed this issue 3 weeks ago.
cms-bot internal usage
A new Issue was created by @Sam-Harper.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@cms-sw/pf-l2 FYI
@waredjeb @jsamudio FYI
type pf
(in the meantime)
"HCAL had just come back into global after doing tests" -
according to the HCAL OPS conveners, the HCAL test in local was a standard sanity check (deployment) of the new L1 TriggerKey (TP LUT) with a single channel response correction update, and it then went on to be used in Global. No other or special tests/changes (such as configuration changes) were done.
HCAL DQM conveners are asked to carefully review the plots of the Cosmics run in question.
Allocation of the PF rechit fraction SoA is currently the number of rechits nRH * 250. In this particular event, nRH = 9577, so the rechit fraction SoA is sized for 2394250 entries, while this event needs 2428285. The number of seeds (1899) and the number of topological clusters (257) both seem reasonable. In my mind this is just the same situation as #44634. Dynamic allocation of the rechit fraction SoA would probably alleviate this in a way that does not abuse the GPU memory.
I remember during the Alpaka clustering development the recommendation was to allocate as much memory as needed in the configuration, since dynamic allocations had a notable detriment to performance. Has anything changed in this regard? Otherwise the "safest" configuration is nRH*nRH, and this would be unrealistic.
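For concreteness, a minimal standalone sketch of the arithmetic above (the multiplier of 250 per rechit and the event numbers are the ones quoted in this issue; this is illustrative, not the actual producer code):

```cpp
#include <cstdint>
#include <iostream>

int main() {
  constexpr uint32_t kFracsPerRecHit = 250;  // fixed multiplier from the configuration
  const uint32_t nRH = 9577;                 // rechits in the problematic event
  const uint64_t allocated = uint64_t(nRH) * kFracsPerRecHit;  // 2394250 slots
  const uint64_t needed = 2428285;           // rechit fractions this event requires
  std::cout << "allocated " << allocated << ", needed " << needed
            << (needed > allocated ? " -> out-of-bounds writes\n" : " -> ok\n");
  return 0;
}
```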
I remember during the Alpaka clustering development the recommendation was to allocate as much memory as needed in the configuration since dynamic allocations had a notable detriment to performance. Has anything changed in this regard?
It depends on what is needed for the dynamic allocation.
If the only requirement is to change a configuration value with a runtime value, I don't expect any impact.
If it also requires splitting a kernel in two, it may add some overhead.
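As an illustration of the first case, a hedged sketch of sizing the device buffer from a runtime value instead of the configured nRH * 250, assuming the cms::alpakatools helpers from HeterogeneousCore/AlpakaInterface (the function name and the nFracs argument here are hypothetical):

```cpp
#include <cstdint>

#include "HeterogeneousCore/AlpakaInterface/interface/memory.h"

// Size the rechit-fraction buffer from a number computed earlier in the event
// rather than from the Python configuration; only the extent changes, so no
// extra kernel or synchronisation is introduced at this point.
template <typename TQueue>
auto allocateFractionBuffer(TQueue& queue, uint32_t nFracs) {
  return cms::alpakatools::make_device_buffer<float[]>(queue, nFracs);
}
```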
In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number. These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka. Is such a thing possible in Alpaka, or would we need to split things in the .cc EDProducer?
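For reference, a hedged sketch of that CUDA pattern using the plain CUDA runtime API (d_nFracs and the surrounding kernels are placeholders; the CMSSW code would use cms::cuda::make_device_unique rather than cudaMallocAsync):

```cpp
#include <cstdint>

#include <cuda_runtime.h>

void resizeFractionsBetweenKernels(const uint32_t* d_nFracs, cudaStream_t stream) {
  // copy the number of rechit fractions computed by the first kernel
  uint32_t nFracsHost = 0;
  cudaMemcpyAsync(&nFracsHost, d_nFracs, sizeof(uint32_t),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // nFracsHost is only valid after this sync

  // allocate the fraction buffer with the runtime size
  // (in CMSSW: cms::cuda::make_device_unique<float[]>(nFracsHost, stream))
  float* d_fracs = nullptr;
  cudaMallocAsync(&d_fracs, nFracsHost * sizeof(float), stream);

  // ... launch the second kernel writing into d_fracs ...

  cudaFreeAsync(d_fracs, stream);
}
```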
Thank you Salavat for the correction. Indeed there was a small game of telephone which led to my misunderstanding that the laser alignment tests were ongoing, when in fact they started just after this run.
@Sam-Harper my apologies, Sam... In fact HCAL OPS has already realized/admitted:
Update: DQM colleagues did confirm that the HCAL barrel occupancy in the problematic event pointed out by Sam in the intro is ~90% (~8k hits above ZeroSuppression), while in pp collisions it is kept at < 30% (and is naturally lower in regular Cosmics).
In CUDA we had a cudaMemcpyAsync device to host with the number of rechit fractions needed, and some cms::cuda::make_device_unique using that number. These steps were taken between two CUDA kernel invocations in the .cu, equivalent to between two alpaka::exec in the .dev.cc in Alpaka. Is such a thing possible in Alpaka, or would we need to split things in the .cc EDProducer?
It's possible to implement the same logic in Alpaka, but (like for CUDA) you also need to split the EDProducer in two, to introduce the synchronisation after the memcpy.
assign hlt, reconstruction
New categories assigned: hlt,reconstruction
@Martin-Grunewald, @mmusich, @jfernan2, @mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
Just to clarify, given that PFClusterSoAProducer (this is the module in question, no?) is a stream::EDProducer<>, the "splitting in two" would be about changing the base class to stream::SynchronizingEDProducer, moving the first part of the code, up to the device-to-host memcpy(), to be called from the acquire() member function, and leaving the rest to the produce() member function.
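A minimal skeleton of that split, with signatures assumed from the CMSSW Alpaka examples (class and member names here are hypothetical, not the actual PFClusterSoAProducer code):

```cpp
#include "HeterogeneousCore/AlpakaCore/interface/alpaka/stream/SynchronizingEDProducer.h"
#include "HeterogeneousCore/AlpakaInterface/interface/memory.h"

namespace ALPAKA_ACCELERATOR_NAMESPACE {

  class PFClusterSoAProducerSketch : public stream::SynchronizingEDProducer<> {
  public:
    void acquire(device::Event const& event, device::EventSetup const& setup) override {
      // run the first kernels and queue the device-to-host copy of the
      // number of rechit fractions; the framework synchronises the queue
      // between acquire() and produce()
    }

    void produce(device::Event& event, device::EventSetup const& setup) override {
      // the copied value is now valid on the host: allocate the rechit
      // fraction SoA with it and run the remaining kernels
    }

  private:
    // cms::alpakatools::host_buffer<uint32_t> nFracs_;  // filled in acquire()
  };

}  // namespace ALPAKA_ACCELERATOR_NAMESPACE
```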
+1 Solved by https://github.com/cms-sw/cmssw/pull/46135
proposed solutions target CMSSW_14_2_0_pre2 and CMSSW_14_1_1
This issue is fully signed and ready to be closed.
@cmsbuild, please close
There were widespread crashes (~900) in cosmics run 383219.
No changes to the HLT / release had been made around this time, and no other runs had this issue either immediately before or after. It should be noted that HCAL had just come back into global after doing tests. Thus it seems plausible that HCAL came back in a weird state and that this is the cause of the crashes. Therefore I think HCAL experts should review this event (and this run) to ensure they were sending good data to us.
The crash is fully reproducible on the hilton and also on my local CPU-only machine. The crash happens if the PFClustering is run; if it is not run, the crash does not happen.
An example event which crashes is at /eos/cms/store/group/tsg/FOG/debug/240715_run383219/run383219_ls013_29703372.root
The cosmics menu run is at /eos/cms/store/group/tsg/FOG/debug/240715_run38219/hlt.py
A minimal menu with just the HCAL reco is /eos/cms/store/group/tsg/FOG/debug/240715_run383219/hltMinimal.py
The release was CMSSW_14_0_11_MULTIARCHS, but the crash is also reproduced in CMSSW_14_0_11.
The error on CPU is
The error on GPU is
gpuCrash.log
@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI @cms-sw/hcal-dpg-l2 FYI