Open iarspider opened 4 hours ago
A new Issue was created by @iarspider.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
RelVal 160.03502 should be disabled for ROCM IBs since it is a CUDA-only workflow.
assign heterogeneous
New categories assigned: heterogeneous
@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
DataFormats/SoATemplate/testRocmSoALayoutAndView_t
This is actually the expected behaviour, well, kind of.
The test causes an error on the GPU and tries to report that. However, with the HIP/ROCm runtime, the GPU-side error or "hardware exception" results in a crash/abort of the CPU-side application.
So, we want the test to fail, but maybe not this badly?
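Until the GPU-side error can be reported without killing the host process, one way to let such a test "fail on purpose" without taking the harness down is to run it in a child process and check how it died. The following is only a minimal sketch, not how the CMSSW unit tests are actually wired: `died_with_sigabrt` is a hypothetical helper, and the child here simply calls `os.abort()` as a stand-in for the ROCm runtime aborting the CPU-side application.

```python
import signal
import subprocess
import sys


def died_with_sigabrt(cmd):
    """Run cmd in a child process; return True if it was killed by SIGABRT.

    On POSIX, subprocess reports death-by-signal as a negative return code.
    """
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == -signal.SIGABRT


# Stand-ins for a real test binary (assumption for illustration only):
# one child aborts, as the runtime does on a hardware exception; one exits cleanly.
aborting_child = [sys.executable, "-c", "import os; os.abort()"]
clean_child = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    print(died_with_sigabrt(aborting_child))
    print(died_with_sigabrt(clean_child))
```

With a wrapper like this, the abort becomes an ordinary boolean that the test can assert on, at the cost of one extra process per check.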
HeterogeneousCore/AlpakaInterface/alpakaTestBufferROCmAsync
This is the same as https://github.com/cms-sw/cmssw/issues/46624#issuecomment-2462696659 .
HeterogeneousCore/AlpakaInterface/alpakaTestPrefixScanROCmAsync
I'm investigating this together with @AuroraPerego. It seems to be a problem with the test itself rather than with the functionality being tested; I should have a fix soon.
Relval 141.008583 step 2
This is currently implemented as a CUDA-only workflow ('--accelerators': 'gpu-nvidia').
@AdrianoDee do you know if this uses the alpaka version of the modules (then it could be changed to use '--accelerators': 'gpu-*') or the cuda version (then it should be disabled for the AMD tests)?
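To illustrate the difference between the two settings: `gpu-*` is a wildcard, so it can match an AMD label as well as `gpu-nvidia`, while `gpu-nvidia` matches only itself. The sketch below uses Python's `fnmatch` as a stand-in for the framework's own matching, and assumes the AMD accelerator label is `gpu-amd`; `can_run_on` and the workflow dictionaries are hypothetical names used only for illustration.

```python
from fnmatch import fnmatchcase


def can_run_on(workflow_args, available):
    """Return True if the workflow's '--accelerators' pattern matches any
    accelerator label available on the machine.

    A workflow without the option is treated as runnable anywhere
    (an assumption of this sketch).
    """
    pattern = workflow_args.get('--accelerators')
    if pattern is None:
        return True
    return any(fnmatchcase(label, pattern) for label in available)


# 'gpu-amd' is an assumed label for an AMD GPU machine.
amd_machine = ['cpu', 'gpu-amd']
cuda_only = {'--accelerators': 'gpu-nvidia'}
portable = {'--accelerators': 'gpu-*'}

if __name__ == "__main__":
    print(can_run_on(cuda_only, amd_machine))
    print(can_run_on(portable, amd_machine))
```

A filter of this kind could be used to skip the CUDA-only workflows on the AMD tests, but only the module implementation decides whether switching a workflow to `gpu-*` is actually valid.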
In fact, I think all Relval 141.0085xx step 3 workflows are CUDA-only and should not be run for the AMD GPU tests.
When possible I'll start looking at the *.40x workflows.
Yes, all these are data RelVals using the old CUDA setup.
All the 141.* + 160.*
> DataFormats/SoATemplate/testRocmSoALayoutAndView_t
> This is actually the expected behaviour, well, kind of.
> The test causes an error on the GPU and tries to report that. However for the HIP/ROCm runtime the GPU-side error or "hardware exception" results in a crash/abort of the CPU-side application.
> So, we want the test to fail, maybe not this badly ?
Yeah, turning the error into an exception (that could be checked in the test itself) would be highly desirable.
In CMSSW_14_2_ROCM_X_2024-11-06-2300 we observe multiple Unit test and RelVal failures:
HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception
HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception
Device-side assertion '0 == blockDimension % warpSize' failed.
followed by HSA_STATUS_ERROR_EXCEPTION
ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators
HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources
ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators
ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators
ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators
roc::DmaBlitManager::hsaCopyStaged
roc::DmaBlitManager::hsaCopyStaged
roc::DmaBlitManager::hsaCopyStaged
(SIGABRTs are either HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception or HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources)
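Since the same handful of messages recurs across the failures listed above, a small script can help group logs by failure type. This is a rough sketch only: the marker strings are copied from the messages quoted in this issue, while the category labels and the `classify` helper are hypothetical and not part of cms-bot or CMSSW.

```python
# Marker strings copied from the messages quoted in this issue.
# Order matters: the device-side assertion is followed by
# HSA_STATUS_ERROR_EXCEPTION, so it has to be checked first.
MARKERS = [
    ("device-assert", "Device-side assertion"),
    ("out-of-resources", "HSA_STATUS_ERROR_OUT_OF_RESOURCES"),
    ("hardware-exception", "HSA_STATUS_ERROR_EXCEPTION"),
    ("no-backend", "ModuleTypeResolverAlpaka had no backends available"),
    ("dma-copy", "roc::DmaBlitManager::hsaCopyStaged"),
]


def classify(log_text):
    """Return the label of the first marker found in log_text, else 'unknown'."""
    for label, marker in MARKERS:
        if marker in log_text:
            return label
    return "unknown"


if __name__ == "__main__":
    sample = (
        "Device-side assertion '0 == blockDimension % warpSize' failed.\n"
        "HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a "
        "hardware exception"
    )
    print(classify(sample))
```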