cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

[ROCM_X] Multiple RelVals failing in ROCM_X IB #46624

Open iarspider opened 4 hours ago

iarspider commented 4 hours ago

In CMSSW_14_2_ROCM_X_2024-11-06-2300 we observe multiple unit test and RelVal failures:

| What failed | Description |
| --- | --- |
| DataFormats/SoATemplate/testRocmSoALayoutAndView_t | HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception |
| HeterogeneousCore/AlpakaInterface/alpakaTestBufferROCmAsync | HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception |
| HeterogeneousCore/AlpakaInterface/alpakaTestPrefixScanROCmAsync | Many Device-side assertion '0 == blockDimension % warpSize' failed. followed by HSA_STATUS_ERROR_EXCEPTION |
| Relval 141.008583 step 2 | ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators |
| Relval 29834.403 step 2 | HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources |
| Relval 29834.404 step 2 | StdException |
| Relval 141.008507 step 3 | ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators |
| Relval 141.008508 step 3 | Fatal exception: Unable to choose current device because CUDAService is not preset or disabled. If CUDAService was not explicitly disabled in the configuration, the probable cause is that there is no GPU or there is some problem in the CUDA runtime or drivers. |
| Relval 141.008513 step 3 | ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators |
| Relval 141.008514 step 3 | BadAlloc |
| Relval 141.008523 step 3 | ModuleTypeResolverAlpaka had no backends available because of the combination of the job configuration and accelerator availability of on the machine. The job sees accelerators |
| Relval 141.008524 step 3 | BadAlloc |
| Relval 12834.402 step 3 | SIGSEGV in roc::DmaBlitManager::hsaCopyStaged |
| Relval 13034.402 step 3 | SIGABRT |
| Relval 13034.404 step 3 | SIGABRT |
| Relval 13034.406 step 3 | SIGABRT |
| Relval 13034.408 step 3 | SIGABRT |
| Relval 13050.402 step 3 | SIGABRT |
| Relval 13050.404 step 3 | SIGABRT |
| Relval 13050.406 step 3 | SIGSEGV in roc::DmaBlitManager::hsaCopyStaged |
| Relval 13050.408 step 3 | SIGABRT |
| Relval 13061.402 step 3 | SIGSEGV in roc::DmaBlitManager::hsaCopyStaged |
| Relval 29634.402 step 3 | SIGABRT |
| Relval 29834.402 step 3 | SIGABRT |
| Relval 160.03502 step 4 | BadAlloc |

(The SIGABRTs are either HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception, or HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources.)

cmsbuild commented 4 hours ago

cms-bot internal usage

cmsbuild commented 4 hours ago

A new Issue was created by @iarspider.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

aandvalenzuela commented 4 hours ago

RelVal 160.03502 should be disabled for ROCM IBs since it is a CUDA-only workflow.

makortel commented 4 hours ago

assign heterogeneous

cmsbuild commented 4 hours ago

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

fwyzard commented 3 hours ago

> DataFormats/SoATemplate/testRocmSoALayoutAndView_t

This is actually the expected behaviour, well, kind of.

The test causes an error on the GPU and tries to report that. However, with the HIP/ROCm runtime the GPU-side error or "hardware exception" results in a crash/abort of the CPU-side application.

So, we want the test to fail, just maybe not this badly?

fwyzard commented 3 hours ago

> HeterogeneousCore/AlpakaInterface/alpakaTestBufferROCmAsync

This is the same as https://github.com/cms-sw/cmssw/issues/46624#issuecomment-2462696659.

fwyzard commented 3 hours ago

> HeterogeneousCore/AlpakaInterface/alpakaTestPrefixScanROCmAsync

I'm investigating this together with @AuroraPerego. It seems to be a problem with the test itself rather than with the functionality being tested; I should have a fix soon.
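For readers unfamiliar with the assertion quoted in the table, `0 == blockDimension % warpSize` requires the block size to be a whole multiple of the warp size. The thread does not state the actual cause of the failure, so the following is only a sketch of that arithmetic. The warp size of 64 is an assumption (it is common on AMD GPUs, while 32 is usual on NVIDIA ones), and both function names are hypothetical, not part of CMSSW or alpaka.

```python
def is_valid_block(block_size: int, warp_size: int) -> bool:
    """Mirror the device-side check: the block must be a multiple of the warp."""
    return block_size % warp_size == 0


def round_up_to_multiple(block_size: int, warp_size: int) -> int:
    """Return the smallest multiple of warp_size that is >= block_size."""
    if block_size <= 0 or warp_size <= 0:
        raise ValueError("block_size and warp_size must be positive")
    return ((block_size + warp_size - 1) // warp_size) * warp_size


# A block of 32 threads satisfies the check for a warp size of 32,
# but not for an assumed warp size of 64.
ASSUMED_AMD_WARP_SIZE = 64  # assumption, not taken from the thread
ok_on_32 = is_valid_block(32, 32)
ok_on_64 = is_valid_block(32, ASSUMED_AMD_WARP_SIZE)
fixed = round_up_to_multiple(32, ASSUMED_AMD_WARP_SIZE)
```

A test that hard-codes a block size tuned for one warp size could trip this assertion on hardware with a different one, which would be consistent with the problem being in the test rather than in the prefix scan itself.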

fwyzard commented 2 hours ago

> Relval 141.008583 step 2

This is currently implemented as a CUDA-only workflow ('--accelerators': 'gpu-nvidia').

@AdrianoDee do you know if this uses the alpaka version of the modules (then it could be changed to use '--accelerators': 'gpu-*') or the CUDA version (then it should be disabled for the AMD tests)?
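To illustrate why 'gpu-nvidia' leaves no backend available on an AMD machine while 'gpu-*' would, here is a minimal sketch of wildcard matching against the accelerators a node offers. It uses Python's `fnmatch` as a stand-in and is not CMSSW's actual matching code; the function name `select_accelerators` and the label 'gpu-amd' are assumptions made for the example.

```python
from fnmatch import fnmatchcase


def select_accelerators(patterns, available):
    """Return the available accelerator labels matched by any of the patterns."""
    return [
        name
        for name in available
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical set of accelerators seen by a job on an AMD GPU node.
amd_node = ["cpu", "gpu-amd"]

cuda_only = select_accelerators(["gpu-nvidia"], amd_node)  # nothing matches
any_gpu = select_accelerators(["gpu-*"], amd_node)         # matches the AMD GPU
```

With the CUDA-only pattern the selection is empty, which corresponds to the "had no backends available" message in the table above.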

fwyzard commented 2 hours ago

In fact, I think all "Relval 141.0085xx step 3" workflows are CUDA-only and should not be run for the AMD GPU tests.

fwyzard commented 2 hours ago

When possible I'll start looking at the *.40x workflows.

AdrianoDee commented 1 hour ago

Yes, all these are data RelVals using the old CUDA setup.

AdrianoDee commented 1 hour ago

All the 141.* + 160.*
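As a rough illustration of what excluding the 141.* and 160.* workflows from the ROCm IB would mean for the list above, here is a small sketch that filters workflow IDs by prefix. The names `CUDA_ONLY_PREFIXES`, `is_cuda_only` and `workflows_for_rocm` are hypothetical and do not correspond to the real RelVal matrix tooling.

```python
# Prefixes of the workflow families identified in this thread as CUDA-only.
# The trailing dot keeps "141." from also matching e.g. "1410.x".
CUDA_ONLY_PREFIXES = ("141.", "160.")


def is_cuda_only(workflow_id: str) -> bool:
    """True if the workflow ID belongs to one of the CUDA-only families."""
    return workflow_id.startswith(CUDA_ONLY_PREFIXES)


def workflows_for_rocm(workflow_ids):
    """Keep only the workflows that are not known to be CUDA-only."""
    return [wf for wf in workflow_ids if not is_cuda_only(wf)]


# A few IDs taken from the failure table in the issue description.
failing = ["141.008583", "29834.403", "141.008508", "12834.402", "160.03502"]
remaining = workflows_for_rocm(failing)
```

Workflow IDs are kept as strings here so that trailing zeros and the dotted prefix survive unchanged.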

makortel commented 1 hour ago

> DataFormats/SoATemplate/testRocmSoALayoutAndView_t
>
> This is actually the expected behaviour, well, kind of.
>
> The test causes an error on the GPU and tries to report that. However, with the HIP/ROCm runtime the GPU-side error or "hardware exception" results in a crash/abort of the CPU-side application.
>
> So, we want the test to fail, just maybe not this badly?

Yeah, turning the error into an exception (that could be checked in the test itself) would be highly desirable.
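The difference between the two failure modes discussed here can be sketched without a GPU. In the minimal example below, a child process that calls `os.abort()` stands in for a host application killed by the runtime, and a plain Python exception stands in for an error the test can check itself. Both stand-ins and all function names are assumptions for illustration; this is not how the CMSSW test is implemented, and observing the abort from a parent process is a workaround that nobody in this thread has proposed.

```python
import subprocess
import sys

# Hypothetical stand-in for a test binary that the runtime aborts
# after a GPU-side "hardware exception".
ABORTING_CHILD = "import os; os.abort()"


def run_aborting_child() -> int:
    """Run the aborting stand-in in a child process and return its exit code.

    The abort cannot be caught inside the child; it is only visible
    from outside, as a non-zero return code of the process.
    """
    result = subprocess.run(
        [sys.executable, "-c", ABORTING_CHILD],
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def launch_with_exception() -> None:
    """Hypothetical stand-in for an error reported as an exception."""
    raise RuntimeError("device-side error reported as an exception")


def exception_is_checkable() -> bool:
    """An exception can be caught and verified inside the test itself."""
    try:
        launch_with_exception()
    except RuntimeError:
        return True
    return False
```

On POSIX systems, `subprocess` reports a child killed by a signal with a negative return code, so a harness can at least distinguish an abort from an ordinary test failure. The exception-based variant needs no second process.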