CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.

[OpenCL] CHIP_USE_INTEL_USM option fails on Intel CPU driver #859

Open pvelesko opened 1 month ago

pvelesko commented 1 month ago
The following tests FAILED:
    605 - Unit_hipMemsetAsync_VerifyExecutionWithKernel (Failed)
    772 - Unit_hipDeviceSynchronize_Functional (Failed)
    812 - Unit_hipTextureObj1DCheckRGBAModes - array (Timeout)
    813 - Unit_hipTextureObj1DCheckRGBAModes - buffer (Timeout)
    814 - Unit_hipTextureObj2DCheckRGBAModes (Failed)
    848 - hipMemset_Unit_hipMemsetAsync_SetMemoryWithOffset_Helgrind (Failed)
    875 - TestWholeProgramCompilation (Failed)
    888 - firstTouch (Failed)
    892 - TestLazyModuleInit (Failed)
    896 - TestLargeGlobalVar (Subprocess aborted)
    898 - TestGlobalVarInit (Subprocess aborted)
    905 - TestIndirectMappedHostAlloc (Failed)
    921 - hipBlas-sgemm (Failed)
    939 - memcpy3D (Failed)
    944 - hipTestSymbolReset (Subprocess aborted)
    945 - hipTestSymbolInit (Subprocess aborted)
    947 - hipTestVariableTemplateSymbols (Subprocess aborted)
    984 - cuda-convolutionSeparable (Failed)
    987 - cuda-binomialoptions (Failed)
    989 - cuda-qrng (Failed)
    994 - cuda-FDTD3d (Failed)

Most of them fail with the following error:

870/995 Test #891: TestLazyModuleInit ........................................................***Failed    0.79 sec
CHIP error [TID 348134] [1716469923.170854238] : hipErrorTbd (CL_INVALID_OPERATION ) in /home/pvelesko/space/chipStar/main/src/backend/OpenCL/CHIPBackendOpenCL.cc:1276:launchImpl

CHIP error [TID 348134] [1716469923.171270931] : Caught Error: hipErrorTbd
pvelesko commented 1 month ago
╭─pvelesko@cupcake ~/space/chipStar/main ‹b39011d5●›
╰─$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[f06b69466f8dc5dd55fb42a639c0e4d3f4d83495] OpenCL: Fix USM indirect flags
╭─pvelesko@cupcake ~/space/chipStar/main ‹f06b6946●›

Commit f06b6946 ("OpenCL: Fix USM indirect flags") probably caused the issues.
pvelesko commented 1 month ago

Manually setting CHIP_USE_INTEL_USM=OFF allows the tests to run to completion successfully.

linehill commented 1 month ago

Skipping clSetKernelExecInfo() on USM allocations works too (at least for TestLazyModuleInit, TestLargeGlobalVar, TestGlobalVarInit):

diff --git a/src/backend/OpenCL/CHIPBackendOpenCL.cc b/src/backend/OpenCL/CHIPBackendOpenCL.cc
index eb2de777..ee32a427 100644
--- a/src/backend/OpenCL/CHIPBackendOpenCL.cc
+++ b/src/backend/OpenCL/CHIPBackendOpenCL.cc
@@ -257,6 +257,7 @@ annotateIndirectPointers(const CHIPContextOpenCL &Ctx,
     break;
   case AllocationStrategy::IntelUSM:
     PtrListName = CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL;
+    return nullptr; // DEBUG
     break;
   case AllocationStrategy::BufferDevAddr:
     PtrListName = CL_KERNEL_EXEC_INFO_DEVICE_PTRS_EXT;

It seems that passing any list of (valid) USM allocations via CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL breaks kernel launches. A bug in the driver?
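
For readers unfamiliar with the call being discussed, here is a minimal sketch of how a list of indirectly accessed USM pointers is typically handed to clSetKernelExecInfo(). This is not chipStar's implementation: the helper name annotateUsmPtrs and the constant kUsmPtrsIntel are hypothetical, and the setter is passed in as a callable so that the sketch compiles without OpenCL headers or a driver. The value 0x4203 is the one the cl_intel_unified_shared_memory extension assigns to CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Value of CL_KERNEL_EXEC_INFO_USM_PTRS_INTEL as defined by the
// cl_intel_unified_shared_memory extension (normally from <CL/cl_ext.h>).
constexpr std::uint32_t kUsmPtrsIntel = 0x4203;

// Hypothetical helper, not chipStar code. SetExecInfo is any callable with
// the parameter shape of
//   clSetKernelExecInfo(kernel, param_name, param_value_size, param_value).
// With the OpenCL headers available, clSetKernelExecInfo itself can be
// passed in together with a cl_kernel.
// Returns 0 (the value of CL_SUCCESS) when there is nothing to annotate.
template <typename Kernel, typename SetExecInfo>
int annotateUsmPtrs(Kernel kernel, const std::vector<void *> &usmPtrs,
                    SetExecInfo setExecInfo) {
  if (usmPtrs.empty())
    return 0; // no indirect pointers: skip the call entirely

  // param_value is an array of pointers; param_value_size is in bytes.
  return setExecInfo(kernel, kUsmPtrsIntel,
                     usmPtrs.size() * sizeof(void *), usmPtrs.data());
}
```

Passing a setter that only records its arguments is one way to check exactly what would be sent to the driver before a launch. The DEBUG line in the diff above corresponds to never reaching the setter for the IntelUSM strategy.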