CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.

minimod benchmark from HeCBench fails result verification (Level0 backend) #600

Closed franz closed 11 months ago

franz commented 12 months ago

This could be an issue in chipStar or in the benchmark itself; it needs to be investigated.

with chipStar:

./scripts/autohecbench.py -v minimod-hip
....
ndamp = 27 27 27
grid = 100 100 100
FINAL min_u,  max_u = 0.000000, 0.000000
FAIL

with SYCL:

./scripts/autohecbench.py -v minimod-sycl
ndamp = 27 27 27
grid = 100 100 100
FINAL min_u,  max_u = -0.205791, 0.140146
PASS

zjin-lcf commented 11 months ago

The HIP program passes the check on an AMD GPU. Thanks.

pjaaskel commented 11 months ago

Does the OpenCL backend work?

pjaaskel commented 11 months ago

Also fails with OpenCL.

pjaaskel commented 11 months ago

The test is buggy: it reads past the end of buffers it allocates. The likely cause is that the group sizes are computed incorrectly, so some work-items access data past the grid size. The following printout makes this visible (other reads from u trigger it as well):

diff --git a/minimod-hip/minimig.cu b/minimod-hip/minimig.cu
index 28ed76ea..f1a337a9 100644
--- a/minimod-hip/minimig.cu
+++ b/minimod-hip/minimig.cu
@@ -109,12 +109,32 @@ __global__ void target_pml_3d_kernel(
     const llint suj = tj + R;
     const llint suk = tk + R;

+    const llint size_u = (nx + 2 * lx) * (ny + 2 * ly) * (nz + 2 * lz);
+
     const int z_side = ti / R;
-    s_u[ti+z_side*NDIM][suj][suk] = u[IDX3_l(i+(z_side*2-1)*R,j,k)];
+    const llint idx = IDX3_l(i+(z_side*2-1)*R,j,k);
+    if (idx >= size_u) {
+      printf("u access OOB at %lld (past %lld items)\n", (long long)idx, (long long)(idx - size_u));
+      s_u[ti+z_side*NDIM][suj][suk] = 0.f;
+    } else
+      s_u[ti+z_side*NDIM][suj][suk] = u[idx];

I do not know why it works on AMD, but I would need to understand the test case to fix it. This is very likely not a chipStar issue, as the test case is quite simple.
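
To illustrate the suspected mechanism, here is a minimal host-side sketch of how rounding the launch size up to a whole number of work-groups creates work-items with no grid point to process. The helper names (`blocks_for`, `excess_work_items`, `in_grid`) and the work-group size of 32 are assumptions made for illustration; they do not come from minimod or chipStar, and the real kernel also has to account for its halo of width R.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; not part of minimod or chipStar.

// Number of work-groups needed to cover n grid points along one dimension
// when each work-group holds block_size work-items (ceiling division).
inline std::int64_t blocks_for(std::int64_t n, std::int64_t block_size) {
    assert(n >= 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Work-items launched along one dimension that fall past the last grid point.
// These are the ones that read out of bounds if the kernel has no guard.
inline std::int64_t excess_work_items(std::int64_t n, std::int64_t block_size) {
    return blocks_for(n, block_size) * block_size - n;
}

// Guard a kernel could evaluate before touching global memory:
// true only when (i, j, k) lies inside the nx * ny * nz grid.
inline bool in_grid(std::int64_t i, std::int64_t j, std::int64_t k,
                    std::int64_t nx, std::int64_t ny, std::int64_t nz) {
    return i >= 0 && i < nx &&
           j >= 0 && j < ny &&
           k >= 0 && k < nz;
}
```

With the 100-point grid dimension from the log above and an assumed work-group size of 32, `blocks_for(100, 32)` is 4, so 128 work-items are launched and `excess_work_items(100, 32)` is 28. Whether an unguarded read from those extra work-items returns stale data, returns zeros, or faults depends on the device and the allocator, which may be why the same program can appear to pass on one GPU and fail on another.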

zjin-lcf commented 11 months ago

Thank you for reporting the issue.