Changes to reduce kernel launch overheads

linehill commented 7 months ago

A patch set for reducing kernel launch overheads. These improved HeCBench's mrc, floydwarshall and overlay benchmarks by 41-57% on PVC.

Skip kernel info string construction when it is not going to be logged/displayed.
Reduce kernel launch overhead from __hipPushCallConfiguration() and __hipPopCallConfiguration() functions:
- Remove unnecessary backend initialization (backend will be initialized at hipLaunchKernel()).
- Avoid creating short living temporary, heap allocated chipStar::ExecItem.
Exit earlier from prepareDeviceVariablesNoLock().
Recycle OpenCL kernel handles instead of creating them for each exec-item.

pvelesko commented 6 months ago

Unit test failures for iGPU OpenCL

igpu_opencl_make_check_result.txt: FAIL
    377 - ABM_AddKernel_MultiTypeMultiSize - int (Failed)
    378 - ABM_AddKernel_MultiTypeMultiSize - long (SEGFAULT)
    380 - ABM_AddKernel_MultiTypeMultiSize - long long (Failed)

among others

linehill commented 6 months ago

igpu_opencl_make_check_result.txt: FAIL
  377 - ABM_AddKernel_MultiTypeMultiSize - int (Failed)
  378 - ABM_AddKernel_MultiTypeMultiSize - long (SEGFAULT)
  380 - ABM_AddKernel_MultiTypeMultiSize - long long (Failed)

What’s going on with those tests? They should be failing expectedly.

2024-03-08T03:28:15.1003549Z Name:                          Intel(R) UHD Graphics 730
(...)
2024-03-08T03:28:15.1008640Z maxThreadsPerBlock:            512
2024-03-08T03:28:15.1008861Z maxThreadsDim.x:               512

Is this the iGPU on which the tests fail? The tests should be failing as they try to launch kernels over the supported thread block size (1000 vs. 512):

TEMPLATE_TEST_CASE("ABM_AddKernel_MultiTypeMultiSize", "", int, long, float, long long, double) {
  auto size = GENERATE(as<size_t>{}, 100, 500, 1000);
  // (...)
  hipLaunchKernelGGL(add<TestType>, 1, size, 0, 0, d_a, d_b, d_c, size);
  HIP_CHECK(hipGetLastError());

I see the tests failing on the main branch on my iGPU where thread block sizes limited to 256. If these tests are passing on the main branch on the CI’s iGPU - why are they passing in the first place?

pvelesko commented 6 months ago

@linehill tests are limited to 500 max so they don't fail on main for me not should they be failing. The test in question:

TEMPLATE_TEST_CASE("ABM_AddKernel_MultiTypeMultiSize", "", int, long, float, long long, double) {
  auto size = GENERATE(as<size_t>{}, 100, 500);
  TestType *d_a, *d_b, *d_c;
  auto res = hipMalloc(&d_a, sizeof(TestType) * size);
  REQUIRE(res == hipSuccess);
  res = hipMalloc(&d_b, sizeof(TestType) * size);
  REQUIRE(res == hipSuccess);
  res = hipMalloc(&d_c, sizeof(TestType) * size);
  REQUIRE(res == hipSuccess);

  std::vector<TestType> a, b, c;
  for (size_t i = 0; i < size; i++) {
    a.push_back(i + 1);
    b.push_back(i + 1);
    c.push_back(2 * (i + 1));
  }

  res = hipMemcpy(d_a, a.data(), sizeof(TestType) * size, hipMemcpyHostToDevice);
  REQUIRE(res == hipSuccess);
  res = hipMemcpy(d_b, b.data(), sizeof(TestType) * size, hipMemcpyHostToDevice);
  REQUIRE(res == hipSuccess);

  hipLaunchKernelGGL(add<TestType>, 1, size, 0, 0, d_a, d_b, d_c, size);
  HIP_CHECK(hipGetLastError());

  res = hipMemcpy(a.data(), d_c, sizeof(TestType) * size, hipMemcpyDeviceToHost);
  REQUIRE(res == hipSuccess);

  HIP_CHECK(hipFree(d_a));
  HIP_CHECK(hipFree(d_b));
  HIP_CHECK(hipFree(d_c));
  REQUIRE(a == c);
}

You probably looked in hip-tests which is not enabled by default.

pvelesko commented 6 months ago

iGPU where thread block sizes limited to 256

you just pasted a snippet where it says 512, not 256.

pvelesko commented 6 months ago

Is this the iGPU on which the tests fail?

yes, all tests that fail, fail on the iGPU OpenCL backend.

linehill commented 6 months ago

iGPU where thread block sizes limited to 256

you just pasted a snippet where it says 512, not 256.

512 is figure from CI's test log.

You probably looked in hip-tests which is not enabled by default.

I see, I thought chipStar switched to hip-tests.

linehill commented 6 months ago

No new changes, just rebase.

CHIP-SPV / chipStar

Changes to reduce kernel launch overheads #794