KhronosGroup / SYCL-Docs

SYCL Open Source Specification
Other

Iteration Size Limitation? #386

Open TApplencourt opened 1 year ago

TApplencourt commented 1 year ago

Context

No implementation with a GPU backend allows Q.parallel_for(std::numeric_limits<size_t>::max(), ...);, while the same code works fine with OpenMP:

#pragma omp target teams distribute parallel for
for (size_t i = 0; i < std::numeric_limits<size_t>::max(); i++)
{}

The SYCL spec says that this code is valid and should execute: SYCL has no concept of a maximum iteration space.

TLDR

I propose the middle ground described in 3/ below.

Too Much Info and Text

SYCL

The current SYCL behavior can be explained as follows: in SYCL it is natural to implement parallel_for as a direct kernel submission to the native backend, and some GPU native backends restrict the number of work-items that can be submitted. For example, both Level Zero and CUDA have a maximum work-group count and work-group size. OpenCL also has a concept of a maximum global_work_size:

global_work_size

    Points to an array of work_dim unsigned values that describe the number of global work-items in work_dim dimensions that will execute the kernel function. The total number of global work-items is computed as global_work_size[0] *...* global_work_size[work_dim - 1].

    The values specified in global_work_size cannot exceed the range given by the sizeof(size_t) for the device on which the kernel execution will be enqueued. The sizeof(size_t) for a device can be determined using CL_DEVICE_ADDRESS_BITS in the table of OpenCL Device Queries for [clGetDeviceInfo](https://registry.khronos.org/OpenCL/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html). If, for example, CL_DEVICE_ADDRESS_BITS = 32, i.e. the device uses a 32-bit address space, size_t is a 32-bit unsigned integer and global_work_size values must be in the range 1 .. 2^32 - 1. Values outside this range return a CL_OUT_OF_RESOURCES error.

OpenMP

In OpenMP, the "traditional" code generation adds an internal loop inside the submitted kernel, so the iteration space is independent of the number of work-items. To be clear, the previous user code will be lowered to something like:

__global__ void foo() {
    size_t gindex = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    // Some trip-count + offset computation yielding N and M for this work-item
    for (size_t k = N; k < M; k++)
        __user_kernel(k);
}

int main() {
    foo<<<num_team, num_thread>>>();
    return 0;
}

Problem and Solution

So in short, current implementations cannot submit a kernel with a large iteration space. Two main solutions:

1/ Implementers should just fix their buggy implementations, and we should add a test to the CTS.
2/ We should add a concept of maximum group count to SYCL, similar to this Intel extension.

1/ Not your problem

I like 1/, as I like to think of SYCL as a high-level abstraction model. As in OpenMP, people should not be concerned with such low-level details. But I hear that people like performance.

2/ Handling large problem sizes adds overhead

The argument in favor of 2/ is that some native backends have limitations, and any effort to work around those limitations adds overhead. We want SYCL to be a thin, zero-overhead abstraction layer. Hence SYCL's goal is just to standardize general queries; people can then query for those limitations and deal with them appropriately.

3/ Middle ground

We can allow an arbitrary iteration space for range and add a query for the maximum number of work-groups for nd_range. This sounds like a reasonable tradeoff.

tomdeakin commented 1 year ago