No SYCL implementation with a GPU backend allows Q.parallel_for(std::numeric_limits<size_t>::max(), ...);,
while the same code works fine with OpenMP:
#pragma omp target teams distribute parallel for
for (size_t i = 0; i < std::numeric_limits<size_t>::max(); i++)
{}
The SYCL spec says this code is valid and should execute: SYCL has no concept of a maximum iteration space.
TLDR
I propose that:
We add a CTS test for Q.parallel_for(std::numeric_limits<size_t>::max(), ...);
We add a max_global_work_groups query to expose the maximum number of work-groups in an nd_range, similar to this Intel extension.
Too Much Info and Text
SYCL
The current SYCL behavior can be explained as follows: in SYCL it is natural to implement parallel_for as a direct kernel submission to the native backend, and some GPU native backends restrict the number of work-items that can be submitted. For example, both Level Zero and CUDA have a maximum work-group count and a maximum work-group size. OpenCL likewise has a notion of a maximum global_work_size:
global_work_size
Points to an array of work_dim unsigned values that describe the number of global work-items in work_dim dimensions that will execute the kernel function. The total number of global work-items is computed as global_work_size[0] *...* global_work_size[work_dim - 1].
The values specified in global_work_size cannot exceed the range given by the sizeof(size_t) for the device on which the kernel execution will be enqueued. The sizeof(size_t) for a device can be determined using CL_DEVICE_ADDRESS_BITS in the table of OpenCL Device Queries for [clGetDeviceInfo](https://registry.khronos.org/OpenCL/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html). If, for example, CL_DEVICE_ADDRESS_BITS = 32, i.e. the device uses a 32-bit address space, size_t is a 32-bit unsigned integer and global_work_size values must be in the range 1 .. 2^32 - 1. Values outside this range return a CL_OUT_OF_RESOURCES error.
OpenMP
In OpenMP, the "traditional" code generation adds an internal loop inside the submitted kernel, so the iteration space is independent of the number of work-items. To be clear, the previous user code will be lowered to something like:
__global__ void foo() {
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    // Some trip-count + offset computation giving this thread's range [N, M)
    for (size_t k = N; k < M; k++)
        __user_kernel(k);
}

int main() {
    foo<<<num_teams, num_threads>>>();
    return 0;
}
Problem and Solution
So in short, current implementations cannot submit a kernel with a large iteration space. Two main solutions:
1/ Implementers should just fix their buggy implementations. We should add a test to the CTS.
2/ We should add a concept of maximum group count to SYCL, similar to this Intel extension.
1/ Not your problem
I like 1/, as I like to think of SYCL as a high-level abstraction model. As in OpenMP, people should not be concerned with such low-level details. But I hear that people like performance.
2/ Handling large problem sizes adds overhead
The argument in favor of 2/ is that some native backends have limitations, and any effort to work around them has some overhead. We want SYCL to be a thin, zero-overhead abstraction layer, so our goal in SYCL is just to standardize general queries; users can then query those limits and deal with them appropriately.
3/ Middle ground
We can allow arbitrary iteration spaces for range and add a query for the maximum number of work-groups for nd_range. This sounds like a reasonable tradeoff.