fwyzard opened 5 years ago
Comments based on a first (ongoing) reading of the specification, version 1.2.1 revision 5:
a command group can contain only one kernel, together with multiple accessors that identify its prerequisites:
3.4.1.2 A SYCL command group object defines a set of requisites (R) and a kernel function (k).
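For reference, a minimal sketch of such a command group, with two accessors as requisites and exactly one kernel function (the function, buffer and kernel names are made up for illustration):

```c++
#include <CL/sycl.hpp>

void scale(cl::sycl::queue& queue,
           cl::sycl::buffer<float, 1>& in,
           cl::sycl::buffer<float, 1>& out,
           size_t n) {
  queue.submit([&](cl::sycl::handler& cgh) {
    // the accessors declare the requisites (R) of the command group ...
    auto in_acc  = in.get_access<cl::sycl::access::mode::read>(cgh);
    auto out_acc = out.get_access<cl::sycl::access::mode::discard_write>(cgh);
    // ... and the command group contains exactly one kernel function (k)
    cgh.parallel_for<class scale_kernel>(
        cl::sycl::range<1>(n),
        [=](cl::sycl::id<1> i) { out_acc[i] = 2.f * in_acc[i]; });
  });
}
```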
a host accessor is blocking:
3.5.1 Host accessors are a special type of accessor constructed from a memory object outside a command group, and require that the data associated with the given memory object is available on the host in the given pointer. This causes the runtime to block on construction of this object until the requirement has been satisfied. Host accessor objects are effectively barriers on all accesses to a certain memory object.
Also
3.6.9 The host accessor does not necessarily copy back to the same host memory as initially given by the user
So it doesn't seem possible to support system-wide atomic operations (e.g. between the host and the device), something that CUDA supports starting from Pascal (sm 6.x GPU) and Xavier (sm 7.2 SoC) according to the documentation.
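As an illustration of the blocking behaviour, a minimal sketch assuming a default-constructed queue and a small integer buffer (not taken from the spec; names are made up):

```c++
#include <CL/sycl.hpp>
#include <cassert>
#include <vector>

int main() {
  std::vector<int> data(16, 1);
  cl::sycl::queue queue;
  {
    cl::sycl::buffer<int, 1> buf(data.data(), cl::sycl::range<1>(data.size()));
    queue.submit([&](cl::sycl::handler& cgh) {
      auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
      cgh.parallel_for<class increment>(cl::sycl::range<1>(data.size()),
                                        [=](cl::sycl::id<1> i) { acc[i] += 1; });
    });
    // constructing a host accessor blocks until the kernel above has completed;
    // there is no way to keep a kernel running while the host accesses the
    // same memory object concurrently
    auto host_acc = buf.get_access<cl::sycl::access::mode::read>();
    assert(host_acc[0] == 2);
  }  // the buffer destructor writes the (possibly relocated) data back to `data`
  return 0;
}
```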
local memory (i.e. group-shared memory) can be allocated within the kernel or defined from the host:
3.5.2.1 To allocate local memory within a kernel, the user can either pass a cl::sycl::local_accessor object to the kernel as a parameter, or can define a variable in workgroup scope inside cl::sycl::parallel_for_work_group.
The second approach, however, forces one to rely only on the implicit barriers between parallel_for_work_item scopes, as the cl::sycl::h_item (4.8.1.7) and cl::sycl::group (4.8.1.8) objects do not have a barrier() method (worth opening an issue at https://github.com/KhronosGroup/SYCL-Docs/issues ?). Also, see the next point. A sketch of the first approach is shown below.
Hierarchical data parallel kernels allow for a more explicit distinction between the per-thread and per-group instructions and (shared) variables; however, the per-thread variables have a scope limited to each individual parallel loop:
3.6.3 All code within the parallel_for_work_group scope effectively executes once per work-group. Within the parallel_for_work_group scope, it is possible to call parallel_for_work_item which creates a new scope in which all work-items within the current work-group execute. [...] All variables declared inside the parallel_for_work_group scope are allocated in workgroup local memory, whereas all variables declared inside the parallel_for_work_item scope are declared in private memory.
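A sketch of the hierarchical form, showing where the work-group-scope and work-item-scope variables live (the sizes, names and kernel body are made up):

```c++
#include <CL/sycl.hpp>

void hierarchical_example(cl::sycl::queue& queue, cl::sycl::buffer<float, 1>& buf) {
  constexpr size_t num_groups = 8;
  constexpr size_t group_size = 64;
  queue.submit([&](cl::sycl::handler& cgh) {
    auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    cgh.parallel_for_work_group<class hier_kernel>(
        cl::sycl::range<1>(num_groups), cl::sycl::range<1>(group_size),
        [=](cl::sycl::group<1> group) {
          // executed once per work-group; allocated in work-group local memory
          float shared[group_size];
          group.parallel_for_work_item([&](cl::sycl::h_item<1> item) {
            // executed once per work-item; allocated in private memory,
            // with a scope limited to this parallel_for_work_item
            float x = acc[item.get_global_id(0)];
            shared[item.get_local_id(0)] = x * x;
          });
          // implicit work-group barrier here; there is no explicit group.barrier()
          group.parallel_for_work_item([&](cl::sycl::h_item<1> item) {
            acc[item.get_global_id(0)] = shared[item.get_local_id(0)];
          });
        });
  });
}
```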
It is not possible to synchronise only a subset of the threads:
3.6.5.2 Synchronization between work-items in a single work-group is achieved using a work-group barrier. [...] Note that the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all.
According to the documentation, CUDA Cooperative Groups allow for both finer-grained and grid-wide synchronisation.
3.10 kernels cannot include RTTI information, exception classes, recursive code, virtual functions
CUDA does support recursive functions and virtual functions.
3.10 Sharing data structures between host and device code imposes certain restrictions, such as use of only user defined classes that are C++11 standard layout classes for the data structures, and in general, no pointers initialized for the host can be used on the device. The only way of passing pointers to a kernel is through the cl::sycl::accessor class, which supports the cl::sycl::buffer and cl::sycl::image classes. No hierarchical structures of these classes are supported and any other data containers need to be converted to the SYCL data management classes using the SYCL interface.
CUDA definitely supports hierarchical structures based on pointers, either via a chain of cudaMalloc calls, or via managed memory. Also
4.7.2 A buffer does not map to only one OpenCL buffer object, and all OpenCL buffer memory objects may be temporary for use within a command group on a specific device.
4.7.2 The only exception to this rule is when a buffer is constructed from a cl_mem object to interoperate with OpenCL.
Could this be (ab)used to guarantee "stable" device pointers?
How would one implement a SoA with SYCL? As a "scalar" buffer with a single element of variable size?
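One possible answer, not necessarily the intended one: keep one buffer per field and capture one accessor per field in the kernel, so the device-side layout stays a structure of arrays. A hedged sketch (the struct, fields and kernel are made up):

```c++
#include <CL/sycl.hpp>

// one buffer per field: the data stays laid out as a structure of arrays
struct ParticlesSoA {
  cl::sycl::buffer<float, 1> x;
  cl::sycl::buffer<float, 1> y;
  cl::sycl::buffer<float, 1> pt;
};

void scale_pt(cl::sycl::queue& queue, ParticlesSoA& particles, size_t n) {
  queue.submit([&](cl::sycl::handler& cgh) {
    // one accessor per field is captured by the kernel lambda
    auto x  = particles.x.get_access<cl::sycl::access::mode::read>(cgh);
    auto y  = particles.y.get_access<cl::sycl::access::mode::read>(cgh);
    auto pt = particles.pt.get_access<cl::sycl::access::mode::read_write>(cgh);
    cgh.parallel_for<class scale_pt_kernel>(
        cl::sycl::range<1>(n), [=](cl::sycl::id<1> i) {
          pt[i] = pt[i] * (x[i] * x[i] + y[i] * y[i]);
        });
  });
}
```

The downside is that every field becomes a separate requisite and a separate kernel argument, instead of a single device pointer as in CUDA.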
It seems the "sub-groups" and "device-side enqueue" features were supposed to be in SYCL 2.2 ...
@makortel FYI
Thanks. Below I'm mostly thinking out loud.
3.6.9 The host accessor does not necessarily copy back to the same host memory as initially given by the user
So it doesn't seem possible to support concurrent, atomic operations between the host and the device (does CUDA managed memory support them?)
I don't know, but I really hope we don't need them (sounds like a potential slowdown).
3.6.5.2 Synchronization between work-items in a single work-group is achieved using a work-group barrier. [...] Note that the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all.
Does CUDA support partial synchronization within cooperative groups?
Does __syncthreads(), as a barrier for the threads in a block, count?
3.10 Sharing data structures between host and device code imposes certain restrictions, such as use of only user defined classes that are C++11 standard layout classes for the data structures, and in general, no pointers initialized for the host can be used on the device. ...
CUDA definitely supports hierarchical structures based on pointers, either via a chain of cudaMalloc calls, or via managed memory.
I'm hoping we would not need such data structures, but I can also imagine we could easily have cases where such structures would be needed. To me this point is sort of a double-edged sword: on one hand it is restrictive; on the other hand, I suppose SYCL would be the way for us to run on certain GPUs, so if we want to do that we would have to accept this restriction.
Further, OTOH, if we were to use a "higher-level" abstraction than SYCL, without such a restriction for the non-SYCL backends, we could easily start with the hardware that needs SYCL by just dropping the modules that need hierarchical structures.
So it doesn't seem possible to support concurrent, atomic operations between the host and the device (does CUDA managed memory support them?)
I don't know, but I really hope we don't need them (sounds like a potential slowdown).
According to the documentation CUDA supports system-wide atomic operations, starting from Pascal (sm 6.x GPU) and Xavier (sm 7.2 SoC):
Compute capability 6.x introduces new type of atomics which allows developers to widen or narrow the scope of an atomic operation. For example, atomicAdd_system guarantees that the instruction is atomic with respect to other CPUs and GPUs in the system.
3.6.5.2 Synchronization between work-items in a single work-group is achieved using a work-group barrier. [...] Note that the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all.
Does CUDA support partial synchronization within cooperative groups?
Does __syncthreads(), as a barrier for the threads in a block, count?
That corresponds to the SYCL workgroup barrier.
According to the documentation, cooperative groups should allow for different granularities. Unfortunately the documentation is a bit vague, so it's not clear, for example, whether this is allowed:
```
if (...) {
  // only the threads that take this branch become part of the coalesced group
  auto active = cooperative_groups::coalesced_threads();
  ...
  active.sync();  // synchronise only those threads
}
```
CUDA definitely supports hierarchical structures based on pointers, either via a chain of cudaMalloc calls, or via managed memory.
I'm hoping we would not need such data structures, but I can also imagine we could easily have cases where such structures would be needed. To me this point is sort of a double-edged sword: on one hand it is restrictive; on the other hand, I suppose SYCL would be the way for us to run on certain GPUs, so if we want to do that we would have to accept this restriction.
It seems Intel is adding some extensions to SYCL for its own compiler and GPUs: https://github.com/intel/llvm/blob/sycl/sycl/ReleaseNotes.md . For example:
- Raw pointers capturing added to the SYCL device front-end compiler. This capability is required for Unified Shared Memory feature implementation.
- New attributes for Intel FPGA device are added [...]
So our baseline may actually be a superset of SYCL 1.2.1 (or a new SYCL version).
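For illustration, a sketch of what such pointer-based code looks like with the USM extension. This uses the syntax that was later standardized in SYCL 2020, so it is not valid SYCL 1.2.1 and the details may differ from the Intel extension as it stands today:

```c++
#include <sycl/sycl.hpp>

int main() {
  sycl::queue queue;
  constexpr size_t n = 1024;
  // shared allocation, accessible from both host and device
  // (conceptually similar to CUDA managed memory)
  float* data = sycl::malloc_shared<float>(n, queue);
  for (size_t i = 0; i < n; ++i)
    data[i] = 1.f;
  // the raw pointer is captured directly by the kernel, with no accessors involved
  queue.parallel_for(sycl::range<1>(n),
                     [=](sycl::id<1> i) { data[i] += 1.f; }).wait();
  sycl::free(data, queue);
  return 0;
}
```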
Thanks for the clarifications.
It seems Intel is adding some extensions to SYCL for its own compiler and GPUs: https://github.com/intel/llvm/blob/sycl/sycl/ReleaseNotes.md . For example:
- Raw pointers capturing added to the SYCL device front-end compiler. This capability is required for Unified Shared Memory feature implementation.
- New attributes for Intel FPGA device are added [...]
So our baseline may actually be a superset of SYCL 1.2.1 (or a new SYCL version).
Interesting. It makes me feel even more strongly that, for the time being, it might be better not to commit to SYCL for all platforms, but to keep it specific to Intel (and adjust if/when the landscape changes).
Some more details:
I have not read them, but it looks like Intel's SYCL will have pointers and the equivalent of CUDA Unified Memory ...
Other useful extensions for us
In progress in the pixel track standalone code base:
- sycltest directory of the main repository
- sycl directory in this sycl branch
From https://www.khronos.org/sycl/:
Specifications:
Implementations: