fwyzard opened 5 years ago
Comments based on a first (ongoing) reading of the specification, version 1.2.1 revision 5:
a command group can contain only one kernel, together with multiple accessors that identify its prerequisites:
3.4.1.2 A SYCL command group object defines a set of requisites (R) and a kernel function (k).
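For reference, a minimal sketch of such a command group, with two accessors as requisites and exactly one kernel function (the function, buffer and kernel names are made up for illustration):

```c++
#include <CL/sycl.hpp>

void scale(cl::sycl::queue& queue,
           cl::sycl::buffer<float, 1>& in,
           cl::sycl::buffer<float, 1>& out,
           size_t n) {
  queue.submit([&](cl::sycl::handler& cgh) {
    // the accessors declare the requisites (R) of the command group ...
    auto in_acc  = in.get_access<cl::sycl::access::mode::read>(cgh);
    auto out_acc = out.get_access<cl::sycl::access::mode::discard_write>(cgh);
    // ... and the command group contains exactly one kernel function (k)
    cgh.parallel_for<class scale_kernel>(
        cl::sycl::range<1>(n),
        [=](cl::sycl::id<1> i) { out_acc[i] = 2.f * in_acc[i]; });
  });
}
```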
a host accessor is blocking:
3.5.1 Host accessors are a special type of accessor constructed from a memory object outside a command group, and require that the data associated with the given memory object is available on the host in the given pointer. This causes the runtime to block on construction of this object until the requirement has been satisfied. Host accessor objects are effectively barriers on all accesses to a certain memory object.
Also
3.6.9 The host accessor does not necessarily copy back to the same host memory as initially given by the user
So it doesn't seem possible to support system-wide atomic operations (e.g. between the host and the device), something that CUDA supports starting from Pascal (sm 6.x GPU) and Xavier (sm 7.2 SoC) according to the documentation.
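As an illustration of the blocking behaviour, a minimal sketch assuming a default-constructed queue and a small integer buffer (not taken from the spec; names are made up):

```c++
#include <CL/sycl.hpp>
#include <cassert>
#include <vector>

int main() {
  std::vector<int> data(16, 1);
  cl::sycl::queue queue;
  {
    cl::sycl::buffer<int, 1> buf(data.data(), cl::sycl::range<1>(data.size()));
    queue.submit([&](cl::sycl::handler& cgh) {
      auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
      cgh.parallel_for<class increment>(cl::sycl::range<1>(data.size()),
                                        [=](cl::sycl::id<1> i) { acc[i] += 1; });
    });
    // constructing a host accessor blocks until the kernel above has completed;
    // there is no way to keep a kernel running while the host accesses the
    // same memory object concurrently
    auto host_acc = buf.get_access<cl::sycl::access::mode::read>();
    assert(host_acc[0] == 2);
  }  // the buffer destructor writes the (possibly relocated) data back to `data`
  return 0;
}
```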
local memory (i.e. group-shared memory) can be allocated within the kernel or defined from the host:
3.5.2.1 To allocate local memory within a kernel, the user can either pass a cl::sycl::local_accessor object to the kernel as a parameter, or can define a variable in workgroup scope inside cl::sycl::parallel_for_work_group.
The second approach, however, forces one to rely only on the implicit barriers between parallel_for_work_item scopes, as the cl::sycl::h_item (4.8.1.7) and cl::sycl::group (4.8.1.8) objects do not have a barrier() method (worth opening an issue at https://github.com/KhronosGroup/SYCL-Docs/issues ?). Also, see the next point. A sketch of the first approach is shown below.
Hierarchical data parallel kernels allow for a more explicit distinction between the per-thread and per-group instructions and (shared) variables; however, the per-thread variables have a scope limited to each individual parallel loop:
3.6.3 All code within the parallel_for_work_group scope effectively executes once per work-group. Within the parallel_for_work_group scope, it is possible to call parallel_for_work_item which creates a new scope in which all work-items within the current work-group execute. [...] All variables declared inside the parallel_for_work_group scope are allocated in workgroup local memory, whereas all variables declared inside the parallel_for_work_item scope are declared in private memory.
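A sketch of the hierarchical form, showing where the work-group-scope and work-item-scope variables live (the sizes, names and kernel body are made up):

```c++
#include <CL/sycl.hpp>

void hierarchical_example(cl::sycl::queue& queue, cl::sycl::buffer<float, 1>& buf) {
  constexpr size_t num_groups = 8;
  constexpr size_t group_size = 64;
  queue.submit([&](cl::sycl::handler& cgh) {
    auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    cgh.parallel_for_work_group<class hier_kernel>(
        cl::sycl::range<1>(num_groups), cl::sycl::range<1>(group_size),
        [=](cl::sycl::group<1> group) {
          // executed once per work-group; allocated in work-group local memory
          float shared[group_size];
          group.parallel_for_work_item([&](cl::sycl::h_item<1> item) {
            // executed once per work-item; allocated in private memory,
            // with a scope limited to this parallel_for_work_item
            float x = acc[item.get_global_id(0)];
            shared[item.get_local_id(0)] = x * x;
          });
          // implicit work-group barrier here; there is no explicit group.barrier()
          group.parallel_for_work_item([&](cl::sycl::h_item<1> item) {
            acc[item.get_global_id(0)] = shared[item.get_local_id(0)];
          });
        });
  });
}
```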
It is not possible to synchronise only a subset of the threads:
3.6.5.2 Synchronization between work-items in a single work-group is achieved using a work-group barrier. [...] Note that the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all.
According to the documentation, CUDA Cooperative Groups allow for both finer-grained and grid-wide synchronisation.
3.10 kernels cannot include RTTI information, exception classes, recursive code, virtual functions
CUDA does support recursive functions and virtual functions.
3.10 Sharing data structures between host and device code imposes certain restrictions, such as use of only user defined classes that are C++11 standard layout classes for the data structures, and in general, no pointers initialized for the host can be used on the device. The only way of passing pointers to a kernel is through the cl::sycl::accessor class, which supports the cl::sycl::buffer and cl::sycl::image classes. No hierarchical structures of these classes are supported and any other data containers need to be converted to the SYCL data management classes using the SYCL interface.
CUDA definitely supports hierarchical structures based on pointers, either via a chain of cudaMalloc calls, or via managed memory. Also
4.7.2 A buffer does not map to only one OpenCL buffer object, and all OpenCL buffer memory objects may be temporary for use within a command group on a specific device.
4.7.2 The only exception to this rule is when a buffer is constructed from a cl_mem object to interoperate with OpenCL.
Could this be (ab)used to guarantee "stable" device pointers?
How would one implement a SoA with SYCL? As a "scalar" buffer with a single element of variable size?
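One possible answer, not necessarily the intended one: keep one buffer per field and capture one accessor per field in the kernel, so the device-side layout stays a structure of arrays. A hedged sketch (the struct, fields and kernel are made up):

```c++
#include <CL/sycl.hpp>

// one buffer per field: the data stays laid out as a structure of arrays
struct ParticlesSoA {
  cl::sycl::buffer<float, 1> x;
  cl::sycl::buffer<float, 1> y;
  cl::sycl::buffer<float, 1> pt;
};

void scale_pt(cl::sycl::queue& queue, ParticlesSoA& particles, size_t n) {
  queue.submit([&](cl::sycl::handler& cgh) {
    // one accessor per field is captured by the kernel lambda
    auto x  = particles.x.get_access<cl::sycl::access::mode::read>(cgh);
    auto y  = particles.y.get_access<cl::sycl::access::mode::read>(cgh);
    auto pt = particles.pt.get_access<cl::sycl::access::mode::read_write>(cgh);
    cgh.parallel_for<class scale_pt_kernel>(
        cl::sycl::range<1>(n), [=](cl::sycl::id<1> i) {
          pt[i] = pt[i] * (x[i] * x[i] + y[i] * y[i]);
        });
  });
}
```

The downside is that every field becomes a separate requisite and a separate kernel argument, instead of a single device pointer as in CUDA.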
It seems the "sub-groups" and "device-side enqueue" features were supposed to be in SYCL 2.2 ...
@makortel FYI
Thanks. Below I'm mostly thinking out loud.
3.6.9 The host accessor does not necessarily copy back to the same host memory as initially given by the user
So it doesn't seem possible to support concurrent, atomic operations between the host and the device (does CUDA managed memory support them?)
I don't know, but I really hope we don't need them (sounds like a potential slowdown).
3.6.5.2 Synchronization between work-items in a single work-group is achieved using a work-group barrier. [...] Note that the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all.
Does CUDA support partial synchronization within cooperative groups?
Does __syncthreads(), as a barrier for the threads in a block, count?
3.10 Sharing data structures between host and device code imposes certain restrictions, such as use of only user defined classes that are C++11 standard layout classes for the data structures, and in general, no pointers initialized for the host can be used on the device. ...
CUDA definitely supports hierarchical structures based on pointers, either via a chain of cudaMalloc calls, or via managed memory.
I'm hoping we would not need such data structures, but I can also imagine we could easily have cases where such structures would be needed. To me this point is sort of a double-edged sword: on one hand it is restrictive; on the other hand, I suppose SYCL would be the way for us to run on certain GPUs, so if we want to do that we would have to accept this restriction.
Further, OTOH, if we were to use a "higher-level" abstraction than SYCL, without such a restriction for the non-SYCL backends, we could easily start with the hardware that needs SYCL by just dropping the modules that need hierarchical structures.
So it doesn't seem possible to support concurrent, atomic operations between the host and the device (does CUDA managed memory support them?)
I don't know, but I really hope we don't need them (sounds like a potential slowdown).
According to the documentation CUDA supports system-wide atomic operations, starting from Pascal (sm 6.x GPU) and Xavier (sm 7.2 SoC):
Compute capability 6.x introduces new type of atomics which allows developers to widen or narrow the scope of an atomic operation. For example, atomicAdd_system guarantees that the instruction is atomic with respect to other CPUs and GPUs in the system.
3.6.5.2 Synchronization between work-items in a single work-group is achieved using a work-group barrier. [...] Note that the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all.
Does CUDA support partial synchronization within cooperative groups?
Does __syncthreads(), as a barrier for the threads in a block, count?
That corresponds to the SYCL workgroup barrier.
According to the documentation, cooperative groups should allow for different granularities. Unfortunately the documentation is a bit vague, so it's not clear, for example, whether this is allowed:
```
if (...) {
  // only the threads that take this branch become part of the coalesced group
  auto active = cooperative_groups::coalesced_threads();
  ...
  active.sync();  // synchronise only those threads
}
```
CUDA definitely supports hierarchical structures based on pointers, either via a chain of cudaMalloc calls, or via managed memory.
I'm hoping we would not need such data structures, but I can also imagine we could easily have cases where such structures would be needed. To me this point is sort of a double-edged sword: on one hand it is restrictive; on the other hand, I suppose SYCL would be the way for us to run on certain GPUs, so if we want to do that we would have to accept this restriction.
It seems Intel is adding some extensions to SYCL for its own compiler and GPUs: https://github.com/intel/llvm/blob/sycl/sycl/ReleaseNotes.md . For example:
- Raw pointers capturing added to the SYCL device front-end compiler. This capability is required for Unified Shared Memory feature implementation.
- New attributes for Intel FPGA device are added [...]
So our baseline may actually be a superset of SYCL 1.2.1 (or a new SYCL version).
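For illustration, a sketch of what such pointer-based code looks like with the USM extension. This uses the syntax that was later standardized in SYCL 2020, so it is not valid SYCL 1.2.1 and the details may differ from the Intel extension as it stands today:

```c++
#include <sycl/sycl.hpp>

int main() {
  sycl::queue queue;
  constexpr size_t n = 1024;
  // shared allocation, accessible from both host and device
  // (conceptually similar to CUDA managed memory)
  float* data = sycl::malloc_shared<float>(n, queue);
  for (size_t i = 0; i < n; ++i)
    data[i] = 1.f;
  // the raw pointer is captured directly by the kernel, with no accessors involved
  queue.parallel_for(sycl::range<1>(n),
                     [=](sycl::id<1> i) { data[i] += 1.f; }).wait();
  sycl::free(data, queue);
  return 0;
}
```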
Thanks for the clarifications.
It seems Intel is adding some extensions to SYCL for its own compiler and GPUs: https://github.com/intel/llvm/blob/sycl/sycl/ReleaseNotes.md . For example:
- Raw pointers capturing added to the SYCL device front-end compiler. This capability is required for Unified Shared Memory feature implementation.
- New attributes for Intel FPGA device are added [...]
So our baseline may actually be a superset of SYCL 1.2.1 (or a new SYCL version).
Interesting. It makes me feel even more strongly that, for the time being, it might be better not to commit to SYCL for all platforms, but to keep it specific to Intel (and adjust if/when the landscape changes).
Some more details:
I have not read them, but it looks like Intel's SYCL will have pointers and the equivalent of CUDA Unified Memory ...
Other useful extensions for us
In progress in the pixel track standalone code base:
- sycltest directory of the main repository
- sycl directory in this sycl branch
From https://www.khronos.org/sycl/:
Specifications:
Implementations: