cms-patatrack / cmssw

CMSSW fork of the Patatrack project
https://patatrack.web.cern.ch/patatrack/index.html
Apache License 2.0

Investigate the use of SYCL #376

Open fwyzard opened 5 years ago

fwyzard commented 5 years ago

From https://www.khronos.org/sycl/:

SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL that enables code for heterogeneous processors to be written in a “single-source” style using completely standard C++. SYCL single-source programming enables the host and kernel code for an application to be contained in the same source file, in a type-safe way and with the simplicity of a cross-platform asynchronous task graph.
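For illustration, a minimal sketch of that single-source style using the SYCL 1.2.1 buffer/accessor API (a toy vector addition, not taken from any real code; the kernel name vector_add and the buffer names are just placeholders):

#include <CL/sycl.hpp>
#include <vector>

int main() {
  std::vector<float> a(1024, 1.f), b(1024, 2.f), c(1024, 0.f);
  {
    cl::sycl::queue queue;                                                    // default device selection
    cl::sycl::buffer<float, 1> buf_a(a.data(), cl::sycl::range<1>(a.size()));
    cl::sycl::buffer<float, 1> buf_b(b.data(), cl::sycl::range<1>(b.size()));
    cl::sycl::buffer<float, 1> buf_c(c.data(), cl::sycl::range<1>(c.size()));
    queue.submit([&](cl::sycl::handler &cgh) {
      auto ra = buf_a.get_access<cl::sycl::access::mode::read>(cgh);
      auto rb = buf_b.get_access<cl::sycl::access::mode::read>(cgh);
      auto wc = buf_c.get_access<cl::sycl::access::mode::discard_write>(cgh);
      // host code and device kernel live in the same C++ source file
      cgh.parallel_for<class vector_add>(cl::sycl::range<1>(a.size()),
                                         [=](cl::sycl::id<1> i) { wc[i] = ra[i] + rb[i]; });
    });
  }  // leaving the scope destroys the buffers and copies the result back to the host vectors
}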

Specifications:

Implementations:

fwyzard commented 5 years ago

Comments based on a first (ongoing) reading of the specification, version 1.2.1 revision 5:

fwyzard commented 5 years ago

@makortel FYI

makortel commented 5 years ago

Thanks. Below I'm mostly thinking out loud.

3.6.9 The host accessor does not necessarily copy back to the same host memory as initially given by the user

So it doesn't seem possible to support concurrent atomic operations between the host and the device (does CUDA managed memory support them?)

I don't know, but I really hope we don't need them (sounds like potential slowdown).

3.6.5.2 Synchronization between work-items in a single work-group is achieved using a work-group barrier. [...] Note that the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all.

Does CUDA support partial synchronization within cooperative groups?

Does __syncthreads(), as a barrier for the threads in a block, count?

3.10 Sharing data structures between host and device code imposes certain restrictions, such as use of only user defined classes that are C++11 standard layout classes for the data structures, and in general, no pointers initialized for the host can be used on the device. ...

CUDA definitely supports hierarchical structures based on pointers, either via a chain of cudaMalloc calls, or via managed memory.

I'm hoping we will not need such data structures, but I can also imagine we could easily have cases where they are needed. To me this point is a double-edged sword: on one hand it is restrictive; on the other hand, I suppose SYCL would be the way for us to run on certain GPUs, so if we want to do that we would have to accept this restriction.

Then again, if we used a "higher-level" abstraction than SYCL, one without such a restriction for the non-SYCL backends, we could easily start on the hardware that requires SYCL by simply dropping the modules that need hierarchical structures.
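(For reference, a rough sketch of what such a pointer-based hierarchy looks like on the CUDA side, using managed memory so that the same pointers are valid on the host and on the device; the Tracks struct and its fields are purely illustrative.)

#include <cuda_runtime.h>

// a simple structure containing device-visible pointers (illustrative only)
struct Tracks {
  float *pt;
  float *eta;
  int    n;
};

__global__ void scale_pt(Tracks const *tracks, float factor) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < tracks->n)
    tracks->pt[i] *= factor;                      // follow the embedded pointer on the device
}

int main() {
  Tracks *tracks = nullptr;
  cudaMallocManaged(&tracks, sizeof(Tracks));     // the struct itself lives in managed memory
  tracks->n = 1000;
  cudaMallocManaged(&tracks->pt,  tracks->n * sizeof(float));
  cudaMallocManaged(&tracks->eta, tracks->n * sizeof(float));
  for (int i = 0; i < tracks->n; ++i) tracks->pt[i] = 1.f;   // same pointers usable on the host
  scale_pt<<<(tracks->n + 255) / 256, 256>>>(tracks, 2.f);
  cudaDeviceSynchronize();
  cudaFree(tracks->eta);
  cudaFree(tracks->pt);
  cudaFree(tracks);
  return 0;
}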

fwyzard commented 5 years ago

So it doesn't seem possible to support concurrent atomic operations between the host and the device (does CUDA managed memory support them?)

I don't know, but I really hope we don't need them (sounds like potential slowdown).

According to the documentation, CUDA supports system-wide atomic operations starting from Pascal (sm 6.x GPUs) and Xavier (the sm 7.2 SoC):

Compute capability 6.x introduces new type of atomics which allows developers to widen or narrow the scope of an atomic operation. For example, atomicAdd_system guarantees that the instruction is atomic with respect to other CPUs and GPUs in the system.
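A rough sketch of how that looks in practice (the counter and kernel names are just illustrative; this needs to be built for compute capability 6.0 or newer, e.g. nvcc -arch=sm_60):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *counter) {
  // atomic with respect to the other GPUs and the CPUs in the system
  atomicAdd_system(counter, 1);
}

int main() {
  int *counter = nullptr;
  cudaMallocManaged(&counter, sizeof(int));     // visible to both host and device
  *counter = 0;
  increment<<<4, 256>>>(counter);
  // on Pascal+ the host could in principle update the same location concurrently
  // using host-side atomics; here we simply wait for the kernel and read the result
  cudaDeviceSynchronize();
  std::printf("counter = %d\n", *counter);      // expect 1024
  cudaFree(counter);
  return 0;
}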

3.6.5.2 Synchronization between work-items in a single work-group is achieved using a work-group barrier. [...] Note that the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all.

Does CUDA support partial synchronization within cooperative groups?

Does __syncthreads(), as a barrier for the threads in a block, count?

That corresponds to the SYCL work-group barrier.

According to the documentation, cooperative groups should allow for different granularities. Unfortunately the documentation is a bit vague, so it's not clear, for example, whether something like this is allowed:

if (...) {
    // only the threads that take this branch are part of the coalesced group
    auto active = cooperative_groups::coalesced_threads();
    ...
    // synchronize just the threads in the coalesced group, not the whole block
    active.sync();
}
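For comparison, the corresponding SYCL 1.2.1 work-group barrier is exposed through nd_item and, per the quoted restriction, must be reached by every work-item of the work-group, so it cannot sit inside a divergent branch. A rough sketch (the kernel name and sizes are just placeholders):

#include <CL/sycl.hpp>

int main() {
  cl::sycl::queue queue;
  cl::sycl::buffer<int, 1> buf(cl::sycl::range<1>(1024));
  queue.submit([&](cl::sycl::handler &cgh) {
    auto data = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
    cgh.parallel_for<class wg_barrier>(
        cl::sycl::nd_range<1>(cl::sycl::range<1>(1024), cl::sycl::range<1>(256)),
        [=](cl::sycl::nd_item<1> item) {
          // first phase: every work-item writes its own element
          data[item.get_global_id(0)] = static_cast<int>(item.get_local_id(0));
          // work-group barrier: must be executed by all work-items of the group,
          // i.e. it may not be placed in a branch taken by only some of them
          item.barrier();
          // second phase: the work-group can now safely read each other's elements
          // ...
        });
  });
  queue.wait();
}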

CUDA definitely supports hierarchical structures based on pointers, either via a chain of cudaMalloc calls, or via managed memory.

I'm hoping we will not need such data structures, but I can also imagine we could easily have cases where they are needed. To me this point is a double-edged sword: on one hand it is restrictive; on the other hand, I suppose SYCL would be the way for us to run on certain GPUs, so if we want to do that we would have to accept this restriction.

It seems Intel is adding some extensions to SYCL for its own compiler and GPUs: https://github.com/intel/llvm/blob/sycl/sycl/ReleaseNotes.md . For example:

  • Raw pointers capturing added to the SYCL device front-end compiler. This capability is required for Unified Shared Memory feature implementation.
  • New attributes for Intel FPGA device are added [...]

So our baseline may actually be a superset of SYCL 1.2.1 (or a new SYCL version).

makortel commented 5 years ago

Thanks for the clarifications.

It seems Intel is adding some extensions to SYCL for its own compiler and GPUs: https://github.com/intel/llvm/blob/sycl/sycl/ReleaseNotes.md . For example:

  • Raw pointers capturing added to the SYCL device front-end compiler. This capability is required for Unified Shared Memory feature implementation.
  • New attributes for Intel FPGA device are added [...]

So our baseline may actually be a superset of SYCL 1.2.1 (or a new SYCL version).

Interesting. It makes me feel even more strongly that, for the time being, it might be better not to commit to SYCL for all platforms, but to keep it specific to Intel (and adjust if/when the landscape changes).

fwyzard commented 5 years ago

Some more details:

I have not read them, but it looks like Intel's SYCL will have pointers and the equivalent of CUDA Unified Memory ...
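As a rough sketch, assuming the unified shared memory interface that was later standardized in SYCL 2020 (so not part of the 1.2.1 baseline, and Intel's extension may differ in the details), pointer-based code would look something like:

#include <sycl/sycl.hpp>

int main() {
  sycl::queue queue;
  // "shared" allocation, accessible from both host and device, like CUDA managed memory
  float *data = sycl::malloc_shared<float>(1024, queue);
  for (int i = 0; i < 1024; ++i) data[i] = float(i);
  // the kernel dereferences the raw pointer directly, no buffers or accessors involved
  queue.parallel_for(sycl::range<1>(1024),
                     [=](sycl::id<1> i) { data[i] *= 2.f; })
       .wait();
  sycl::free(data, queue);
}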

fwyzard commented 4 years ago

Other useful extensions for us:

fwyzard commented 3 years ago

In progress in the pixel track standalone code base: