awesome work, thanks for the very helpful rundown. Now we can check for possible show stoppers and clumsy workarounds. For HC another question would be whether we would gain anything from it compared to HIP. SYCL would be very interesting for us, although it is not clear at the moment whether triSYCL's extensions to SYCL for FPGAs are mandatory (@j-stephan correct me if I am wrong). Maybe workarounds or specializations for FPGA code parts can help, though.
I've just had a look at the extensions that are bundled with Xilinx' triSYCL implementation.
All in all it looks like using these extensions is not mandatory, but probably a good idea in order to achieve better performance.
Btw, you can vote for SYCL support in MSVS here: https://developercommunity.visualstudio.com/idea/490213/sycl-support.html
I am currently facing some issues with clashing concepts between SYCL and Alpaka. I had some offline discussion with @tdd11235813 about this today, too.
The main problem we both see is Alpaka's reliance on raw pointers:
```c++
struct Kernel {
    /* ... */
    template</*...*/>
    ALPAKA_FN_ACC auto operator()(
        TAcc const & acc,
        TElem const * const A, /* ... */) const -> void;
};

auto const taskKernel(alpaka::kernel::createTaskKernel<Acc>(
    workDiv,
    kernel,
    alpaka::mem::view::getPtrNative(bufAccA), /*...*/));
```
This is probably inherited from CUDA and has worked well so far with the existing backends; however, passing pointers to SYCL kernels is a big no-no:

> It is illegal to pass a pointer or reference argument to a kernel. -- SYCL Spec., 4.8.10
In SYCL an equivalent kernel would look like this:
```c++
template</*...*/>
struct Kernel
{
    cl::sycl::accessor<TElem, 1, cl::sycl::access::mode::read> bufAccA;

    auto operator()(cl::sycl::nd_item<1>) const
    { /* Kernel function */ }
};

auto bufA = cl::sycl::buffer<TElem, 1>{/* ... */};
auto queue = cl::sycl::queue{/* ... */};

// Alternative 1
queue.submit([&](cl::sycl::handler& cgh)
{
    // command group lives inside this scope
    auto bufAccA1 = bufA.get_access<cl::sycl::access::mode::read>(cgh);
    auto my_kernel = Kernel{bufAccA1};
    cgh.parallel_for(cl::sycl::nd_range<1>{/* dimensions */}, my_kernel);
});

// Alternative 2
auto bufAccA2 = cl::sycl::accessor<TElem, 1,
                                   cl::sycl::access::mode::read,
                                   cl::sycl::access::target::global_buffer,
                                   cl::sycl::access::placeholder::true_t>{bufA};
queue.submit([&](cl::sycl::handler& cgh)
{
    // another command group lives in this scope
    cgh.require(bufAccA2); // bind buffer to this command group
    auto my_kernel = Kernel{bufAccA2};
    cgh.parallel_for(cl::sycl::nd_range<1>{/* dimensions */}, my_kernel);
});
```
Instead of pointers a SYCL kernel requires `accessor`s. These are bound to a device-side buffer inside a command group (= the SYCL equivalent of Alpaka's tasks) and passed by value to the kernel. Internally they encapsulate the corresponding OpenCL C pointer.
As we all know, Alpaka kernels aren't actually kernels but device-side functions. So why don't we just use the OpenCL C pointer instead and somehow hide the accessor in the backend? Because we can't:
> If an `accessor` has the access target `access::target::global_buffer`, `access::target::constant_buffer`, `access::target::local`, `access::target::image` or `access::target::image_array` then it is considered a device accessor, and therefore can only be used within a SYCL kernel function [...] -- SYCL Spec., 4.7.3.6
Since the `accessor`'s method for acquiring the OpenCL C pointer is only available when `accessTarget == access::target::global_buffer` or `accessTarget == access::target::constant_buffer`, we are unable to call it from the host; Alpaka's `getPtrNative()` function will simply never work with SYCL. We could hide the `accessor` itself behind `getPtrNative()`, but in this case the user won't actually acquire the pointer he might expect (to do pointer arithmetic or whatever).
What we can do is cast an accessor to a pointer inside the kernel. However, I don't see a way to deduce an accessor on the host side from the device pointer in the Alpaka signature above, since it is missing size information as well as address information (global vs. shared memory, for example) - never mind that accessing this device-side pointer in any way on the host is likely completely illegal in SYCL.
I'd be thankful for any input on this; for now I'm working on mirroring SYCL's accessor concepts in Alpaka for my SYCL prototype, e.g. `alpaka::mem::access::getAccess` instead of `alpaka::mem::view::getPtrNative`. This would return an Alpaka accessor, encapsulating a SYCL accessor in my backend and (for example) a CUDA pointer in the CUDA backend.
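For illustration, a minimal sketch of what such a pair of backend-specific accessors could look like; the names (`AccessorCuda`, `AccessorSycl`) are made up for this example and are not part of alpaka or the prototype:

```c++
#include <cstddef>
#include <alpaka/alpaka.hpp>
#include <CL/sycl.hpp>

// Hypothetical sketch only - not actual alpaka API.
template<typename TElem>
struct AccessorCuda
{
    TElem* ptr; // the CUDA backend can simply wrap the raw device pointer

    ALPAKA_FN_ACC auto operator[](std::size_t i) const -> TElem&
    {
        return ptr[i];
    }
};

template<typename TElem>
struct AccessorSycl
{
    // the SYCL backend wraps a device accessor instead
    cl::sycl::accessor<TElem, 1, cl::sycl::access::mode::read_write> acc;

    auto operator[](std::size_t i) const -> TElem&
    {
        return acc[i];
    }
};
```

A kernel would then take such an accessor argument instead of `TElem const * const`.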
Since SYCL is similar to OpenCL (e.g. abstract cl_mem objects), the missing piece for the cast-accessor-to-pointer workaround would be the redefinition of the kernel signature on the host side, so we know the types of the pointers we have to cast to on the device side. In OpenCL you do this via `clSetKernelArg` on the host side.
In TaskKernelSycl we actually know the types passed via createTaskKernel, but when we just get a SYCL accessor, the type information is lost. A workaround could be to put that accessor in another wrapper, where we collect the type information ourselves.

If it works, options like target, read, read_write, or write still cannot be set explicitly and not all can be derived. But in this case, const vs non-const might be enough to distinguish read vs read-write.
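As a rough illustration of that last point (a hypothetical helper, not existing alpaka or prototype code), the access mode could be derived from the const-ness of the element type seen in the createTaskKernel call:

```c++
#include <type_traits>
#include <CL/sycl.hpp>

// Hypothetical helper: const element type -> read access, otherwise read_write.
template<typename TElem>
constexpr cl::sycl::access::mode accessModeFor()
{
    return std::is_const<TElem>::value
        ? cl::sycl::access::mode::read
        : cl::sycl::access::mode::read_write;
}
```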
Jan's current approach with the explicit Alpaka accessor design would change the kernel interface so that pointers are forbidden for all backends.

Another workaround could come with function traits, so you would not need the wrapper for the accessor and could get the types from the kernel function directly, which also gives you the const/non-constness of the arguments. Not sure if this would work, I never tried to implement function traits.
https://functionalcpp.wordpress.com/2013/08/05/function-traits/ https://stackoverflow.com/questions/32036556/iterating-through-function-traits-arguments/32036651#32036651
Edit: will not work because of the templates, see below.
@j-stephan thanks for the detailed explanation. I have a couple of questions regarding your code example. Sorry if they are stupid, I am not familiar with SYCL technicalities:

* does `Kernel` miss a constructor that takes a `cl::sycl::accessor<TElem, 1, cl::sycl::access::mode::read>` by value and copies it to `bufAccA` (I suppose so)?
* are `bufAccA1` and `bufAccA2` of the same data type between each other and with the `Kernel::bufAccA`?

@tdd11235813 I believe function traits are probably tricky to use (or even worse) for Alpaka kernels, as their `operator()` are not functions but function templates, since at least the accelerator type is normally a template parameter.
ok, I thought to instantiate the function template with the type parameter pack we get from the createTaskKernel call. In this cumbersome way you would get the const/non-constness and the remaining types from the function traits.

Never mind, function traits will not work because of the templates and because of the differing types - accessors in the createTaskKernel call vs. raw pointers in the kernel function :/
That's what I originally thought. However, after some more thinking your point is probably valid and maybe with some wrapping and coding it is possible to get the info we need.
you mean to combine the two ideas, a wrapper for the SYCL accessor with the type information and instantiating the function template with the underlying types to construct the function traits?

* `getNativePtr()` returns a wrapper in the SYCL case (pointer arithmetic can be defined on the wrapper)
* `createTaskKernel` passes the args to TaskKernelSycl and to the entry kernel; on the host our wrapper types probably have to be parsed at compile time, also checking for const/non-constness via instantiated function template traits, to define the accessor properties (ok, probably too late, because we already have an accessor defined before)

Edit: @sbastrakov take your time, I am off for a few hours now.
@tdd11235813 no, just had an offline discussion with @psychocoderHPC, writing a summary. Edit: actually we realized that we missed an important detail, and so our solution only looked easier because of that. Accounting for everything would lead to basically the same as you just described. I will still write our thoughts separately in the next message.
@sbastrakov
* does `Kernel` miss a constructor that takes a `cl::sycl::accessor<TElem, 1, cl::sycl::access::mode::read>` by value and copies it to `bufAccA` (I suppose so)?
No, I'm just initializing the struct member directly. But it is a copy-by-value, yes.
* are `bufAccA1` and `bufAccA2` of the same data type between each other and with the `Kernel::bufAccA`?
That is an oversight on my side, sorry. `bufAccA1` and `bufAccA2` are actually different types. An `accessor` has the following definition:
```c++
namespace cl
{
    namespace sycl
    {
        template <typename dataT, int dimensions, access::mode accessmode,
                  access::target accessTarget = access::target::global_buffer,
                  access::placeholder isPlaceholder = access::placeholder::false_t>
        class accessor
        { /* ... */ };
    }
}
```
`bufAccA1` and `bufAccA2` are different specializations: `bufAccA1` with `access::placeholder::false_t` and `bufAccA2` with `access::placeholder::true_t`. `bufAccA1` and `Kernel::bufAccA` are of the same type. In order to work with `bufAccA2` the Kernel accessor would need to match `bufAccA2`'s definition (as seen in Alternative 2).
Just had an offline discussion with @psychocoderHPC about the pointers and SYCL accessors. What we eventually came to is similar to the earlier @tdd11235813's message.
We believe it is rather desirable to still (be able to) use raw pointers in kernels that a user writes. That is closer to the CUDA way, and also more straightforward for most people. With the SYCL requirements this, of course, requires alpaka to do the wrappers and conversions.
As justly pointed out, a single raw pointer to device memory is not enough to retrieve a SYCL or alpaka buffer. There are two alternative ways to get around it:

a) Modify (and appropriately rename) `getNativePtr()` to return an object with additional information, from which one can obtain a pointer, maybe even implicitly convertible to a pointer.

b) Stay with the raw pointers and use bookkeeping for mapping between pointers and buffers. This is vulnerable to pointer arithmetic done by a user, but that is probably not a good thing to do anyway, and we can at least detect it and throw exceptions. We already use similar bookkeeping in cupla.

Option b) can be done without any change in user code. Option a) theoretically introduces changes; however, in the typical scenario of `getNativePtr()` being called just to launch a kernel, no change is needed either once proper type conversions are provided by alpaka. I feel both are viable. I personally hate raw pointers and would prefer a) long-term, but maybe for now b) is better as it does not break anything.
Then the workflow is as follows: identify the kernel arguments that correspond to `getNativePtr()`. I have mistakenly called this "tricky or worse", but now it seems to be rather straightforward actually: just filter the types in the parameter pack depending on traits and remember their positions in the pack. Then we retrieve the buffers from the pointers / new types with one of the two mechanisms described above and create accessors.

Edit: it looks like that approach does not allow using statically allocated shared memory (in CUDA terms) in kernels, but for dynamically allocated shared memory I believe @psychocoderHPC has some idea.
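To make the bookkeeping idea (option b) a bit more concrete, here is a minimal sketch under the assumption that every buffer allocation registers its base pointer and size; all names (`BufferRegistry`, `registerBuffer`, `find`) are made up for this example:

```c++
#include <cstddef>
#include <map>
#include <stdexcept>

// Hypothetical bookkeeping: map a raw device pointer back to its allocation.
struct BufferRegistry
{
    struct Entry { void* base; std::size_t sizeBytes; /* plus a handle to the buffer */ };
    std::map<void*, Entry> entries; // keyed by base pointer

    void registerBuffer(void* base, std::size_t sizeBytes)
    {
        entries[base] = Entry{base, sizeBytes};
    }

    // Find the allocation containing ptr; throw if ptr belongs to no registered buffer.
    Entry const& find(void* ptr) const
    {
        auto it = entries.upper_bound(ptr); // first entry with base > ptr
        if(it == entries.begin())
            throw std::runtime_error("pointer does not belong to a registered buffer");
        --it; // last entry with base <= ptr
        auto const& e = it->second;
        auto const offset = static_cast<char*>(ptr) - static_cast<char*>(e.base);
        if(offset < 0 || static_cast<std::size_t>(offset) >= e.sizeBytes)
            throw std::runtime_error("pointer does not belong to a registered buffer");
        return e;
    }
};
```

This is the point where offsets produced by user pointer arithmetic could be detected and rejected with an exception, as mentioned above.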
@j-stephan thanks for the clarifications. I assumed the constructor due to `auto my_kernel = Kernel{bufAccA1};`, but no need for further explanations on that.
Sorry for a wall of text in my previous message. I think this is close to what you and @tdd11235813 described as well, plus just the bookkeeping option. I tried to write in detail so that some non-obvious issues may become more clear.
Thanks for the input, I'll give the bookkeeping option a try. FYI: implicit accessor-to-pointer casting doesn't work, this seems to be an oversight in the SYCL specification (see triSYCL/triSYCL#247). We have a `get_pointer` method in `accessor`'s interface which should do the job, though.
I'm facing two more obstacles on which I'd appreciate some input :-)
1. In order to do synchronization on the block level, users call `alpaka::block::sync::syncBlockThreads`. Interestingly, this is one of the few places where Alpaka doesn't provide the `TDim` template parameter. Unfortunately I need `TDim` for synchronization with SYCL because the equivalent to `__syncthreads` is a member of `cl::sycl::nd_item<int dimensions>`. The obvious solution would be the addition of `TDim` to the block-synchronization layer, but there might be something I'm missing. And this would of course propagate this parameter to the other backends which don't need it AFAIK, thus increasing complexity. Thoughts?

2. Shared memory is a severe problem. Currently Alpaka allows users to allocate shared memory inside the kernel (`alpaka::block::shared::st::allocVar`), which is of course perfectly legal with CUDA. However, this is impossible with SYCL (except for a special type of kernel execution in which we lose block-synchronization semantics); the shared memory must be defined as an `accessor` before the kernel launch (a sketch follows below). I thought of default-allocating a block of shared memory for each kernel (even those which don't need it), but this isn't really efficient and might have some unwanted side effects (it could affect the layout of an FPGA bitstream, for example). The other solution I can think of is to make in-kernel shared memory allocations illegal and force the user to define his shared memory beforehand, thus mirroring SYCL's behaviour (and breaking Alpaka's API). I'm out of ideas here, maybe someone else has an idea for a better solution.
not sure if there are any news on this yet.

Maybe it can be done as an optional parameter, so other implementations will not break.
AFAIK this would still break the API. Default template parameters still need empty brackets:
```c++
template <typename T = int>
class foo {};

auto my_foo = foo<>{};
```
Unfortunately I need `TDim` as a compile-time parameter; passing it by value to the constructor won't work.
The problem with block synchronization is solved. Surprisingly, it was sufficient to pass `TDim` as a template parameter to `BlockSyncSycl`; this worked for the SYCL case and didn't break anything else.
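Roughly speaking (a simplified sketch, not the actual implementation), `TDim` is needed because the dimensionality is part of the `nd_item` type:

```c++
#include <CL/sycl.hpp>

// Simplified sketch: TDim (e.g. an integral-constant-like alpaka dimension type)
// fixes the nd_item type at compile time so the barrier can be called on it.
template<typename TDim>
class BlockSyncSycl
{
public:
    explicit BlockSyncSycl(cl::sycl::nd_item<static_cast<int>(TDim::value)> item)
        : m_item{item}
    {}

    void syncBlockThreads() const
    {
        m_item.barrier(cl::sycl::access::fence_space::local_space);
    }

private:
    cl::sycl::nd_item<static_cast<int>(TDim::value)> m_item;
};
```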
The problem with static shared memory still persists. Right now the only way seems to be pre-allocation of shared memory and then handing out pointers as needed. I'll get in touch with SYCL spec. committee members about this, perhaps they know of a solution.
Atomics will be a problem, too. SYCL atomics require the user to specify the memory layer to operate on, e.g. `cl::sycl::atomic<int, cl::sycl::access::address_space::global_space>`. Alpaka is passing raw pointers around and the SYCL compiler seems unable to deduce which layer they reside on (tested with ComputeCpp). Right now I'm implementing everything as global atomics, but this won't work in the long term. This is something to pass on to the SYCL crowd, too, I believe.
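For illustration, this is what the address-space-aware atomic interface looks like in plain SYCL 1.2.1 (a hypothetical helper, not alpaka code):

```c++
#include <CL/sycl.hpp>

// The address space is part of the atomic's type, so a raw pointer alone does
// not carry enough information to pick the right specialization.
void atomicAddGlobalInt(cl::sycl::global_ptr<int> p, int value)
{
    auto a = cl::sycl::atomic<int, cl::sycl::access::address_space::global_space>{p};
    cl::sycl::atomic_fetch_add(a, value);
}
```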
thanks for the wrap-up. Just want to add that we also discussed the options we might have on the Alpaka side in the long term. Changing the raw-pointer interface in Alpaka in favor of buffer types would allow more expressive types and would help with the issues we have in SYCL.

Regarding static shared memory within kernels, it seems it simply does not exist in SYCL. @j-stephan also thought about interop with a device OpenCL kernel, where static shared memory exists, grabbing the shared memory address. It is not clear yet whether and how this would work. For the current work we can stick with dynamic shared memory (the application code will be modified).
> Changing the raw-pointer interface in Alpaka in favor of buffer types would allow more expressive types and would help with the issues we have in SYCL.
I'm in favor of this. :-) In the meantime I have submitted an issue about raw pointers to the SYCL specification repository, please see the discussion there: KhronosGroup/SYCL-Docs#21
> Regarding static shared memory within kernels, it seems it simply does not exist in SYCL.
I'm still not sure why, we might get an answer here: KhronosGroup/SYCL-Docs#20
> @j-stephan also thought about interop with a device OpenCL kernel, where static shared memory exists, grabbing the shared memory address. It is not clear yet whether and how this would work.
@DuncanMcBain from Codeplay informed me that this will likely not work on non-CPU devices, so basically everything we care about (with SYCL). I invited him to join the discussion here, so we might get some input from the SYCL point of view in the future. :-)
By the way, both CUDA and SYCL support half-precision floating point operations. Do we support a `half` type in Alpaka, too? I couldn't find any reference to a half type using the GitHub search for this repository.
@tdd11235813 and I just had a discussion about the support for different SYCL devices. Right now all Alpaka platforms are tied to specific device types (CUDA -> GPU, HIP -> GPU, OpenMP -> CPU, and so on). SYCL supports different device types and the user will probably want to choose his preferred device with Alpaka. Right now there is no mechanism to do this in Alpaka. I thought about using a `static` function in the platform layer which returns all platforms supported by this backend (which would be exactly 1 for CUDA, HIP, OpenMP and >= 1 for SYCL). The user can then iterate over the platforms (= the OpenCL runtimes) and associated devices to make his choice. But maybe there is a better way to handle this?
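For illustration, the plain SYCL enumeration such a function could build on looks like this (the helper name `enumerateSyclDevices` is made up):

```c++
#include <vector>
#include <CL/sycl.hpp>

// List every device of every available SYCL platform so the user can pick one.
std::vector<cl::sycl::device> enumerateSyclDevices()
{
    std::vector<cl::sycl::device> result;
    for(auto const& platform : cl::sycl::platform::get_platforms())
        for(auto const& device : platform.get_devices())
            result.push_back(device);
    return result;
}
```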
> By the way, both CUDA and SYCL support half-precision floating point operations. Do we support a `half` type in Alpaka, too? I couldn't find any reference to a half type using the GitHub search for this repository.
Currently, we have no alpaka-provided half-precision type. With half precision in CUDA you also have the problem that there are no overloaded operators for `half` available.
> Right now all Alpaka platforms are tied to specific device types (CUDA -> GPU, HIP -> GPU, OpenMP -> CPU, and so on). SYCL supports different device types and the user will probably want to choose his preferred device with Alpaka. Right now there is no mechanism to do this in Alpaka.
Good point. That's also something a potential OpenMP 4.5/5.0 offloading (target) backend would need to know (afaik, that's usually a compiler flag for OpenMP offloading targets).
> The main problem we both see is Alpaka's reliance on raw pointers
Intel is working on a proposal to add pointers to OpenCL and SYCL, with support for unified addressing and unified memory: usm.adoc.
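To illustrate what that would buy us, here is a rough sketch in the style of the current DPC++ / SYCL 2020 USM interface (the names here may differ from the linked proposal):

```c++
#include <CL/sycl.hpp>

int main()
{
    auto queue = cl::sycl::queue{};

    // a device allocation returns a plain pointer...
    int* data = cl::sycl::malloc_device<int>(1024, queue);

    // ...which can be used directly inside the kernel, no accessor needed
    queue.parallel_for(cl::sycl::range<1>{1024}, [=](cl::sycl::id<1> i)
    {
        data[i[0]] = static_cast<int>(i[0]);
    }).wait();

    cl::sycl::free(data, queue);
}
```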
If the goal of a SYCL backend is mainly to support Intel hardware, an alternative could be to use the "Level Zero" API: https://spec.oneapi.com/versions/latest/oneL0/index.html , which looks a lot like the CUDA driver API.
Thanks for the info @fwyzard !
While interesting it appears to me that we won't be able to use oneL0 anytime soon. From a quick glance at the docs I believe it requires precompiled kernels, meaning SPIR-V or the native device format. This doesn't fit our current approach.
Yes, you are right. It is more in line with OpenCL or the CUDA driver API than SYCL/oneAPI.
I am currently evaluating whether AMD's HC and Khronos' SYCL are suitable to be used as backends for alpaka. (@tdd11235813 is my supervisor in case someone wonders who I am.) This is the mapping I have come up with so far (based on the CUDA mapping available in the doc directory). Please feel free to discuss this mapping or to request feature mappings that are needed but not covered here.
### General Remarks

* alpaka functions live in the `alpaka` namespace.
* HC functions live in the `hc` namespace.
* SYCL functions live in the `cl::sycl` namespace.

### Programming Interface

#### Function Attributes

| alpaka | CUDA | HC |
| --- | --- | --- |
| `ALPAKA_FN_HOST` | `__host__ void f()` | `void f() [[cpu]]` |
| `ALPAKA_FN_ACC` | `__device__ void f()` | `void f() [[hc]]` |
| `ALPAKA_FN_ACC` | `__global__ void f()` | `void f() [[hc]]` |
| `ALPAKA_FN_HOST_ACC` | `__host__ __device__ void f()` | `void f() [[hc]] [[cpu]]` |
#### Memory

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `block::shared::st::allocVar<uint32_t, __COUNTER__>(acc)` | `__shared__` | `tile_static` | `access::target::local` |
| `ALPAKA_STATIC_ACC_MEM_CONSTANT` | `__constant__` | implied | `access::target::constant_buffer` |
| `ALPAKA_STATIC_ACC_MEM_GLOBAL` | `__device__` | unknown - global var? | |

#### Index / Work Division
| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `idx::getIdx<Block, Threads>(acc)` | `threadIdx.{x,y,z}` | `t_idx.local[dim]` | `nd_item.get_local_id(dim)` |
| `idx::getIdx<Grid, Blocks>(acc)` | `blockIdx.{x,y,z}` | `t_idx.tile[dim]` | `nd_item.get_group(dim)` |
| `workdiv::getWorkDiv<Block, Threads>(acc)` | `blockDim.{x,y,z}` | `t_ext.tile_dim[dim]` | `nd_item.get_local_range(dim)` |
| `workdiv::getWorkDiv<Grid, Blocks>(acc)` | `gridDim.{x,y,z}` | `t_ext[dim]` | `nd_item.get_group_range(dim)` |
#### Types

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `vec::Vec<TDim, TVal>` | `vec3` | `extent<dim>` or `tiled_extent<dim>` | `range<dim>` or `nd_range<dim>` |
### Runtime API

#### Device Management

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `dev::reset(device)` | `cudaDeviceReset()` | no equivalent | `device` leaves scope |
| `wait::wait(device)` | `cudaDeviceSynchronize()` | `accelerator_view::wait()` | |
| `size_t pltf::getDevCount<TPltf>()` | `int cudaGetDeviceCount()` | `vector<accelerator> accelerator::get_all()` | `vector<device> platform::get_devices()` or `vector<device> context::get_devices()` |
| `acc::getAccDevProps(dev)` | `cudaGetDeviceProperties()` | `accelerator_view::get_PROPERTY`* | `device::get_info()` |

\* HC accelerators have getter functions for each queryable property.

#### Error Handling

alpaka, HC and SYCL use exceptions; these should be easily convertible. CUDA uses a C-style error return type.
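For example, SYCL's asynchronous errors can be funneled into ordinary C++ exceptions via an `async_handler` passed to the queue (a small illustrative snippet, not part of the mapping above):

```c++
#include <exception>
#include <CL/sycl.hpp>

auto make_queue()
{
    return cl::sycl::queue{
        cl::sycl::default_selector{},
        [](cl::sycl::exception_list errors)
        {
            // rethrow asynchronous SYCL errors as ordinary C++ exceptions
            for(std::exception_ptr const& e : errors)
                std::rethrow_exception(e);
        }};
}
```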
#### Queue Management

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `queue::enqueue(queue, [](){do_something();})` | `cudaStreamAddCallback()` | `auto future = parallel_for_each(accelerator_view, extent<dim>{...}, []() { do_something();})` | `queue::submit(T cgf)`* |
| `queue = queue::QueueCudaRtAsync(device);` | `cudaStreamCreate()` | `accelerator_view = accelerator::get_default_view()` | `queue` constructor |
| `queue = queue::QueueCudaRtSync(device);` | `cudaStreamCreate()` | If you want synchronous behaviour, leave out the `auto future =` when enqueuing the kernel. The `future`'s destructor will automatically synchronize. | `queue` is async by default. Call `queue::wait_and_throw()` for synchronization. |
| `bool queue::empty(queue)` | `cudaStreamQuery()` | `completion_future::is_ready()` | |
| `void wait::wait(queue)` | `cudaStreamSynchronize()` | `completion_future::wait()`, `completion_future::wait_for(std::chrono::duration)`, `completion_future::wait_until(std::chrono::time_point)` | `queue::wait_and_throw()` |
| `void wait::wait(queue, event)` | `cudaStreamWaitEvent()` | needs workaround | |

\* `T cgf` refers to a Command Group Function object and combines the actual kernel with its requirements (e.g. memcpys). This function object has the signature `(handler& cgh)` and can be an `operator()` or a lambda.

#### Event Management
HC does not have a concept of events. A similar behavior can be achieved by using the `completion_future` returned by `parallel_for_each()`, which mirrors `std::future`'s behavior.

| alpaka | CUDA | SYCL |
| --- | --- | --- |
| `auto event = event::Event<TQueue>{dev};` | `cudaEventCreate()` | |
| `bool event::test(event)` | `cudaEventQuery()` | `info::event_command_status event::get_info<info::event::command_execution_status>()` |
| `void queue::enqueue(queue, event)` | `cudaEventRecord()` | `auto event = queue::submit()` |
| `wait::wait(event)` | `cudaEventSynchronize()` | `event::wait_and_throw()` |
#### Memory Management

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `mem::buf::alloc<TElement>(device, extents1D)` | `cudaMalloc()` | `array<T, 1>{extent<1>{size}, accelerator_view}` | `buffer<T>{range<1>{size}}` |
| `mem::buf::alloc<TElement>(device, extents3D)` | `cudaMalloc3D()` | `array<T, 3>{extent<3>{size0, size1, size2}, accelerator_view}` | `buffer<T>{range<3>{size0, size1, size2}}` |
| `mem::buf::alloc<TElement>(device, extents)` | `cudaMallocHost()` | `am_alloc(size, accelerator, amHostPinned)`* | |
| `mem::buf::alloc<TElement>(device, extents2D)` | `cudaMallocPitch()` | `array<T, 2>{extent<2>{size0, size1}, accelerator_view}` | `buffer<T>{range<2>{size0, size1}}` |
| `dev::getMemBytes()` | `cudaMemGetInfo()` | `accelerator::get_dedicated_memory()`** | `device::get_info<info::device::global_mem_size>()` |
| `dev::getFreeMemBytes()` | `cudaMemGetInfo()` | needs workaround | |
| `mem::view::copy(memBufDst, memBufSrc, extents1D)` | `cudaMemcpy()` | `copy(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` followed by `queue.wait_and_throw()` |
| `mem::view::copy(memBufDst, memBufSrc, extents2D)` | `cudaMemcpy2D()` | `copy(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` followed by `queue.wait_and_throw()` |
| `mem::view::copy(memBufDst, memBufSrc, extents2D, queue)` | `cudaMemcpy2DAsync()` | `copy_async(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` |
| `mem::view::copy(memBufDst, memBufSrc, extents3D)` | `cudaMemcpy3D()` | `copy(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` followed by `queue.wait_and_throw()` |
| `mem::view::copy(memBufDst, memBufSrc, extents3D, queue)` | `cudaMemcpy3DAsync()` | `copy_async(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` |
| `mem::view::copy(memBufDst, memBufSrc, extents3D)` | `cudaMemcpy3DPeer()` | `copy(src, dst)`*** | |
| `mem::view::copy(memBufDst, memBufSrc, extents3D, queue)` | `cudaMemcpy3DPeerAsync()` | `copy_async(src, dst)`*** | |
| `mem::view::copy(memBufDst, memBufSrc, extents1D, queue)` | `cudaMemcpyAsync()` | `copy_async(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` |
| `mem::view::copy(memBufDst, memBufSrc, extents1D)` | `cudaMemcpyPeer()` | `copy(src, dst)`*** | |
| `mem::view::copy(memBufDst, memBufSrc, extents1D, queue)` | `cudaMemcpyPeerAsync()` | `copy_async(src, dst)`*** | |
| `mem::view::set(memBufDst, byte, extents1D)` | `cudaMemset()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` followed by `queue.wait_and_throw()` |
| `mem::view::set(memBufDst, byte, extents2D)` | `cudaMemset2D()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` followed by `queue.wait_and_throw()` |
| `mem::view::set(memBufDst, byte, extents2D, queue)` | `cudaMemset2DAsync()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` |
| `mem::view::set(memBufDst, byte, extents3D)` | `cudaMemset3D()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` followed by `queue.wait_and_throw()` |
| `mem::view::set(memBufDst, byte, extents3D, queue)` | `cudaMemset3DAsync()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` |
| `mem::view::set(memBufDst, byte, extents1D, queue)` | `cudaMemset()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` followed by `queue.wait_and_throw()` |

\* `am_alloc` returns a raw pointer that needs to be freed with `am_free`.

\*\* Returns the value in KiB.

\*\*\* Assumed - see below.
#### Execution Control

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `kernel::exec<TAcc>(queue, workDiv, kernel, params...)`, `kernel::BlockSharedExternMemSizeBytes<TKernel<TAcc>>::getBlockSharedExternMemSizeBytes<...>(...)` | `cudaLaunchKernel()` | `parallel_for_each(accelerator_view, extent, kernel)` | `parallel_for(range<dim>(work_size), kernelFunc)` or `parallel_for_work_group(range<dim>(work_size), range<dim>(block_size), kernelFunc)` |

#### Peer Device Memory Access

While HC `accelerator`s know their peers, there are no explicit API functions to access peer memory. One can assume that P2P copies are done implicitly, but this requires further testing.

The SYCL specification does not even mention the word "peer". Some implementations might implicitly do P2P copies if available, but this requires further testing.
Edit (2018-11-07): Fixed some typos and formatting.