awesome work, thanks for the very helpful rundown. Now we can check for possible show stoppers and clumsy workarounds. For HC another question would be whether we would gain anything from it compared to HIP. SYCL would be very interesting for us, although it is not clear at the moment whether triSYCL's extensions to SYCL for FPGAs are mandatory (@j-stephan correct me if I am wrong). Maybe workarounds or specializations for FPGA code parts can help, though.
I've just had a look at the extensions that are bundled with Xilinx' triSYCL implementation.
All in all it looks like using these extensions is not mandatory, but probably a good idea in order to achieve better performance.
Btw, you can vote for SYCL support in MSVS here: https://developercommunity.visualstudio.com/idea/490213/sycl-support.html
I am currently facing some issues with clashing concepts between SYCL and Alpaka. I had some offline discussion with @tdd11235813 about this today, too.
The main problem we both see is Alpaka's reliance on raw pointers:
```c++
struct Kernel {
    /* ... */
    template</*...*/>
    ALPAKA_FN_ACC auto operator()(
        TAcc const & acc,
        TElem const * const A, /* ... */) const -> void;
};

auto const taskKernel(alpaka::kernel::createTaskKernel<Acc>(
    workDiv,
    kernel,
    alpaka::mem::view::getPtrNative(bufAccA), /*...*/));
```
This is probably inherited from CUDA and has worked well so far with the existing backends; however, passing pointers to SYCL kernels is a big no-no:

> It is illegal to pass a pointer or reference argument to a kernel. -- SYCL Spec., 4.8.10
In SYCL an equivalent kernel would look like this:
```c++
template</*...*/>
struct Kernel
{
    cl::sycl::accessor<TElem, 1, cl::sycl::access::mode::read> bufAccA;

    auto operator()(cl::sycl::nd_item<1>) const
    { /* Kernel function */ }
};

auto bufA = cl::sycl::buffer<TElem, 1>{/* ... */};
auto queue = cl::sycl::queue{/* ... */};

// Alternative 1
queue.submit([&](cl::sycl::handler& cgh)
{
    // command group lives inside this scope
    auto bufAccA1 = bufA.get_access<cl::sycl::access::mode::read>(cgh);
    auto my_kernel = Kernel{bufAccA1};
    cgh.parallel_for(cl::sycl::nd_range<1>{/* dimensions */}, my_kernel);
});

// Alternative 2
auto bufAccA2 = cl::sycl::accessor<TElem, 1,
                                   cl::sycl::access::mode::read,
                                   cl::sycl::access::target::global_buffer,
                                   cl::sycl::access::placeholder::true_t>{bufA};
queue.submit([&](cl::sycl::handler& cgh)
{
    // another command group lives in this scope
    cgh.require(bufAccA2); // bind buffer to this command group
    auto my_kernel = Kernel{bufAccA2};
    cgh.parallel_for(cl::sycl::nd_range<1>{/* dimensions */}, my_kernel);
});
```
Instead of pointers a SYCL kernel requires `accessor`s. These are bound to a device-side buffer inside a command group (= the SYCL equivalent of Alpaka's tasks) and passed by value to the kernel. Internally they encapsulate the corresponding OpenCL C pointer.
As we all know, Alpaka kernels aren't actually kernels but device-side functions. So why don't we just use the OpenCL C pointer instead and somehow hide the accessor in the backend? Because we can't:
> If an `accessor` has the access target `access::target::global_buffer`, `access::target::constant_buffer`, `access::target::local`, `access::target::image` or `access::target::image_array` then it is considered a device accessor, and therefore can only be used within a SYCL kernel function [...] -- SYCL Spec., 4.7.3.6
Since the `accessor`'s method for acquiring the OpenCL C pointer is only available when `accessTarget == access::target::global_buffer` or `accessTarget == access::target::constant_buffer`, we are unable to call it from the host; Alpaka's `getPtrNative()` function will simply never work with SYCL. We could hide the `accessor` itself behind `getPtrNative()`, but in this case the user won't actually acquire the pointer he might expect (to do pointer arithmetic or whatever).
What we can do is cast an accessor to a pointer inside the kernel. However, I don't see a way to deduce an accessor on the host side from the device pointer in the Alpaka signature above, since it is missing size information as well as address information (global vs. shared memory, for example) - never mind that accessing this device-side pointer in any way on the host is likely completely illegal in SYCL.
I'd be thankful for any input on this; for now I'm working on mirroring SYCL's accessor concepts in Alpaka for my SYCL prototype, e.g. `alpaka::mem::access::getAccess` instead of `alpaka::mem::view::getPtrNative`. This would return an Alpaka accessor, encapsulating a SYCL accessor in my backend and (for example) a CUDA pointer in the CUDA backend.
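For illustration, a minimal sketch of what such a pair of backend-specific accessors could look like; the names (`AccessorCuda`, `AccessorSycl`) are made up for this example and are not part of alpaka or the prototype:

```c++
#include <cstddef>
#include <alpaka/alpaka.hpp>
#include <CL/sycl.hpp>

// Hypothetical sketch only - not actual alpaka API.
template<typename TElem>
struct AccessorCuda
{
    TElem* ptr; // the CUDA backend can simply wrap the raw device pointer

    ALPAKA_FN_ACC auto operator[](std::size_t i) const -> TElem&
    {
        return ptr[i];
    }
};

template<typename TElem>
struct AccessorSycl
{
    // the SYCL backend wraps a device accessor instead
    cl::sycl::accessor<TElem, 1, cl::sycl::access::mode::read_write> acc;

    auto operator[](std::size_t i) const -> TElem&
    {
        return acc[i];
    }
};
```

A kernel would then take such an accessor argument instead of `TElem const * const`.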
Since SYCL is similar to OpenCL (e.g. abstract cl_mem objects), the missing piece for the cast-accessor-to-pointer workaround would be the redefinition of the kernel signature on the host side, so we know the types of the pointers we have to cast to on the device side. In OpenCL you do this via `clSetKernelArg` on the host side.
In TaskKernelSycl we actually know the types passed via createTaskKernel, but when we just get a SYCL accessor, the type information is lost. A workaround could be to put that accessor in another wrapper, where we collect the type information ourselves.

If it works, options like target, read, read_write, or write still cannot be set explicitly and not all can be derived. But in this case, const vs non-const might be enough to distinguish read vs read-write.
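As a rough illustration of that last point (a hypothetical helper, not existing alpaka or prototype code), the access mode could be derived from the const-ness of the element type seen in the createTaskKernel call:

```c++
#include <type_traits>
#include <CL/sycl.hpp>

// Hypothetical helper: const element type -> read access, otherwise read_write.
template<typename TElem>
constexpr cl::sycl::access::mode accessModeFor()
{
    return std::is_const<TElem>::value
        ? cl::sycl::access::mode::read
        : cl::sycl::access::mode::read_write;
}
```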
Jan's current approach with the explicit Alpaka accessor design would change the kernel interface so that pointers are forbidden for all backends.

Another workaround could come with function traits, so you would not need the wrapper for the accessor and could get the types from the kernel function directly, which also gives you the const/non-constness of the arguments. Not sure if this would work, I never tried to implement function traits.
https://functionalcpp.wordpress.com/2013/08/05/function-traits/ https://stackoverflow.com/questions/32036556/iterating-through-function-traits-arguments/32036651#32036651
Edit: will not work because of the templates, see below.
@j-stephan thanks for the detailed explanation. I have a couple of questions regarding your code example. Sorry if they are stupid, I am not familiar with SYCL technicalities:

* does `Kernel` miss a constructor that takes a `cl::sycl::accessor<TElem, 1, cl::sycl::access::mode::read>` by value and copies it to `bufAccA` (I suppose so)?
* are `bufAccA1` and `bufAccA2` of the same data type between each other and with the `Kernel::bufAccA`?

@tdd11235813 I believe function traits are probably tricky to use (or even worse) for Alpaka kernels, as their `operator()` are not functions but function templates, since at least the accelerator type is normally a template parameter.
ok, I thought to instantiate the function template with the type parameter pack we get from the createTaskKernel call. In this cumbersome way you would get the const/non-constness and the remaining types from the function traits.

Never mind, function traits will not work because of the templates and because of the differing types - accessors in the createTaskKernel call vs. raw pointers in the kernel function :/
That's what I originally thought. However, after some more thinking your point is probably valid and maybe with some wrapping and coding it is possible to get the info we need.
you mean to combine the two ideas, a wrapper for the SYCL accessor with the type information and instantiating the function template with the underlying types to construct the function traits?

* `getNativePtr()` returns a wrapper in the SYCL case (pointer arithmetic can be defined on the wrapper)
* `createTaskKernel` passes the args to TaskKernelSycl and to the entry kernel; on the host our wrapper types probably have to be parsed at compile time, also checking for const/non-constness via instantiated function template traits, to define the accessor properties (ok, probably too late, because we already have an accessor defined before)

Edit: @sbastrakov take your time, I am off for a few hours now.
@tdd11235813 no, just had an offline discussion with @psychocoderHPC, writing a summary. Edit: actually we realized that we missed an important detail, and so our solution only looked easier because of that. Accounting for everything would lead to basically the same as you just described. I will still write our thoughts separately in the next message.
@sbastrakov
* does `Kernel` miss a constructor that takes a `cl::sycl::accessor<TElem, 1, cl::sycl::access::mode::read>` by value and copies it to `bufAccA` (I suppose so)?
No, I'm just initializing the struct member directly. But it is a copy-by-value, yes.
* are `bufAccA1` and `bufAccA2` of the same data type between each other and with the `Kernel::bufAccA`?
That is an oversight on my side, sorry. `bufAccA1` and `bufAccA2` are actually different types. An `accessor` has the following definition:
```c++
namespace cl
{
    namespace sycl
    {
        template <typename dataT, int dimensions, access::mode accessmode,
                  access::target accessTarget = access::target::global_buffer,
                  access::placeholder isPlaceholder = access::placeholder::false_t>
        class accessor
        { /* ... */ };
    }
}
```
`bufAccA1` and `bufAccA2` are different specializations: `bufAccA1` with `access::placeholder::false_t` and `bufAccA2` with `access::placeholder::true_t`. `bufAccA1` and `Kernel::bufAccA` are of the same type. In order to work with `bufAccA2` the Kernel accessor would need to match `bufAccA2`'s definition (as seen in Alternative 2).
Just had an offline discussion with @psychocoderHPC about the pointers and SYCL accessors. What we eventually came to is similar to the earlier @tdd11235813's message.
We believe it is rather desirable to still (be able to) use raw pointers in kernels that a user writes. That is closer to the CUDA way, and also more straightforward for most people. With the SYCL requirements this, of course, requires alpaka to do the wrappers and conversions.
As justly pointed out, a single raw pointer to device memory is not enough to retrieve a SYCL or alpaka buffer. There are two alternative ways to get around it:

a) Modify (and appropriately rename) `getNativePtr()` to return an object with additional information, from which one can obtain a pointer, maybe even implicitly convertible to a pointer.

b) Stay with the raw pointers and use bookkeeping for mapping between pointers and buffers. This is vulnerable to pointer arithmetic done by a user, but that is probably not a good thing to do anyway, and we can at least detect it and throw exceptions. We already use similar bookkeeping in cupla.

Option b) can be done without any change in user code. Option a) theoretically introduces changes; however, in the typical scenario of `getNativePtr()` being called just to launch a kernel, no change is needed either once proper type conversions are provided by alpaka. I feel both are viable. I personally hate raw pointers and would prefer a) long-term, but maybe for now b) is better as it does not break anything.
Then the workflow is as follows: identify the kernel arguments that correspond to `getNativePtr()`. I have mistakenly called this "tricky or worse", but now it seems to be rather straightforward actually: just filter the types in the parameter pack depending on traits and remember their positions in the pack. Then we retrieve the buffers from the pointers / new types with one of the two mechanisms described above and create accessors.

Edit: it looks like that approach does not allow using statically allocated shared memory (in CUDA terms) in kernels, but for dynamically allocated shared memory I believe @psychocoderHPC has some idea.
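To make the bookkeeping idea (option b) a bit more concrete, here is a minimal sketch under the assumption that every buffer allocation registers its base pointer and size; all names (`BufferRegistry`, `registerBuffer`, `find`) are made up for this example:

```c++
#include <cstddef>
#include <map>
#include <stdexcept>

// Hypothetical bookkeeping: map a raw device pointer back to its allocation.
struct BufferRegistry
{
    struct Entry { void* base; std::size_t sizeBytes; /* plus a handle to the buffer */ };
    std::map<void*, Entry> entries; // keyed by base pointer

    void registerBuffer(void* base, std::size_t sizeBytes)
    {
        entries[base] = Entry{base, sizeBytes};
    }

    // Find the allocation containing ptr; throw if ptr belongs to no registered buffer.
    Entry const& find(void* ptr) const
    {
        auto it = entries.upper_bound(ptr); // first entry with base > ptr
        if(it == entries.begin())
            throw std::runtime_error("pointer does not belong to a registered buffer");
        --it; // last entry with base <= ptr
        auto const& e = it->second;
        auto const offset = static_cast<char*>(ptr) - static_cast<char*>(e.base);
        if(offset < 0 || static_cast<std::size_t>(offset) >= e.sizeBytes)
            throw std::runtime_error("pointer does not belong to a registered buffer");
        return e;
    }
};
```

This is the point where offsets produced by user pointer arithmetic could be detected and rejected with an exception, as mentioned above.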
@j-stephan thanks for the clarifications. I assumed the constructor due to `auto my_kernel = Kernel{bufAccA1};`, but no need for further explanations on that.
Sorry for a wall of text in my previous message. I think this is close to what you and @tdd11235813 described as well, plus just the bookkeeping option. I tried to write in detail so that some non-obvious issues may become more clear.
Thanks for the input, I'll give the bookkeeping option a try. FYI: implicit accessor-to-pointer casting doesn't work, this seems to be an oversight in the SYCL specification (see triSYCL/triSYCL#247). We have a `get_pointer` method in `accessor`'s interface which should do the job, though.
I'm facing two more obstacles on which I'd appreciate some input :-)
1. In order to do synchronization on the block level, users call `alpaka::block::sync::syncBlockThreads`. Interestingly, this is one of the few places where Alpaka doesn't provide the `TDim` template parameter. Unfortunately I need `TDim` for synchronization with SYCL because the equivalent to `__syncthreads` is a member of `cl::sycl::nd_item<int dimensions>`. The obvious solution would be the addition of `TDim` to the block-synchronization layer, but there might be something I'm missing. And this would of course propagate this parameter to the other backends which don't need it AFAIK, thus increasing complexity. Thoughts?

2. Shared memory is a severe problem. Currently Alpaka allows users to allocate shared memory inside the kernel (`alpaka::block::shared::st::allocVar`), which is of course perfectly legal with CUDA. However, this is impossible with SYCL (except for a special type of kernel execution in which we lose block-synchronization semantics); the shared memory must be defined as an `accessor` before the kernel launch (a sketch follows below). I thought of default-allocating a block of shared memory for each kernel (even those which don't need it), but this isn't really efficient and might have some unwanted side effects (it could affect the layout of an FPGA bitstream, for example). The other solution I can think of is to make in-kernel shared memory allocations illegal and force the user to define his shared memory beforehand, thus mirroring SYCL's behaviour (and breaking Alpaka's API). I'm out of ideas here, maybe someone else has an idea for a better solution.
not sure if there are any news on this yet.

Maybe it can be done as an optional parameter, so other implementations will not break.
AFAIK this would still break the API. Default template parameters still need empty brackets:
```c++
template <typename T = int>
class foo {};

auto my_foo = foo<>{};
```
Unfortunately I need `TDim` as a compile-time parameter; passing it by value to the constructor won't work.
The problem with block synchronization is solved. Surprisingly, it was sufficient to pass `TDim` as a template parameter to `BlockSyncSycl`; this worked for the SYCL case and didn't break anything else.
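Roughly speaking (a simplified sketch, not the actual implementation), `TDim` is needed because the dimensionality is part of the `nd_item` type:

```c++
#include <CL/sycl.hpp>

// Simplified sketch: TDim (e.g. an integral-constant-like alpaka dimension type)
// fixes the nd_item type at compile time so the barrier can be called on it.
template<typename TDim>
class BlockSyncSycl
{
public:
    explicit BlockSyncSycl(cl::sycl::nd_item<static_cast<int>(TDim::value)> item)
        : m_item{item}
    {}

    void syncBlockThreads() const
    {
        m_item.barrier(cl::sycl::access::fence_space::local_space);
    }

private:
    cl::sycl::nd_item<static_cast<int>(TDim::value)> m_item;
};
```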
The problem with static shared memory still persists. Right now the only way seems to be pre-allocation of shared memory and then handing out pointers as needed. I'll get in touch with SYCL spec. committee members about this, perhaps they know of a solution.
Atomics will be a problem, too. SYCL atomics require the user to specify the memory layer to operate on, e.g. `cl::sycl::atomic<int, cl::sycl::access::address_space::global_space>`. Alpaka is passing raw pointers around and the SYCL compiler seems unable to deduce which layer they reside on (tested with ComputeCpp). Right now I'm implementing everything as global atomics, but this won't work in the long term. This is something to pass on to the SYCL crowd, too, I believe.
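For illustration, this is what the address-space-aware atomic interface looks like in plain SYCL 1.2.1 (a hypothetical helper, not alpaka code):

```c++
#include <CL/sycl.hpp>

// The address space is part of the atomic's type, so a raw pointer alone does
// not carry enough information to pick the right specialization.
void atomicAddGlobalInt(cl::sycl::global_ptr<int> p, int value)
{
    auto a = cl::sycl::atomic<int, cl::sycl::access::address_space::global_space>{p};
    cl::sycl::atomic_fetch_add(a, value);
}
```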
thanks for the wrap-up. Just want to add that we also discussed the options we might have on the Alpaka side in the long term. Changing the raw-pointer interface in Alpaka in favor of buffer types would allow more expressive types and would help with the issues we have in SYCL.

Regarding static shared memory within kernels, it seems it simply does not exist in SYCL. @j-stephan also thought about interop with a device OpenCL kernel, where static shared memory exists, grabbing the shared memory address. It is not clear yet whether and how this would work. For the current work we can stick with dynamic shared memory (the application code will be modified).
> Changing the raw-pointer interface in Alpaka in favor of buffer types would allow more expressive types and would help with the issues we have in SYCL.
I'm in favor of this. :-) In the meantime I have submitted an issue about raw pointers to the SYCL specification repository, please see the discussion there: KhronosGroup/SYCL-Docs#21
> Regarding static shared memory within kernels, it seems it simply does not exist in SYCL.
I'm still not sure why, we might get an answer here: KhronosGroup/SYCL-Docs#20
> @j-stephan also thought about interop with a device OpenCL kernel, where static shared memory exists, grabbing the shared memory address. It is not clear yet whether and how this would work.
@DuncanMcBain from Codeplay informed me that this will likely not work on non-CPU devices, so basically everything we care about (with SYCL). I invited him to join the discussion here, so we might get some input from the SYCL point of view in the future. :-)
By the way, both CUDA and SYCL support half-precision floating point operations. Do we support a `half` type in Alpaka, too? I couldn't find any reference to a half type using the GitHub search for this repository.
@tdd11235813 and I just had a discussion about the support for different SYCL devices. Right now all Alpaka platforms are tied to specific device types (CUDA -> GPU, HIP -> GPU, OpenMP -> CPU, and so on). SYCL supports different device types and the user will probably want to choose his preferred device with Alpaka. Right now there is no mechanism to do this in Alpaka. I thought about using a `static` function in the platform layer which returns all platforms supported by this backend (which would be exactly 1 for CUDA, HIP, OpenMP and >= 1 for SYCL). The user can then iterate over the platforms (= the OpenCL runtimes) and associated devices to make his choice. But maybe there is a better way to handle this?
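For illustration, the plain SYCL enumeration such a function could build on looks like this (the helper name `enumerateSyclDevices` is made up):

```c++
#include <vector>
#include <CL/sycl.hpp>

// List every device of every available SYCL platform so the user can pick one.
std::vector<cl::sycl::device> enumerateSyclDevices()
{
    std::vector<cl::sycl::device> result;
    for(auto const& platform : cl::sycl::platform::get_platforms())
        for(auto const& device : platform.get_devices())
            result.push_back(device);
    return result;
}
```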
> By the way, both CUDA and SYCL support half-precision floating point operations. Do we support a `half` type in Alpaka, too? I couldn't find any reference to a half type using the GitHub search for this repository.
Currently, we have no alpaka-provided half-precision type. With half precision in CUDA you also have the problem that there are no overloaded operators for `half` available.
> Right now all Alpaka platforms are tied to specific device types (CUDA -> GPU, HIP -> GPU, OpenMP -> CPU, and so on). SYCL supports different device types and the user will probably want to choose his preferred device with Alpaka. Right now there is no mechanism to do this in Alpaka.
Good point. That's also something a potential OpenMP 4.5/5.0 offloading (target) backend would need to know (afaik, that's usually a compiler flag for OpenMP offloading targets).
> The main problem we both see is Alpaka's reliance on raw pointers
Intel is working on a proposal to add pointers to OpenCL and SYCL, with support for unified addressing and unified memory: usm.adoc.
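To illustrate what that would buy us, here is a rough sketch in the style of the current DPC++ / SYCL 2020 USM interface (the names here may differ from the linked proposal):

```c++
#include <CL/sycl.hpp>

int main()
{
    auto queue = cl::sycl::queue{};

    // a device allocation returns a plain pointer...
    int* data = cl::sycl::malloc_device<int>(1024, queue);

    // ...which can be used directly inside the kernel, no accessor needed
    queue.parallel_for(cl::sycl::range<1>{1024}, [=](cl::sycl::id<1> i)
    {
        data[i[0]] = static_cast<int>(i[0]);
    }).wait();

    cl::sycl::free(data, queue);
}
```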
If the goal of a SYCL backend is mainly to support Intel hardware, an alternative could be to use the "Level Zero" API: https://spec.oneapi.com/versions/latest/oneL0/index.html , which looks a lot like the CUDA driver API.
Thanks for the info @fwyzard !
While interesting it appears to me that we won't be able to use oneL0 anytime soon. From a quick glance at the docs I believe it requires precompiled kernels, meaning SPIR-V or the native device format. This doesn't fit our current approach.
Yes, you are right. It is more in line with OpenCL or the CUDA driver API than SYCL/oneAPI.
I am currently evaluating whether AMD's HC and Khronos' SYCL are suitable to be used as backends for alpaka. (@tdd11235813 is my supervisor in case someone wonders who I am.) This is the mapping I have come up with so far (based on the CUDA mapping available in the doc directory). Please feel free to discuss this mapping or to request feature mappings that are needed but not covered here.
### General Remarks

* alpaka functions live in the `alpaka` namespace.
* HC functions live in the `hc` namespace.
* SYCL functions live in the `cl::sycl` namespace.

### Programming Interface

#### Function Attributes

| alpaka | CUDA | HC |
| --- | --- | --- |
| `ALPAKA_FN_HOST` | `__host__ void f()` | `void f() [[cpu]]` |
| `ALPAKA_FN_ACC` | `__device__ void f()` | `void f() [[hc]]` |
| `ALPAKA_FN_ACC` | `__global__ void f()` | `void f() [[hc]]` |
| `ALPAKA_FN_HOST_ACC` | `__host__ __device__ void f()` | `void f() [[hc]] [[cpu]]` |
#### Memory

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `block::shared::st::allocVar<uint32_t, __COUNTER__>(acc)` | `__shared__` | `tile_static` | `access::target::local` |
| `ALPAKA_STATIC_ACC_MEM_CONSTANT` | `__constant__` | implied | `access::target::constant_buffer` |
| `ALPAKA_STATIC_ACC_MEM_GLOBAL` | `__device__` | unknown - global var? | |

#### Index / Work Division
| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `idx::getIdx<Block, Threads>(acc)` | `threadIdx.{x,y,z}` | `t_idx.local[dim]` | `nd_item.get_local_id(dim)` |
| `idx::getIdx<Grid, Blocks>(acc)` | `blockIdx.{x,y,z}` | `t_idx.tile[dim]` | `nd_item.get_group(dim)` |
| `workdiv::getWorkDiv<Block, Threads>(acc)` | `blockDim.{x,y,z}` | `t_ext.tile_dim[dim]` | `nd_item.get_local_range(dim)` |
| `workdiv::getWorkDiv<Grid, Blocks>(acc)` | `gridDim.{x,y,z}` | `t_ext[dim]` | `nd_item.get_group_range(dim)` |
#### Types

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `vec::Vec<TDim, TVal>` | `vec3` | `extent<dim>` or `tiled_extent<dim>` | `range<dim>` or `nd_range<dim>` |
### Runtime API

#### Device Management

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `dev::reset(device)` | `cudaDeviceReset()` | no equivalent | `device` leaves scope |
| `wait::wait(device)` | `cudaDeviceSynchronize()` | `accelerator_view::wait()` | |
| `size_t pltf::getDevCount<TPltf>()` | `int cudaGetDeviceCount()` | `vector<accelerator> accelerator::get_all()` | `vector<device> platform::get_devices()` or `vector<device> context::get_devices()` |
| `acc::getAccDevProps(dev)` | `cudaGetDeviceProperties()` | `accelerator_view::get_PROPERTY`* | `device::get_info()` |

\* HC accelerators have getter functions for each queryable property.

#### Error Handling

alpaka, HC and SYCL use exceptions; these should be easily convertible. CUDA uses a C-style error return type.
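For example, SYCL's asynchronous errors can be funneled into ordinary C++ exceptions via an `async_handler` passed to the queue (a small illustrative snippet, not part of the mapping above):

```c++
#include <exception>
#include <CL/sycl.hpp>

auto make_queue()
{
    return cl::sycl::queue{
        cl::sycl::default_selector{},
        [](cl::sycl::exception_list errors)
        {
            // rethrow asynchronous SYCL errors as ordinary C++ exceptions
            for(std::exception_ptr const& e : errors)
                std::rethrow_exception(e);
        }};
}
```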
#### Queue Management

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `queue::enqueue(queue, [](){do_something();})` | `cudaStreamAddCallback()` | `auto future = parallel_for_each(accelerator_view, extent<dim>{...}, []() { do_something();})` | `queue::submit(T cgf)`* |
| `queue = queue::QueueCudaRtAsync(device);` | `cudaStreamCreate()` | `accelerator_view = accelerator::get_default_view()` | `queue` constructor |
| `queue = queue::QueueCudaRtSync(device);` | `cudaStreamCreate()` | If you want synchronous behaviour, leave out the `auto future =` when enqueuing the kernel. The `future`'s destructor will automatically synchronize. | `queue` is async by default. Call `queue::wait_and_throw()` for synchronization. |
| `bool queue::empty(queue)` | `cudaStreamQuery()` | `completion_future::is_ready()` | |
| `void wait::wait(queue)` | `cudaStreamSynchronize()` | `completion_future::wait()`, `completion_future::wait_for(std::chrono::duration)`, `completion_future::wait_until(std::chrono::time_point)` | `queue::wait_and_throw()` |
| `void wait::wait(queue, event)` | `cudaStreamWaitEvent()` | needs workaround | |

\* `T cgf` refers to a Command Group Function object and combines the actual kernel with its requirements (e.g. memcpys). This function object has the signature `(handler& cgh)` and can be an `operator()` or a lambda.

#### Event Management
HC does not have a concept of events. A similar behavior can be achieved by using the `completion_future` returned by `parallel_for_each()`, which mirrors `std::future`'s behavior.

| alpaka | CUDA | SYCL |
| --- | --- | --- |
| `auto event = event::Event<TQueue>{dev};` | `cudaEventCreate()` | |
| `bool event::test(event)` | `cudaEventQuery()` | `info::event_command_status event::get_info<info::event::command_execution_status>()` |
| `void queue::enqueue(queue, event)` | `cudaEventRecord()` | `auto event = queue::submit()` |
| `wait::wait(event)` | `cudaEventSynchronize()` | `event::wait_and_throw()` |
#### Memory Management

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `mem::buf::alloc<TElement>(device, extents1D)` | `cudaMalloc()` | `array<T, 1>{extent<1>{size}, accelerator_view}` | `buffer<T>{range<1>{size}}` |
| `mem::buf::alloc<TElement>(device, extents3D)` | `cudaMalloc3D()` | `array<T, 3>{extent<3>{size0, size1, size2}, accelerator_view}` | `buffer<T>{range<3>{size0, size1, size2}}` |
| `mem::buf::alloc<TElement>(device, extents)` | `cudaMallocHost()` | `am_alloc(size, accelerator, amHostPinned)`* | |
| `mem::buf::alloc<TElement>(device, extents2D)` | `cudaMallocPitch()` | `array<T, 2>{extent<2>{size0, size1}, accelerator_view}` | `buffer<T>{range<2>{size0, size1}}` |
| `dev::getMemBytes()` | `cudaMemGetInfo()` | `accelerator::get_dedicated_memory()`** | `device::get_info<info::device::global_mem_size>()` |
| `dev::getFreeMemBytes()` | `cudaMemGetInfo()` | needs workaround | |
| `mem::view::copy(memBufDst, memBufSrc, extents1D)` | `cudaMemcpy()` | `copy(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` followed by `queue.wait_and_throw()` |
| `mem::view::copy(memBufDst, memBufSrc, extents2D)` | `cudaMemcpy2D()` | `copy(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` followed by `queue.wait_and_throw()` |
| `mem::view::copy(memBufDst, memBufSrc, extents2D, queue)` | `cudaMemcpy2DAsync()` | `copy_async(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` |
| `mem::view::copy(memBufDst, memBufSrc, extents3D)` | `cudaMemcpy3D()` | `copy(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` followed by `queue.wait_and_throw()` |
| `mem::view::copy(memBufDst, memBufSrc, extents3D, queue)` | `cudaMemcpy3DAsync()` | `copy_async(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` |
| `mem::view::copy(memBufDst, memBufSrc, extents3D)` | `cudaMemcpy3DPeer()` | `copy(src, dst)`*** | |
| `mem::view::copy(memBufDst, memBufSrc, extents3D, queue)` | `cudaMemcpy3DPeerAsync()` | `copy_async(src, dst)`*** | |
| `mem::view::copy(memBufDst, memBufSrc, extents1D, queue)` | `cudaMemcpyAsync()` | `copy_async(src, dst)` | `accessor`; explicit: `handler::copy(src, dst)` |
| `mem::view::copy(memBufDst, memBufSrc, extents1D)` | `cudaMemcpyPeer()` | `copy(src, dst)`*** | |
| `mem::view::copy(memBufDst, memBufSrc, extents1D, queue)` | `cudaMemcpyPeerAsync()` | `copy_async(src, dst)`*** | |
| `mem::view::set(memBufDst, byte, extents1D)` | `cudaMemset()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` followed by `queue.wait_and_throw()` |
| `mem::view::set(memBufDst, byte, extents2D)` | `cudaMemset2D()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` followed by `queue.wait_and_throw()` |
| `mem::view::set(memBufDst, byte, extents2D, queue)` | `cudaMemset2DAsync()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` |
| `mem::view::set(memBufDst, byte, extents3D)` | `cudaMemset3D()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` followed by `queue.wait_and_throw()` |
| `mem::view::set(memBufDst, byte, extents3D, queue)` | `cudaMemset3DAsync()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` |
| `mem::view::set(memBufDst, byte, extents1D, queue)` | `cudaMemset()` | no equivalent | `accessor`; explicit: `handler::fill(dst, val)` followed by `queue.wait_and_throw()` |

\* `am_alloc` returns a raw pointer that needs to be freed with `am_free`.

\*\* Returns the value in KiB.

\*\*\* Assumed - see below.
#### Execution Control

| alpaka | CUDA | HC | SYCL |
| --- | --- | --- | --- |
| `kernel::exec<TAcc>(queue, workDiv, kernel, params...)`, `kernel::BlockSharedExternMemSizeBytes<TKernel<TAcc>>::getBlockSharedExternMemSizeBytes<...>(...)` | `cudaLaunchKernel()` | `parallel_for_each(accelerator_view, extent, kernel)` | `parallel_for(range<dim>(work_size), kernelFunc)` or `parallel_for_work_group(range<dim>(work_size), range<dim>(block_size), kernelFunc)` |

#### Peer Device Memory Access

While HC `accelerator`s know their peers, there are no explicit API functions to access peer memory. One can assume that P2P copies are done implicitly, but this requires further testing.

The SYCL specification does not even mention the word "peer". Some implementations might implicitly do P2P copies if available, but this requires further testing.
Edit (2018-11-07): Fixed some typos and formatting.