KhronosGroup / SYCL-Docs

SYCL Open Source Specification
Other
116 stars 68 forks source link

Consider adding support for explicit strided host to device copies #91

Open psalz opened 4 years ago

psalz commented 4 years ago

I sometimes find myself in a situation where I'd like to copy some strided data from the host to a SYCL buffer on the device, for subsequent use in a kernel. As it turns out however, the existing APIs for explicit memory operations only allow me to pass a contiguous host pointer as the source of a copy.

I had previously been playing with the idea of using a temporary SYCL buffer constructed with a pointer to my (strided) host memory, from which I could then create a host accessor to use as the src in an explicit copy operation. However, @AerialMantis pointed out to me that the copy is considered a kernel executed on the device (see section 4.8.6), and using host accessors inside kernel functions results in undefined behavior (as they are only allowed to be used on the host, see section 4.7.6.3).

Now, maybe one way of doing this could be to use a device accessor for my src as well, and hope that the SYCL runtime will recognize that instead of doing a H -> D -> D copy, this could be optimized to a strided H -> D copy. However, in doing so I'm pretty much at the mercy of the implementors, and have no guarantee about how much memory will actually be used for this operation (other than an upper bound). More likely than not, in any of the current implementations, the entire temporary host buffer would first be copied to the device (correct me if I'm wrong!).

The other option, which I'm using now, is to first do a host-side copy of the strided data into a contiguous staging buffer, and using that as the src for the copy operation. That is of course not ideal, and if host memory gets tight, might also not be feasible (especially if the implementation uses another pinned staging buffer internally...).

Ultimately I think, given that both OpenCL and CUDA provide APIs for doing strided H -> D copies, SYCL could also benefit from having something like this.

As I recently ran into this issue again, it got me thinking: Why not simply provide the ability to create a SYCL accessor for arbitrary user pointers? Like so:

float* my_ptr = malloc(128 * 128 * sizeof(float));
// ...
cl::sycl::accessor<
 float, 2, cl::sycl::access::mode::read_write /* mode is probably not needed */,
 cl::sycl::access::target::user_pointer>
my_accessor(my_ptr,
 cl::sycl::range<2>(128, 128) /* range of data pointed to */,
 cl::sycl::range<2>(64, 128) /* optional sub-range to access */,
 cl::sycl::id<2>(32, 0) /* optional offset to access */);

With this API, my_accessor could then be used as the src in an explicit copy, implying that the copy should be strided. As an added bonus, such an accessor would allow users to index into their self-managed data just like they can for SYCL buffers, without having to worry about the data's layout in memory.

keryell commented 4 years ago

I am working on some extensions based on mdspan C++ proposal to express this kind of feature for some embedded devices to leverage low level DMA features. So I am very supportive. I recently switched from https://github.com/ORNL/cpp-proposals-pub.git to https://github.com/kokkos/mdspan Look at this mdspan C++ proposal if you are not aware of it and think about it. I hope I will open-source our implementation this year for CPU at least. But for now it is used more in a pipe context. So you can make some API proposals in the context of buffers and accessors...

psalz commented 4 years ago

Interesting! From what I understand, a basic_mdspan with a layout_stride could be used in place of my proposed user pointer accessor. I can see that there might be potentially many more use cases for mdspan in SYCL (maybe they could even replace SYCL accessors altogether at some point..?) and will be interested to see how this evolves.

Are there any publicly accessible proposals for the extensions you mentioned?

keryell commented 4 years ago

Yes, the mdspan with some properties could replace the accessors at some point. Sorry, there is nothing public yet for the DMA extensions on my side.

fraggamuffin commented 2 years ago

still need copy interface even with mdspan DB+TD working on this