Open psalz opened 4 years ago
I am working on some extensions based on mdspan
C++ proposal to express this kind of feature for some embedded devices to leverage low level DMA features. So I am very supportive.
I recently switched from https://github.com/ORNL/cpp-proposals-pub.git to https://github.com/kokkos/mdspan
Look at this mdspan
C++ proposal if you are not aware of it and think about it.
I hope I will open-source our implementation this year for CPU at least.
But for now it is used more in a pipe context. So you can make some API proposals in the context of buffers and accessors...
Interesting! From what I understand, a basic_mdspan
with a layout_stride
could be used in place of my proposed user pointer accessor. I can see that there might be potentially many more use cases for mdspan
in SYCL (maybe they could even replace SYCL accessors altogether at some point..?) and will be interested to see how this evolves.
Are there any publicly accessible proposals for the extensions you mentioned?
Yes, the mdspan
with some properties could replace the accessors at some point.
Sorry, there is nothing public yet for the DMA extensions on my side.
still need copy interface even with mdspan DB+TD working on this
I sometimes find myself in a situation where I'd like to copy some strided data from the host to a SYCL buffer on the device, for subsequent use in a kernel. As it turns out however, the existing APIs for explicit memory operations only allow me to pass a contiguous host pointer as the source of a copy.
I had previously been playing with the idea of using a temporary SYCL buffer constructed with a pointer to my (strided) host memory, from which I could then create a host accessor to use as the
src
in an explicit copy operation. However, @AerialMantis pointed out to me that the copy is considered a kernel executed on the device (see section 4.8.6), and using host accessors inside kernel functions results in undefined behavior (as they are only allowed to be used on the host, see section 4.7.6.3).Now, maybe one way of doing this could be to use a device accessor for my
src
as well, and hope that the SYCL runtime will recognize that instead of doing a H -> D -> D copy, this could be optimized to a strided H -> D copy. However, in doing so I'm pretty much at the mercy of the implementors, and have no guarantee about how much memory will actually be used for this operation (other than an upper bound). More likely than not, in any of the current implementations, the entire temporary host buffer would first be copied to the device (correct me if I'm wrong!).The other option, which I'm using now, is to first do a host-side copy of the strided data into a contiguous staging buffer, and using that as the
src
for the copy operation. That is of course not ideal, and if host memory gets tight, might also not be feasible (especially if the implementation uses another pinned staging buffer internally...).Ultimately I think, given that both OpenCL and CUDA provide APIs for doing strided H -> D copies, SYCL could also benefit from having something like this.
As I recently ran into this issue again, it got me thinking: Why not simply provide the ability to create a SYCL accessor for arbitrary user pointers? Like so:
With this API,
my_accessor
could then be used as thesrc
in an explicit copy, implying that the copy should be strided. As an added bonus, such an accessor would allow users to index into their self-managed data just like they can for SYCL buffers, without having to worry about the data's layout in memory.