SyamGadde closed this pull request 7 years ago.
Hi, thanks for providing a way to allow more general memory layouts and resolve issue #10. I need more time to look into this.
After having a quick glance at the code I have some comments/questions:
Sure, I don't see why it can't be done all in the high-level API -- in that case the same thing could be done in enqueue() instead. I'll amend this in a bit...
OK, one reason I changed the low-level `enqueue_transform` was because that was the only way I could re-use the baked plan for multiple arrays (or multiple slices of the same array). Would it be acceptable to extend the high-level `enqueue()` to take optional data/result keyword arguments? If specified, they would override the stored `data` and `result` attributes and allow users to call `enqueue` repeatedly. It could double-check that the shape/strides/dtype match the original data used to create the plan. If those arguments are not specified, it would use the stored values as normal.
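Roughly, I imagine the validation looking something like this sketch (names illustrative, not final code):

```python
def _check_matches(new, stored):
    # Sketch of the proposed check (illustrative, not actual gpyfft code):
    # an override array must have the same layout as the array the clFFT
    # plan was originally baked for.
    if (new.shape != stored.shape
            or new.strides != stored.strides
            or new.dtype != stored.dtype):
        raise ValueError('array does not match the one used to create the plan')
```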
On 25.01.2017 at 16:21, Syam Gadde notifications@github.com wrote:
> OK, one reason I changed the low-level `enqueue_transform` was because that was the only way I could re-use the baked plan for multiple arrays (or multiple slices of the same array). Would it be acceptable to extend the high-level `enqueue()` to take optional data/result keyword arguments? If specified, they would override the stored `data` and `result` attributes and allow users to call `enqueue` repeatedly. It could double-check that the shape/strides/dtype match the original data used to create the plan. If those arguments are not specified, it would use the stored values as normal.
Ah, I see, you want to iterate over slices while keeping the plan. Adding optional in/out arrays as replacements for the stored ones for the high-level enqueue is a possibility, ok for me.
Ultimately, as I understood it, all this is needed to allow for an extended batching scheme with more than one non-transformed axis. In the long term the high-level enqueue should take care of this; perhaps it is better to just add an `enqueue_transform_with_arrays_given` method as a step towards this.
Gregor
I agree that automatically batching across non-transformed axes would be great to integrate into gpyfft. I've created a somewhat generic mechanism to do so outside of gpyfft, but I'm afraid it's not very elegant -- perhaps it's because I haven't yet stumbled on the "trivial" solution. Anyway, when I clean it up, I'll be happy to share it.
There may still be other reasons to re-use a plan created through the high-level API, so I'll create a separate enqueue function (that takes arrays), as you suggest, and test it.
Latest commits limit the entirety of the changes to fft.py. I created a new function `enqueue_arrays`, which takes data and (optional) result arrays that override the stored data/result arrays. `enqueue` now just calls `enqueue_arrays` with the stored arrays, since the two would otherwise duplicate the same logic.
I also found a pyopencl function that creates the sub-buffer for me, so there is no need to call out to the OpenCL library directly.
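For reference, a minimal sketch of what that sub-buffer creation involves, assuming pyopencl's `Buffer.get_sub_region` (this is illustrative, not the exact code in the commits):

```python
import pyopencl

def sub_buffer_for(view):
    # Sketch only: carve a sub-buffer out of a sliced pyopencl array's
    # underlying buffer. Buffer.get_sub_region wraps clCreateSubBuffer,
    # and pyopencl reports CL_DEVICE_MEM_BASE_ADDR_ALIGN in *bits*.
    align_bytes = view.queue.device.mem_base_addr_align // 8
    if view.offset % align_bytes != 0:
        raise ValueError('offset %d is not %d-byte aligned'
                         % (view.offset, align_bytes))
    size = view.base_data.size - view.offset
    return view.base_data.get_sub_region(view.offset, size)
```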
The responsibility for automatic non-transformed-axis batching would need to be shared between `__init__` and `enqueue_arrays`, because `__init__` creates the clFFT plan, and you need to know how you will batch the array before creating a plan.
Here is an example of usage:
```python
import gpyfft
import numpy
import numpy.fft
import pyopencl
import pyopencl.array

context = pyopencl.create_some_context()
queue = pyopencl.CommandQueue(context)

a = numpy.zeros((1024, 256, 13), order='F', dtype=numpy.complex64)
a[:, :, :] = numpy.arange(256)[numpy.newaxis, :, numpy.newaxis]
a_gpu = pyopencl.array.to_device(queue, a)
b_gpu = pyopencl.array.zeros_like(a_gpu)
b2_gpu = pyopencl.array.zeros_like(a_gpu)

# notice here we use a sliced array which has a non-zero offset
transform = gpyfft.fft.FFT(context, queue, a_gpu[:, :, 1], axes=(1,),
                           out_array=b_gpu[:, :, 1])

# here we transform using the high-level API (the non-zero offset
# would have failed here before these commits)
(event,) = transform.enqueue()
event.wait()
result1 = b_gpu.get()[:, :, 1]
assert numpy.allclose(result1, numpy.fft.fftn(a[:, :, 1], axes=(1,)))

# here we transform all 13 chunks
# (all but the first chunk have non-zero offsets)
events = [
    transform.enqueue_arrays(data=a_gpu[:, :, i], result=b2_gpu[:, :, i])[0]
    for i in range(13)
]
for event in events:
    event.wait()
result2 = b2_gpu.get()
assert numpy.allclose(result2, numpy.fft.fftn(a, axes=(1,)))
```
Excellent, thanks for your contribution!
Sorry for the raft of pull requests -- these are all modifications I've been using successfully for a while and have found very useful, and would welcome your input.
This set of commits adds `in_offsets` and `out_offsets` to `enqueue_transform()`, which, if specified, trigger the creation of sub-buffers under the hood. For those arrays that include multiple non-transformed axes that can't be "collapsed" into a single axis, this change allows the user to slice the input array into appropriate chunks (using the standard slice operator) and simply send `x.base_data` and `x.offset` as the inputs/outputs -- though only as long as the offset is a multiple of `CL_DEVICE_MEM_BASE_ADDR_ALIGN`. It also allows for transforming arrays that have non-zero offsets. For example, to partially address issue #10 without needing to create copies of the original array or create our own OpenCL sub-buffers:
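A sketch of that usage (assuming the high-level `FFT` object exposes its baked plan as `transform.plan` and that `in_offsets`/`out_offsets` take one byte offset per buffer; exact signatures may differ):

```python
import numpy
import pyopencl
import pyopencl.array
import gpyfft

context = pyopencl.create_some_context()
queue = pyopencl.CommandQueue(context)

a = numpy.zeros((1024, 256, 13), order='F', dtype=numpy.complex64)
a_gpu = pyopencl.array.to_device(queue, a)
b_gpu = pyopencl.array.zeros_like(a_gpu)

# Bake one plan from a representative 2D slice.
transform = gpyfft.fft.FFT(context, queue, a_gpu[:, :, 0], axes=(1,),
                           out_array=b_gpu[:, :, 0])

# Re-use the plan for every chunk by passing the base buffers plus each
# chunk's byte offset; sub-buffers are created under the hood.
for i in range(13):
    x, y = a_gpu[:, :, i], b_gpu[:, :, i]
    transform.plan.enqueue_transform((queue,), (x.base_data,), (y.base_data,),
                                     in_offsets=(x.offset,),
                                     out_offsets=(y.offset,))
```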