geggo / gpyfft

Python wrapper for the OpenCL FFT library clFFT
GNU Lesser General Public License v3.0

Give user the ability to specify buffer offsets in enqueue_transform #29

Closed SyamGadde closed 7 years ago

SyamGadde commented 7 years ago

Sorry for the raft of pull requests -- these are all modifications I've been using successfully for a while and have found very useful; I would welcome your input.

This set of commits adds in_offsets and out_offsets arguments to enqueue_transform(); when specified, they trigger the creation of sub-buffers under the hood. For arrays with multiple non-transformed axes that can't be "collapsed" into a single axis, this lets the user slice the input array into appropriate chunks (using the standard slice operator) and simply pass x.base_data and x.offset as the inputs/outputs, as long as each offset is a multiple of CL_DEVICE_MEM_BASE_ADDR_ALIGN. It also allows transforming arrays that have non-zero offsets. For example, to partially address issue #10 without creating copies of the original array or managing our own OpenCL sub-buffers:

import gpyfft
import pyopencl
import pyopencl.array
context = pyopencl.create_some_context()
queue = pyopencl.CommandQueue(context)
import numpy
a = numpy.zeros([1024, 256, 13], order='F', dtype=numpy.complex64)
a[:,:,:] = numpy.arange(256)[numpy.newaxis, :, numpy.newaxis]
a_gpu = pyopencl.array.to_device(queue, a)
b_gpu = pyopencl.array.zeros_like(a_gpu)

# notice here we use a sliced array which has a non-zero offset
transform = gpyfft.fft.FFT(context, queue, a_gpu[:,:,1], axes = (1,), out_array=b_gpu[:,:,1])

# here we transform using the high-level API (the non-zero offset would have failed here before these commits)
(event,) = transform.enqueue()
event.wait()
result1 = b_gpu.get()[:,:,1]

# here we transform all 13 chunks using the low-level API (so we don't have to re-create the plan)
# (all but the first chunk have non-zero offsets)
events = [
    transform.plan.enqueue_transform(
        (queue,), (a_gpu.base_data,), (b_gpu.base_data,),
        in_offsets=(a_gpu[:,:,i].offset,),
        out_offsets=(b_gpu[:,:,i].offset,))[0]
    for i in range(13)
]
for event in events:
    event.wait()
result2 = b_gpu.get()
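
For reference, the .offset used above is just the byte offset of the slice into its parent buffer. Here is a numpy-only sketch (no GPU required) of the value a_gpu[:,:,1].offset takes for this Fortran-ordered array -- the computation is illustrative, not gpyfft code:

```python
import numpy as np

a = np.zeros([1024, 256, 13], order='F', dtype=np.complex64)
chunk = a[:, :, 1]
# byte offset of the slice into the parent array's buffer
offset = chunk.__array_interface__['data'][0] - a.__array_interface__['data'][0]
# in Fortran order, stepping the last axis skips prod(shape[:-1]) elements
assert offset == 1024 * 256 * a.itemsize  # 2097152 bytes
```

Note that this offset must still be a multiple of CL_DEVICE_MEM_BASE_ADDR_ALIGN (expressed in bits; in pyopencl it can be queried as device.mem_base_addr_align) for the sub-buffer creation to succeed.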
geggo commented 7 years ago

Hi, thanks for providing a way to allow more general memory layouts and resolve issue #10. I need more time to look into this.

After a quick glance at the code, I have some comments/questions:

SyamGadde commented 7 years ago

Sure, I don't see why it can't be done all in the high-level API -- in that case the same thing could be done in enqueue() instead. I'll amend this in a bit...

SyamGadde commented 7 years ago

OK, one reason I changed the low-level enqueue_transform was because that was the only way I could re-use the baked plan for multiple arrays (or multiple slices of the same array). Would it be acceptable to extend the high-level enqueue() to take optional data/result keyword arguments? If specified, they would override the stored data and result attributes and allow users to call enqueue repeatedly. It could double check that the shape/strides/dtype match the original data used to create the plan. If those arguments are not specified, it would use the stored values as normal.
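
The double-check mentioned here could be as simple as comparing the array metadata against the arrays the plan was baked for. A minimal sketch (the helper name is hypothetical, not gpyfft API):

```python
import numpy as np

def check_matches_plan(planned, candidate):
    """Raise if `candidate` cannot reuse a plan baked for `planned`."""
    for attr in ('shape', 'strides', 'dtype'):
        if getattr(planned, attr) != getattr(candidate, attr):
            raise ValueError('plan mismatch on %s: %r vs %r'
                             % (attr, getattr(planned, attr),
                                getattr(candidate, attr)))

a = np.zeros((1024, 256), dtype=np.complex64, order='F')
b = np.zeros((1024, 256), dtype=np.complex64, order='F')
check_matches_plan(a, b)  # same layout: passes silently
```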

geggo commented 7 years ago

On 25.01.2017 at 16:21, Syam Gadde notifications@github.com wrote:

> OK, one reason I changed the low-level enqueue_transform was because that was the only way I could re-use the baked plan for multiple arrays (or multiple slices of the same array). Would it be acceptable to extend the high-level enqueue() to take optional data/result keyword arguments? If specified, they would override the stored data and result attributes and allow users to call enqueue repeatedly. It could double check that the shape/strides/dtype match the original data used to create the plan. If those arguments are not specified, it would use the stored values as normal.

Ah, I see, you want to iterate over slices while keeping the plan. Adding optional in/out arrays as replacements for the stored ones for the high-level enqueue is a possibility, ok for me.

Ultimately, as I understand it, all this is needed to allow for an extended batching scheme with more than one non-transformed axis. In the long term the high-level enqueue should take care of this; perhaps it is better to just add an "enqueue_transform_with_arrays_given" method as a step towards this.

Gregor


SyamGadde commented 7 years ago

I agree that automatically batching across non-transformed axes would be great to integrate into gpyfft. I've created a somewhat generic mechanism to do so outside of gpyfft, but I'm afraid it's not very elegant -- perhaps it's because I haven't yet stumbled on the "trivial" solution. Anyway, when I clean it up, I'll be happy to share it.

There may still be other reasons to re-use a plan created through the high-level API, so I'll create a separate enqueue function (that takes arrays), as you suggest, and test it.

SyamGadde commented 7 years ago

The latest commits limit the entirety of the changes to fft.py. I created a new function, enqueue_arrays, which takes data and (optional) result arrays that override the stored data/result arrays. enqueue now simply calls enqueue_arrays with the stored arrays, since they otherwise do the same thing.

I also found a pyopencl function that creates the sub-buffer for me so no need to call out to the OpenCL library directly.

The responsibility for automatic non-transformed axis batching would need to be shared between __init__ and enqueue_arrays because __init__ creates the clFFT plan, and you need to know how you will batch the array before creating a plan.
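
To illustrate why the plan needs this knowledge up front: whether the non-transformed axes can be folded into a single clFFT batch depends on the array's strides, which are fixed when the plan is created. A numpy-only sketch of such a check (a hypothetical helper for illustration, not part of gpyfft):

```python
import numpy as np

def collapsible(arr, batch_axes):
    """True if the given non-transformed axes tile memory contiguously,
    so a single batched plan could cover them without per-chunk loops."""
    axes = sorted(batch_axes, key=lambda ax: arr.strides[ax])
    return all(arr.strides[hi] == arr.strides[lo] * arr.shape[lo]
               for lo, hi in zip(axes, axes[1:]))

a = np.zeros((1024, 256, 13), order='F', dtype=np.complex64)
# transforming axis 1: axes 0 and 2 are separated in memory by axis 1,
# so they cannot be collapsed -- hence the per-chunk loop in the example
print(collapsible(a, [0, 2]))   # False

c = np.zeros((4, 5, 6), order='C', dtype=np.complex64)
# transforming axis 2: axes 0 and 1 are adjacent in memory
print(collapsible(c, [0, 1]))   # True
```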

Here is an example of usage:

import gpyfft
import numpy
import numpy.fft
import pyopencl
import pyopencl.array
context = pyopencl.create_some_context()
queue = pyopencl.CommandQueue(context)
a = numpy.zeros([1024, 256, 13], order='F', dtype=numpy.complex64)
a[:,:,:] = numpy.arange(256)[numpy.newaxis, :, numpy.newaxis]
a_gpu = pyopencl.array.to_device(queue, a)
b_gpu = pyopencl.array.zeros_like(a_gpu)
b2_gpu = pyopencl.array.zeros_like(a_gpu)

# notice here we use a sliced array which has a non-zero offset
transform = gpyfft.fft.FFT(context, queue, a_gpu[:,:,1], axes = (1,), out_array=b_gpu[:,:,1])

# here we transform using the high-level API (the non-zero offset would have failed here before these commits)
(event,) = transform.enqueue()
event.wait()
result1 = b_gpu.get()[:,:,1]
assert numpy.allclose(result1, numpy.fft.fftn(a[:,:,1], axes=(1,)))

# here we transform all 13 chunks
# (all but the first chunk have non-zero offsets)
events = [
    transform.enqueue_arrays(data=a_gpu[:,:,i], result=b2_gpu[:,:,i])[0]
    for i in range(13)
]
for event in events:
    event.wait()
result2 = b2_gpu.get()
assert numpy.allclose(result2, numpy.fft.fftn(a, axes=(1,)))
geggo commented 7 years ago

Excellent, thanks for your contribution!