SyamGadde closed this pull request 7 years ago.
Hi, thanks for providing a way to allow more general memory layouts and resolve issue #10. I need more time to look into this.
After having a quick glance at the code I have some comments/questions:
Sure, I don't see why it can't be done all in the high-level API -- in that case the same thing could be done in enqueue() instead. I'll amend this in a bit...
OK, one reason I changed the low-level `enqueue_transform` was because that was the only way I could re-use the baked plan for multiple arrays (or multiple slices of the same array). Would it be acceptable to extend the high-level `enqueue()` to take optional data/result keyword arguments? If specified, they would override the stored `data` and `result` attributes and allow users to call `enqueue` repeatedly. It could double-check that the shape/strides/dtype match the original data used to create the plan. If those arguments are not specified, it would use the stored values as normal.
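Roughly, I imagine the validation looking something like this sketch (names illustrative, not final code):

```python
def _check_matches(new, stored):
    # Sketch of the proposed check (illustrative, not actual gpyfft code):
    # an override array must have the same layout as the array the clFFT
    # plan was originally baked for.
    if (new.shape != stored.shape
            or new.strides != stored.strides
            or new.dtype != stored.dtype):
        raise ValueError('array does not match the one used to create the plan')
```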
On 25.01.2017 at 16:21, Syam Gadde notifications@github.com wrote:
> OK, one reason I changed the low-level `enqueue_transform` was because that was the only way I could re-use the baked plan for multiple arrays (or multiple slices of the same array). Would it be acceptable to extend the high-level `enqueue()` to take optional data/result keyword arguments? If specified, they would override the stored `data` and `result` attributes and allow users to call `enqueue` repeatedly. It could double-check that the shape/strides/dtype match the original data used to create the plan. If those arguments are not specified, it would use the stored values as normal.
Ah, I see, you want to iterate over slices while keeping the plan. Adding optional in/out arrays as replacements for the stored ones for the high-level enqueue is a possibility, ok for me.
Ultimately, as I understood it, all this is needed to allow for an extended batching scheme with more than one non-transformed axis. In the long term the high-level enqueue should take care of this; perhaps it is better to just add an `enqueue_transform_with_arrays_given` method as a step towards this.
Gregor
I agree that automatically batching across non-transformed axes would be great to integrate into gpyfft. I've created a somewhat generic mechanism to do so outside of gpyfft, but I'm afraid it's not very elegant -- perhaps it's because I haven't yet stumbled on the "trivial" solution. Anyway, when I clean it up, I'll be happy to share it.
There may still be other reasons to re-use a plan created through the high-level API, so I'll create a separate enqueue function (that takes arrays), as you suggest, and test it.
Latest commits limit the entirety of the changes to fft.py. I created a new function `enqueue_arrays`, which takes data and (optional) result arrays that override the stored data/result arrays. `enqueue` now just calls `enqueue_arrays` with the stored arrays, since the two would otherwise duplicate the same logic.
I also found a pyopencl function that creates the sub-buffer for me, so there is no need to call out to the OpenCL library directly.
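For reference, a minimal sketch of what that sub-buffer creation involves, assuming pyopencl's `Buffer.get_sub_region` (this is illustrative, not the exact code in the commits):

```python
import pyopencl

def sub_buffer_for(view):
    # Sketch only: carve a sub-buffer out of a sliced pyopencl array's
    # underlying buffer. Buffer.get_sub_region wraps clCreateSubBuffer,
    # and pyopencl reports CL_DEVICE_MEM_BASE_ADDR_ALIGN in *bits*.
    align_bytes = view.queue.device.mem_base_addr_align // 8
    if view.offset % align_bytes != 0:
        raise ValueError('offset %d is not %d-byte aligned'
                         % (view.offset, align_bytes))
    size = view.base_data.size - view.offset
    return view.base_data.get_sub_region(view.offset, size)
```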
The responsibility for automatic non-transformed-axis batching would need to be shared between `__init__` and `enqueue_arrays`, because `__init__` creates the clFFT plan, and you need to know how you will batch the array before creating a plan.
Here is an example of usage:
```python
import gpyfft
import numpy
import numpy.fft
import pyopencl
import pyopencl.array

context = pyopencl.create_some_context()
queue = pyopencl.CommandQueue(context)

a = numpy.zeros((1024, 256, 13), order='F', dtype=numpy.complex64)
a[:, :, :] = numpy.arange(256)[numpy.newaxis, :, numpy.newaxis]
a_gpu = pyopencl.array.to_device(queue, a)
b_gpu = pyopencl.array.zeros_like(a_gpu)
b2_gpu = pyopencl.array.zeros_like(a_gpu)

# notice here we use a sliced array which has a non-zero offset
transform = gpyfft.fft.FFT(context, queue, a_gpu[:, :, 1], axes=(1,),
                           out_array=b_gpu[:, :, 1])

# here we transform using the high-level API (the non-zero offset
# would have failed here before these commits)
(event,) = transform.enqueue()
event.wait()
result1 = b_gpu.get()[:, :, 1]
assert numpy.allclose(result1, numpy.fft.fftn(a[:, :, 1], axes=(1,)))

# here we transform all 13 chunks
# (all but the first chunk have non-zero offsets)
events = [
    transform.enqueue_arrays(data=a_gpu[:, :, i], result=b2_gpu[:, :, i])[0]
    for i in range(13)
]
for event in events:
    event.wait()
result2 = b2_gpu.get()
assert numpy.allclose(result2, numpy.fft.fftn(a, axes=(1,)))
```
Excellent, thanks for your contribution!
Sorry for the raft of pull requests -- these are all modifications I've been using successfully for a while and have found very useful, and would welcome your input.
This set of commits adds `in_offsets` and `out_offsets` to `enqueue_transform()`, which, if specified, trigger the creation of sub-buffers under the hood. For those arrays that include multiple non-transformed axes that can't be "collapsed" into a single axis, this change allows the user to slice the input array into appropriate chunks (using the standard slice operator) and simply send `x.base_data` and `x.offset` as the inputs/outputs -- though only as long as the offset is a multiple of `CL_DEVICE_MEM_BASE_ADDR_ALIGN`. It also allows for transforming arrays that have non-zero offsets. For example, to partially address issue #10 without needing to create copies of the original array or create our own OpenCL sub-buffers:
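A sketch of that usage (assuming the high-level `FFT` object exposes its baked plan as `transform.plan` and that `in_offsets`/`out_offsets` take one byte offset per buffer; exact signatures may differ):

```python
import numpy
import pyopencl
import pyopencl.array
import gpyfft

context = pyopencl.create_some_context()
queue = pyopencl.CommandQueue(context)

a = numpy.zeros((1024, 256, 13), order='F', dtype=numpy.complex64)
a_gpu = pyopencl.array.to_device(queue, a)
b_gpu = pyopencl.array.zeros_like(a_gpu)

# Bake one plan from a representative 2D slice.
transform = gpyfft.fft.FFT(context, queue, a_gpu[:, :, 0], axes=(1,),
                           out_array=b_gpu[:, :, 0])

# Re-use the plan for every chunk by passing the base buffers plus each
# chunk's byte offset; sub-buffers are created under the hood.
for i in range(13):
    x, y = a_gpu[:, :, i], b_gpu[:, :, i]
    transform.plan.enqueue_transform((queue,), (x.base_data,), (y.base_data,),
                                     in_offsets=(x.offset,),
                                     out_offsets=(y.offset,))
```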