It seems to be some kind of PyCUDA issue (or an incorrect usage of GPUArray.get_async on my part). The following pure-PyCUDA code demonstrates the same behavior:
import numpy as np
import time
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.gpuarray import GPUArray
import cProfile

shape = (int(0.25 * 1e8),)
stream = cuda.Stream()

def fun():
    a1_dev = GPUArray(shape=shape, dtype=np.float32)
    a1_cpu = np.zeros(shape=shape, dtype=np.float32)
    # get_async() with no ary= allocates and returns a new (pageable) host array
    a1_cpu = a1_dev.get_async(stream=stream)
    stream.synchronize()

cProfile.run("fun()", sort='cumtime')
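For comparison, here is a minimal sketch (my addition, not from the original report) where the destination array is allocated as pinned memory with PyCUDA's pagelocked_empty; with a pageable destination the driver is expected to fall back to a synchronous copy, so this variant should shift the time into stream.synchronize():

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.gpuarray import GPUArray

shape = (int(0.25 * 1e8),)
stream = cuda.Stream()

a1_dev = GPUArray(shape=shape, dtype=np.float32)
# Pinned (page-locked) host memory, which get_async needs in order to
# overlap the transfer with host work
a1_cpu = cuda.pagelocked_empty(shape, np.float32)

a1_dev.get_async(stream=stream, ary=a1_cpu)  # should return immediately
stream.synchronize()  # the copy completes here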
By the way, I am not sure I understand how two concurrent CUDA contexts work for you without explicit pushing/popping. That's interesting.
Also, specifying the target array explicitly leads to the same result:

a1_dev.get_async(stream=stream, ary=a1_cpu)
Hi Bogdan,
Thanks for posting that code. I also tried using the OpenCL API with both the Intel (Xeon CPU) and NVIDIA (GTX 1060) platforms. Interestingly, with the NVIDIA platform asynchronous copies were not working; however, on the Intel OpenCL platform data did get copied asynchronously.
You were right that with the CUDA API I had to add some context pushing to properly release the Threads at the end. Is there any way to reduce this context uncertainty for Reikna users?
Finally, is creating multiple Thread objects the correct way to do asynchronous operations with Reikna? I'm just not sure how to use multiple streams or command queues in a way that abstracts over both OpenCL and CUDA.
If you like, I can post your code on the PyCUDA site and see if we can get a response.
Kind regards, Toby
Would this behavior be due to the fact that the NumPy arrays are probably not being allocated from pinned memory?
Rewriting the code to page-lock the memory solved the problem, but I did have to use a PyCUDA-only function to register the memory.
# Python code to test asynchronous copies
import numpy as np
import time
from reikna import cluda
import cProfile
import mmap
import ctypes
import pycuda.autoinit
import pycuda.driver as drv

api = cluda.cuda_api()

bins = int(0.5 * 1e9)
shape = (bins,)

# Make some threads
platforms = api.get_platforms()
devices = platforms[0].get_devices()
device = devices[0]
print(device)
t1 = api.Thread(device, async_=True)
t2 = api.Thread(device, async_=True)

float_type = np.float32

def alloc_pinned(shape, dtype=float_type):
    # Create a NumPy array whose memory allocation is aligned to pages
    # Number of items
    elements = np.cumprod(shape)[-1]
    # Bytes per item
    itemsize = np.dtype(dtype).itemsize
    # Allocate memory
    buffer = mmap.mmap(-1, elements * itemsize)
    array = np.frombuffer(buffer, dtype).reshape(shape)
    # Page-lock the memory
    array_pinned = drv.register_host_memory(array)
    return array_pinned

@profile
def fun(device):
    a1_dev = t1.array(shape=shape, dtype=float_type)
    a2_dev = t2.array(shape=shape, dtype=float_type)

    # Number of bytes per element
    bytes_element = np.dtype(float_type).itemsize

    a1_cpu = alloc_pinned(shape, float_type)
    a2_cpu = alloc_pinned(shape, float_type)

    # Register the device

    # Fetch the arrays
    t1.from_device(a1_dev, dest=a1_cpu, async_=True)
    t2.from_device(a2_dev, dest=a2_cpu, async_=True)

    # Try some functions from PyCUDA
    a1_dev.get_async(ary=a1_cpu)
    a2_dev.get_async(ary=a2_cpu)

    # Most of the time should be spent here
    t1.synchronize()
    t2.synchronize()

#cProfile.run("fun(device)", sort='cumtime')
fun(device)

t1._context.push()
t1.release()
t2._context.push()
t2.release()
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    47                                           @profile
    48                                           def fun(device):
    49
    50         1       1703.0   1703.0      0.1      a1_dev=t1.array(shape=shape, dtype=float_type)
    51         1       1628.0   1628.0      0.1      a2_dev=t2.array(shape=shape, dtype=float_type)
    52
    53                                               # Number of bytes per element
    54         1          3.0      3.0      0.0      bytes_element=np.dtype(float_type).itemsize
    55
    56         1     651331.0 651331.0     29.7      a1_cpu=alloc_pinned(shape, float_type)
    57         1     763850.0 763850.0     34.8      a2_cpu=alloc_pinned(shape, float_type)
    58
    59                                               # Register the device
    60
    61                                               # Fetch the arrays
    62         1       1513.0   1513.0      0.1      t1.from_device(a1_dev, dest=a1_cpu, async_=True)
    63         1        227.0    227.0      0.0      t2.from_device(a2_dev, dest=a2_cpu, async_=True)
    64
    65                                               # Try some functions from PyCUDA
    66         1        132.0    132.0      0.0      a1_dev.get_async(ary=a1_cpu)
    67         1        123.0    123.0      0.0      a2_dev.get_async(ary=a2_cpu)
    68
    69                                               # Most of the time should be spent here
    70         1     299740.0 299740.0     13.7      t1.synchronize()
    71         1     473120.0 473120.0     21.6      t2.synchronize()
> You were right that with the CUDA API I had to add some context pushing to properly release the Threads at the end. Is there any way to reduce this context uncertainty for Reikna users?

> Finally, is creating multiple Thread objects the correct way to do asynchronous operations with Reikna? I'm just not sure how to use multiple streams or command queues in a way that abstracts over both OpenCL and CUDA.
I don't see any obvious way of managing several contexts conveniently. I guess Reikna could maintain a duplicate stack, check whether the currently used context is the top one, and if not, move that context to the top, or something like that. But to be honest, I can't think of a use case that requires having several contexts in one OS thread.
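A rough sketch of that kind of bookkeeping, using plain PyCUDA calls (hypothetical code, not anything Reikna currently does):

from contextlib import contextmanager
import pycuda.driver as cuda

@contextmanager
def activated(context):
    # Push the context only if it is not already current,
    # and undo the push on exit
    current = cuda.Context.get_current()
    if current is not None and current.handle == context.handle:
        yield
    else:
        context.push()
        try:
            yield
        finally:
            cuda.Context.pop()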
Now, having several streams simultaneously can be useful. You can create a Reikna Thread using an existing context, and it will create a separate stream for itself:

from reikna import cluda
import pycuda.autoinit

api = cluda.cuda_api()
t1 = api.Thread(pycuda.autoinit.context)
t2 = api.Thread(pycuda.autoinit.context)

or

t1 = api.Thread.create()
t2 = api.Thread(t1._context)  # currently undocumented
There may be all kinds of rough edges and uncertainties with this, though, since I haven't really used the multi-stream approach myself.
As for the issue itself, I guess some kind of generic CUDA/OpenCL memory-pinning function could be added (in case OpenCL needs it too).
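For illustration, such a helper might look roughly like this (hypothetical code; the empty_pinned name and the api.get_id()/cluda.cuda_id() dispatch are my assumptions, not an existing Reikna API):

import numpy as np
from reikna import cluda

def empty_pinned(api, shape, dtype):
    # Hypothetical backend-dispatching allocator, not part of Reikna
    if api.get_id() == cluda.cuda_id():
        # CUDA needs page-locked host memory for truly asynchronous copies
        import pycuda.driver as drv
        return drv.pagelocked_empty(shape, dtype)
    # For OpenCL, return a plain array; pinning there would go through
    # buffer mapping (e.g. CL_MEM_ALLOC_HOST_PTR) and is left out here
    return np.empty(shape, dtype)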
Hi there, I'm having problems getting asynchronous copies to work using from_device. Here is some example code; I'm using Reikna 0.7, PyCUDA 2018, and CUDA 9.1 with a GTX 1060 GPU.
If I run the Python line profiler on this, I get the following output:

Line #      Hits         Time  Per Hit   % Time  Line Contents

I'm expecting most of the time to be spent in the synchronize calls; however, the compute time is being spent in the calls to from_device, suggesting that the asynchronous copies are not happening. What can I do to get this to work?
Kind regards, Toby