fjarri / reikna

Pure Python GPGPU library
http://reikna.publicfields.net/
MIT License

How to get asynchronous copies working? #47

Closed: drtpotter closed this issue 5 years ago

drtpotter commented 6 years ago

Hi there, I'm having problems getting asynchronous copies to work using from_device. Here is some example code; I'm using Reikna 0.7, PyCUDA 2018, and CUDA 9.1 with a GTX 1060 GPU.

## Start code ##

# Python code to test asynchronous copies

import numpy as np
from reikna import cluda
api=cluda.cuda_api()
shape=(int(0.25*1e9),)

# Make some threads
platforms=api.get_platforms()
devices=platforms[0].get_devices()
device=devices[0]

t1=api.Thread(device, async_=True)
t2=api.Thread(device, async_=True)

@profile
def fun(device):

    a1_dev=t1.array(shape=shape, dtype=np.float32)
    a2_dev=t2.array(shape=shape, dtype=np.float32)

    a1_cpu=np.zeros(shape=shape, dtype=np.float32)
    a2_cpu=np.zeros(shape=shape, dtype=np.float32)

    # Fetch the arrays
    t1.from_device(a1_dev, dest=a1_cpu, async_=True)
    t2.from_device(a2_dev, dest=a2_cpu, async_=True)

    # Most of the time should be spent here
    t1.synchronize()
    t2.synchronize()

fun(device)

t1.release()
t2.release()

## End code ##

If I run the Python line profiler on this

kernprof -l test_async.py
python -m line_profiler test_async.py.lprof

I get the following output.

Line # Hits Time Per Hit % Time Line Contents

18                                           @profile
19                                           def fun(device):
20                                           
21         1       1506.0   1506.0      0.2      a1_dev=t1.array(shape=shape, dtype=np.float32)
22         1       1427.0   1427.0      0.2      a2_dev=t2.array(shape=shape, dtype=np.float32)
23                                           
24         1         25.0     25.0      0.0      a1_cpu=np.zeros(shape=shape, dtype=np.float32)
25         1         10.0     10.0      0.0      a2_cpu=np.zeros(shape=shape, dtype=np.float32)
26                                           
27                                               # Fetch the arrays
28         1     354205.0 354205.0     50.2      t1.from_device(a1_dev, dest=a1_cpu, async_=True)
29         1     348243.0 348243.0     49.4      t2.from_device(a2_dev, dest=a2_cpu, async_=True)
30                                           
31                                               # Most of the time should be spent here
32         1         32.0     32.0      0.0      t1.synchronize()
33         1          7.0      7.0      0.0      t2.synchronize()

I'm expecting most of the time to be spent in the synchronize calls; instead, the time is being spent in the calls to from_device, which suggests that the copies are not actually asynchronous. What can I do to get this to work?

Kind regards, Toby

fjarri commented 6 years ago

It seems to be some kind of PyCUDA issue (or an incorrect usage of GPUArray.get_async on my part). The following pure PyCUDA code demonstrates the same behavior:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.gpuarray import GPUArray
import cProfile

shape = (int(0.25 * 1e8),)
stream = cuda.Stream()

def fun():
    a1_dev = GPUArray(shape=shape, dtype=np.float32)
    a1_cpu = np.zeros(shape=shape, dtype=np.float32)
    # Note: without ary=, get_async allocates and returns a new host array,
    # so the zeros-allocated buffer above is not actually reused here
    a1_cpu = a1_dev.get_async(stream=stream)
    stream.synchronize()

cProfile.run("fun()", sort='cumtime')

By the way, I am not sure I understand how two concurrent CUDA contexts work for you without explicit pushing and popping. That's interesting.
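
For reference, explicit management of two PyCUDA contexts within one OS thread looks roughly like the sketch below. This is a minimal illustration of the push/pop discipline, not code from this issue; only one context can be current per OS thread at a time.

import pycuda.driver as cuda

cuda.init()
dev = cuda.Device(0)

ctx1 = dev.make_context()   # created and made current
ctx1.pop()                  # deactivate it on this OS thread
ctx2 = dev.make_context()   # second context, now current
ctx2.pop()

ctx1.push()                 # make ctx1 current again
# ... allocate / launch / copy in ctx1 here ...
ctx1.pop()

ctx2.push()
# ... work in ctx2 here ...
ctx2.pop()

ctx1.detach()               # release both contexts when done
ctx2.detach()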

fjarri commented 6 years ago

Also, specifying the array explicitly leads to the same results.

    a1_dev.get_async(stream=stream, ary=a1_cpu)
drtpotter commented 6 years ago

Hi Bogdan,

Thanks for posting that code. I also tried the OpenCL API with both the Intel (Xeon CPU) and NVIDIA (GTX 1060) platforms. Interestingly, asynchronous copies did not work on the NVIDIA platform either, but on the Intel OpenCL platform the data did get copied asynchronously.

You were correct: with the CUDA API I had to add some context pushing to properly release the Threads at the end. Is there any way to reduce this context uncertainty for Reikna users?

Finally, is creating multiple Thread objects the correct way to do asynchronous operations with Reikna? I'm just not sure how to use multiple streams or command queues in a way that abstracts over both OpenCL and CUDA.

If you like, I can post your code on the PyCUDA site and see if we can get a response.

Kind regards, Toby

drtpotter commented 6 years ago

Could this behavior be due to the NumPy arrays not being allocated from pinned memory?
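
For reference, PyCUDA can also allocate pinned host memory directly; a minimal sketch follows (pagelocked_zeros is a PyCUDA allocator, not a Reikna function, and the shape here is arbitrary):

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

# pagelocked_zeros returns a NumPy array backed by page-locked (pinned)
# host memory, which CUDA needs for copies to be truly asynchronous
a_cpu = cuda.pagelocked_zeros(shape=(1024,), dtype=np.float32)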

drtpotter commented 6 years ago

Rewriting the code to page-lock the memory solved the problem, but I did have to use a PyCUDA-only function to register the memory.

# Python code to test asynchronous copies

import numpy as np
from reikna import cluda
import cProfile
import mmap

import pycuda.autoinit
import pycuda.driver as drv

api=cluda.cuda_api()

bins=int(0.5*1e9)
shape=(bins,)

# Make some threads
platforms=api.get_platforms()
devices=platforms[0].get_devices()
device=devices[0]
print(device)

t1=api.Thread(device, async_=True)
t2=api.Thread(device, async_=True)

float_type=np.float32

def alloc_pinned(shape, dtype=float_type):
    # Create a NumPy array whose memory allocation is aligned to page boundaries

    # Number of items
    elements=np.cumprod(shape)[-1]

    # Bytes per item
    itemsize=np.dtype(dtype).itemsize

    # Allocate page-aligned memory
    buffer=mmap.mmap(-1, elements*itemsize)
    array=np.frombuffer(buffer, dtype).reshape(shape)

    # Page-lock (pin) the memory so CUDA can copy to it asynchronously
    array_pinned=drv.register_host_memory(array)

    return array_pinned

@profile
def fun(device):

    a1_dev=t1.array(shape=shape, dtype=float_type)
    a2_dev=t2.array(shape=shape, dtype=float_type)

    # Number of bytes per element
    bytes_element=np.dtype(float_type).itemsize

    a1_cpu=alloc_pinned(shape, float_type)
    a2_cpu=alloc_pinned(shape, float_type)

    # Register the device

    # Fetch the arrays
    t1.from_device(a1_dev, dest=a1_cpu, async_=True)
    t2.from_device(a2_dev, dest=a2_cpu, async_=True)

    # Try some functions from PYCUDA
    a1_dev.get_async(ary=a1_cpu)
    a2_dev.get_async(ary=a2_cpu)

    # Most of the time should be spent here
    t1.synchronize()
    t2.synchronize()

#cProfile.run("fun(device)", sort='cumtime')
fun(device)

t1._context.push()
t1.release()
t2._context.push()
t2.release()

Line # Hits Time Per Hit % Time Line Contents

47                                           @profile
48                                           def fun(device):
49                                           
50         1       1703.0   1703.0      0.1      a1_dev=t1.array(shape=shape, dtype=float_type)
51         1       1628.0   1628.0      0.1      a2_dev=t2.array(shape=shape, dtype=float_type)
52                                           
53                                               # Number of bytes per element
54         1          3.0      3.0      0.0      bytes_element=np.dtype(float_type).itemsize
55                                           
56         1     651331.0 651331.0     29.7      a1_cpu=alloc_pinned(shape, float_type)
57         1     763850.0 763850.0     34.8      a2_cpu=alloc_pinned(shape, float_type)
58                                           
59                                               # Register the device
60                                           
61                                               # Fetch the arrays
62         1       1513.0   1513.0      0.1      t1.from_device(a1_dev, dest=a1_cpu, async_=True)
63         1        227.0    227.0      0.0      t2.from_device(a2_dev, dest=a2_cpu, async_=True)
64                                           
65                                               # Try some functions from PYCUDA
66         1        132.0    132.0      0.0      a1_dev.get_async(ary=a1_cpu)
67         1        123.0    123.0      0.0      a2_dev.get_async(ary=a2_cpu)
68                                           
69                                               # Most of the time should be spent here
70         1     299740.0 299740.0     13.7      t1.synchronize()
71         1     473120.0 473120.0     21.6      t2.synchronize()
fjarri commented 6 years ago

> You were correct: with the CUDA API I had to add some context pushing to properly release the Threads at the end. Is there any way to reduce this context uncertainty for Reikna users?
>
> Finally, is creating multiple Thread objects the correct way to do asynchronous operations with Reikna? I'm just not sure how to use multiple streams or command queues in a way that abstracts over both OpenCL and CUDA.

I don't see any obvious solution to managing several contexts in a convenient way. I guess Reikna could maintain a duplicate stack, check if the currently used context is the top one, and if not, move the context to the top, or something like that. But to be honest, I can't think of a use case that requires having several contexts in one OS thread.
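
One possible convenience (purely a hypothetical sketch, not an existing Reikna API) would be a small context-manager helper that makes a given context current for the duration of a block:

from contextlib import contextmanager

@contextmanager
def activated(ctx):
    # Make ctx current on this OS thread, then deactivate it again afterwards
    ctx.push()
    try:
        yield ctx
    finally:
        ctx.pop()

# hypothetical usage:
# with activated(ctx1):
#     ... launch kernels or copies in ctx1 ...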

Having several streams simultaneously, on the other hand, can be useful. You can create a Reikna Thread using an existing context, and it will create a separate stream for itself:

import pycuda.autoinit
from reikna import cluda

api = cluda.cuda_api()
t1 = api.Thread(pycuda.autoinit.context)
t2 = api.Thread(pycuda.autoinit.context)

or

t1 = api.Thread.create()
t2 = api.Thread(t1._context) # currently undocumented

There may be all kinds of rough edges and uncertainties with this, since I haven't really used the multi-stream approach myself.
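
Putting the pieces of this thread together, a minimal multi-stream sketch based on the suggestion above (assuming the Reikna 0.7 API used earlier, with PyCUDA's pagelocked_zeros providing the pinned host buffers) might look like this:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from reikna import cluda

api = cluda.cuda_api()

# Both Threads share the context created by pycuda.autoinit,
# so each should get its own stream within the same context
t1 = api.Thread(pycuda.autoinit.context, async_=True)
t2 = api.Thread(pycuda.autoinit.context, async_=True)

shape = (int(1e7),)
a1_dev = t1.array(shape=shape, dtype=np.float32)
a2_dev = t2.array(shape=shape, dtype=np.float32)

# Pinned host buffers so that the two copies can actually overlap
a1_cpu = drv.pagelocked_zeros(shape, np.float32)
a2_cpu = drv.pagelocked_zeros(shape, np.float32)

t1.from_device(a1_dev, dest=a1_cpu, async_=True)
t2.from_device(a2_dev, dest=a2_cpu, async_=True)

t1.synchronize()
t2.synchronize()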

As for the issue itself, I guess some kind of generic CUDA/OpenCL memory pinning function could be added (in case OpenCL needs it too).
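
Such a helper might look roughly like the following sketch (alloc_pinned_host and its use_cuda flag are hypothetical names, not part of Reikna):

import numpy as np

def alloc_pinned_host(shape, dtype, use_cuda=True):
    # Allocate a host array suitable for asynchronous device-to-host copies
    if use_cuda:
        import pycuda.driver as drv
        # Page-locked allocation through PyCUDA
        return drv.pagelocked_zeros(shape, dtype)
    # For OpenCL, a plain NumPy array is used here as a placeholder;
    # a real implementation might allocate a pinned CL buffer instead
    return np.zeros(shape, dtype)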