ComputationalRadiationPhysics / student_project_python_bindings

The student project investigates the performance and memory handling of Python bindings for CUDA C++ code created with pybind11.
GNU General Public License v3.0

experiment with cupy and pycuda #15

Closed afif-ishamsyah closed 3 years ago

afif-ishamsyah commented 3 years ago

There are problems with using cupy and pycuda.

SimeonEhrig commented 3 years ago

cupy_pybind2.py goes in the right direction. But instead of using a global C variable for the device memory, we need a local variable that we can handle in the Python code and pass through the pybind11 interfaces. Did you try to use cupy arrays like numpy arrays together with pybind11? If that works, there is a nice way to handle the memory (the CPU-side numpy pattern is sketched below for comparison).
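
For comparison, a generic sketch (not project code) of how a numpy array already crosses the pybind11 boundary on the CPU side via the standard py::array_t type; the open question is whether a cupy array can be passed in a similarly convenient way.

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// double every element of a 1D numpy array, in place
void scale(py::array_t<float> a)
{
    auto buf = a.mutable_unchecked<1>();
    for (py::ssize_t i = 0; i < buf.shape(0); ++i)
        buf(i) *= 2.0f;
}

PYBIND11_MODULE(cpuExample, m)  // module name made up for illustration
{
    m.def("scale", &scale);
}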

Maybe you should also skip multi-GPU for the moment. First develop a solution for a single GPU and then extend it to multiple GPUs.

I found a nice example in the cupy documentation: https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-kernels You can take the example and replace the add_kernel variable with your pybind11 binding. If this works, we can extend it with everything we need.

afif-ishamsyah commented 3 years ago

So we use RawKernel with pybind11 instead of a global kernel, and then use it to allocate memory?

SimeonEhrig commented 3 years ago

No, I mean, take the following code:

>>> import cupy as cp
>>> add_kernel = cp.RawKernel(r'''
... extern "C" __global__
... void my_add(const float* x1, const float* x2, float* y) {
...     int tid = blockDim.x * blockIdx.x + threadIdx.x;
...     y[tid] = x1[tid] + x2[tid];
... }
... ''', 'my_add')
>>> x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5) # gpu memory allocation with python
>>> x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
>>> y = cp.zeros((5, 5), dtype=cp.float32)
>>> add_kernel((5,), (5,), (x1, x2, y))  # grid, block and arguments
>>> y
array([[ 0.,  2.,  4.,  6.,  8.],
       [10., 12., 14., 16., 18.],
       [20., 22., 24., 26., 28.],
       [30., 32., 34., 36., 38.],
       [40., 42., 44., 46., 48.]], dtype=float32)

and transform it to:

>>> import cupy as cp
>>> import myPythonBind
>>> x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5) # gpu memory allocation with python
>>> x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
>>> y = cp.zeros((5, 5), dtype=cp.float32)
>>> myPythonBind(x1, x2, y) # kernel and kernel launch are written in C++ and have a pybind11 binding
>>> y
array([[ 0.,  2.,  4.,  6.,  8.],
       [10., 12., 14., 16., 18.],
       [20., 22., 24., 26., 28.],
       [30., 32., 34., 36., 38.],
       [40., 42., 44., 46., 48.]], dtype=float32)
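
A rough sketch (not code from this repository) of what the C++ side of the hypothetical myPythonBind could look like: the kernel and its launch are ordinary CUDA C++, compiled with nvcc. How the cupy arrays get across the pybind11 boundary is exactly the question discussed below.

__global__ void my_add(const float* x1, const float* x2, float* y)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    y[tid] = x1[tid] + x2[tid];
}

// host-side launcher; expects device pointers to memory that cupy has allocated
void my_add_launch(const float* x1, const float* x2, float* y, int n)
{
    // one thread per element; assumes n is a multiple of the block size,
    // as in the 5 x 5 example above
    int const block = 5;
    my_add<<<n / block, block>>>(x1, x2, y);
}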

afif-ishamsyah commented 3 years ago

so the "kernel and kernel launch are written in C++ and has a pybind11 binding" part is for the partial_update part?

SimeonEhrig commented 3 years ago

Yes. But in the beginning, you should focus on whether it is possible to pass the cupy array through the pybind11 interface. If I understand it correctly, the cupy array is the central memory object of cupy.

afif-ishamsyah commented 3 years ago

That is the hardest part, because most of the time all I get is a segmentation fault.

SimeonEhrig commented 3 years ago

But this is the most interesting part. I think I found the reason for the segmentation faults: https://github.com/pybind/pybind11/issues/2694

And here is a workaround: https://stackoverflow.com/questions/66989716/passing-cupy-cuda-device-pointer-to-pybind11
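
To make the workaround concrete, here is a minimal sketch of the pattern (function and module names are made up for illustration): the Python side passes the integer device pointer cupy_array.data.ptr instead of the array itself, and the C++ side casts the integer back to a device pointer.

#include <cstdint>
#include <pybind11/pybind11.h>

// launcher from the sketch further above
void my_add_launch(const float* x1, const float* x2, float* y, int n);

void my_add_wrapper(std::uintptr_t x1, std::uintptr_t x2, std::uintptr_t y, int n)
{
    // reinterpret the plain integers received from Python as CUDA device pointers
    my_add_launch(reinterpret_cast<const float*>(x1),
                  reinterpret_cast<const float*>(x2),
                  reinterpret_cast<float*>(y),
                  n);
}

PYBIND11_MODULE(myPythonBind, m)
{
    // called from Python as:
    //   myPythonBind.my_add(x1.data.ptr, x2.data.ptr, y.data.ptr, y.size)
    m.def("my_add", &my_add_wrapper);
}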

afif-ishamsyah commented 3 years ago

Oh, I mean I always get a segmentation fault when returning the array back to cupy. For example, in commit b5a6c778b6ba000e8c208edbdb81cab1446ee8d5 (2 commits before this one), in gpu_algo.hpp at line 64.

Regarding the workaround link: you can see in gpu_algo.hpp that I already do something similar every time I receive an array from Python.

SimeonEhrig commented 3 years ago

Oh, I mean I always get a segmentation fault when returning the array back to cupy. For example, in commit b5a6c77 (2 commits before this one), in gpu_algo.hpp at line 64.

In this line of code, I see two problems. The first is that you don't send back an array, just a pointer. Can you please run type(gpu_image) to check the Python type? I'm not sure how pointers are represented in Python. Second, you executed a print on GPU memory. This would also cause a segmentation fault in a C++ application, because you tried to access GPU memory directly from the CPU.
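
For reference, a generic CUDA sketch (not the project's code) of the safe pattern: copy the device buffer to the host first, then print it from the CPU.

#include <cstddef>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

void print_device_buffer(const float* device_ptr, std::size_t n)
{
    std::vector<float> host(n);
    // copy the GPU memory to the CPU before touching it
    cudaMemcpy(host.data(), device_ptr, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (float v : host)
        std::printf("%f ", v);
    std::printf("\n");
}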

afif-ishamsyah commented 3 years ago

I checked it again, and a segmentation fault happens every time I return a CUDA variable.

SimeonEhrig commented 3 years ago

I think the main problem is that Python does not support pointers. In this post, it is mentioned that a raw pointer is cast to a single value: https://stackoverflow.com/questions/57990269/passing-pointer-to-c-from-python-using-pybind11

I think we need a wrapper object, like we already have with the numpy array on the CPU side. In the post, the class py::buffer was suggested. Writing our own wrapper class is also possible, or using C++ smart pointers, but I would not suggest smart pointers, because they have the same problem as raw pointers: we don't have enough information about the data, such as the length. A sketch of what such a wrapper could look like follows below.
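
One possible shape of such a wrapper, as a sketch (this is not the project's actual class, and all names are made up): a small C++ object that keeps the device pointer together with the length and is exposed to Python via pybind11.

#include <cstddef>
#include <cstdint>
#include <pybind11/pybind11.h>

namespace py = pybind11;

struct GpuArray
{
    float* data;       // raw CUDA device pointer
    std::size_t size;  // number of elements, so the C++ side always knows the length

    GpuArray(std::uintptr_t ptr, std::size_t n)
        : data(reinterpret_cast<float*>(ptr)), size(n)
    {
    }
};

PYBIND11_MODULE(gpuWrapper, m)
{
    // Python: g = gpuWrapper.GpuArray(x.data.ptr, x.size)
    py::class_<GpuArray>(m, "GpuArray")
        .def(py::init<std::uintptr_t, std::size_t>())
        .def_readonly("size", &GpuArray::size);
    // a kernel launcher can now take GpuArray arguments and has both the
    // device pointer and the length available
}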

SimeonEhrig commented 3 years ago

The last commit goes in the right direction: allocate GPU memory on the Python side and use it on the C++ side.

Only this cast looks ugly: https://github.com/ComputationalRadiationPhysics/student_project_python_bindings/blob/d99003f3bbf8ba911e58e517c283ee3afcde77e5/gpu_memory_management/cupy_pybind.cu#L50, and the Python interface is not so nice, but it works. I think we will find a better solution in the future. At the moment I am checking whether we can implement support for cupy arrays like numpy arrays. In theory, it should be possible.

Something about memory management in cupy, because it can be confusing for a C++ developer at first: the lifetime of the memory is bound to the cupy array. If you delete the reference to the object, the memory is unbound, e.g.:

import cupy

z = cupy.zeros(1024*1024*1024)
# delete the array and unbind the GPU memory
del z
k = cupy.zeros(1024*1024*1024)
# the zeros array is implicitly deleted and its GPU memory unbound, because k no longer references it
k = 1

But unbound does not mean deleted: nvidia-smi will still show the memory as used. The reason is the memory manager of cupy. Instead of calling cudaFree(), the manager marks the memory as unused and reuses it later. This is more efficient than calling cudaFree() and cudaMalloc() again. The following line of code only forces the memory manager to free all unused memory:

https://github.com/ComputationalRadiationPhysics/student_project_python_bindings/blob/d99003f3bbf8ba911e58e517c283ee3afcde77e5/gpu_memory_management/cupy_pybind4.py#L11

But in practice, it is not necessary.

I think your next step should be making your example multi-GPU capable, meaning allocate memory and execute a kernel on a specific GPU selected by its ID. In the meantime, I will check the requirements for native cupy array support.

SimeonEhrig commented 3 years ago

Parts of this PR became part of PR #22. Therefore, it is no longer necessary to merge this PR.