cupy_pybind2.py has the right direction. But instead of using a global C variable for the device memory, we need a local variable which we can handle in the Python code and pass through the pybind11 interfaces. Did you try to use cupy arrays like numpy arrays together with pybind11? If it works, there is a nice way to handle the memory.
Maybe you should also skip multi-GPU for the moment. First develop a solution for a single GPU and then extend it to multiple GPUs.
I found a nice example in the cupy documentation: https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-kernels
You can take the example and replace the add_kernel variable with your pybind11 binding. If this works, we can extend it with everything we need.
So we use RawKernel with pybind instead of a global kernel, and then use it to allocate memory?
No, I mean, take the following code:
>>> import cupy as cp
>>> add_kernel = cp.RawKernel(r'''
... extern "C" __global__
... void my_add(const float* x1, const float* x2, float* y) {
...     int tid = blockDim.x * blockIdx.x + threadIdx.x;
...     y[tid] = x1[tid] + x2[tid];
... }
... ''', 'my_add')
>>> x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)  # GPU memory allocation from Python
>>> x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
>>> y = cp.zeros((5, 5), dtype=cp.float32)
>>> add_kernel((5,), (5,), (x1, x2, y))  # grid, block and arguments
>>> y
array([[ 0.,  2.,  4.,  6.,  8.],
       [10., 12., 14., 16., 18.],
       [20., 22., 24., 26., 28.],
       [30., 32., 34., 36., 38.],
       [40., 42., 44., 46., 48.]], dtype=float32)
and transform it to:
>>> import cupy as cp
>>> import myPythonBind
>>> x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)  # GPU memory allocation from Python
>>> x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
>>> y = cp.zeros((5, 5), dtype=cp.float32)
>>> myPythonBind(x1, x2, y)  # kernel and kernel launch are written in C++ and have a pybind11 binding
>>> y
array([[ 0.,  2.,  4.,  6.,  8.],
       [10., 12., 14., 16., 18.],
       [20., 22., 24., 26., 28.],
       [30., 32., 34., 36., 38.],
       [40., 42., 44., 46., 48.]], dtype=float32)
so the "kernel and kernel launch are written in C++ and has a pybind11 binding" part is for the partial_update part?
Yes. But in the beginning, you should focus on whether it is possible to pass the cupy array through the pybind11 interface. If I understand it correctly, the cupy array is the central memory object of cupy.
That is the hardest part, because most of the time all I get is a segmentation fault.
But this is the most interesting part. I think I found the reason for the segmentation faults: https://github.com/pybind/pybind11/issues/2694
And here is a workaround: https://stackoverflow.com/questions/66989716/passing-cupy-cuda-device-pointer-to-pybind11
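A minimal sketch of that workaround, assuming a hypothetical pybind11 module myPythonBind with a function my_add that receives the raw device addresses as plain integers (on the C++ side they would arrive as size_t and be reinterpret_cast back to float* before the kernel launch):

import cupy as cp
import myPythonBind  # hypothetical pybind11 module

x1 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
x2 = cp.arange(25, dtype=cp.float32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.float32)

# cupy exposes the raw device address of an array as a Python int via
# ndarray.data.ptr; a plain int passes through pybind11 without problems
myPythonBind.my_add(x1.data.ptr, x2.data.ptr, y.data.ptr, y.size)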
Oh, I mean I always get a segmentation fault when returning the array back to cupy. For example in commit b5a6c778b6ba000e8c208edbdb81cab1446ee8d5 (2 commits before this), in gpu_algo.hpp at line 64.
Regarding the workaround link: you can see in gpu_algo.hpp that I already do something similar every time I receive an array from Python.
In this line of code, I see two problems. The first is that you don't send back an array; it's just a pointer. Can you please run type(gpu_image) to check the Python type? I'm not sure how pointers are represented in Python. Second, you executed a print on GPU memory. This also causes a segmentation fault in a C++ application, because you tried to access GPU memory directly from the CPU.
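As a side note, on the Python side the safe pattern is an explicit device-to-host copy before printing; a small illustration (the names are just examples, not from your code):

import cupy as cp

gpu_image = cp.zeros(16, dtype=cp.float32)
host_copy = cp.asnumpy(gpu_image)  # explicit device-to-host copy
print(host_copy)                   # safe: reads CPU memory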
I checked it again, and the segmentation fault always happens every time I return a CUDA variable.
I think the main problem is that Python does not support pointers. In this post, it is mentioned that a raw pointer is cast to a single value: https://stackoverflow.com/questions/57990269/passing-pointer-to-c-from-python-using-pybind11
I think we need a wrapper object, like we already have with the numpy array on the CPU side. In the post, the class py::buffer was suggested. Writing our own wrapper class is also possible, or using C++ smart pointers, but I would not suggest the latter, because smart pointers have the same problem as raw pointers: they don't carry enough information about the data, such as its length.
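For illustration, such a wrapper would only need to bundle the raw address with the metadata a bare pointer loses; a hypothetical Python-side sketch (not the final design):

import cupy as cp

class DeviceArrayHandle:
    # hypothetical wrapper: raw device address plus the metadata
    # (length, dtype) that a bare pointer does not carry
    def __init__(self, arr: cp.ndarray):
        self.ptr = arr.data.ptr     # raw device address as int
        self.size = arr.size        # number of elements
        self.dtype = str(arr.dtype)

handle = DeviceArrayHandle(cp.zeros(1024, dtype=cp.float32))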
The last commit is the right direction: allocate GPU memory on the Python side and use it on the C++ side.
Only this cast looks ugly: https://github.com/ComputationalRadiationPhysics/student_project_python_bindings/blob/d99003f3bbf8ba911e58e517c283ee3afcde77e5/gpu_memory_management/cupy_pybind.cu#L50 and the Python interface is not so nice, but it works. I think we will find a better solution in the future. At the moment I am checking whether we can implement support for cupy arrays like we have for numpy arrays. In theory, it should be possible.
Something about memory management in cupy, which can be confusing for a C++ developer at first: the lifetime of the memory is bound to the cupy array. If you delete the reference to the object, the memory is unbound, e.g.:
import cupy
z = cupy.zeros(1024*1024*1024)
# delete the array reference and unbind the GPU memory
del z
k = cupy.zeros(1024*1024*1024)
# the first array is implicitly deleted and its GPU memory unbound,
# because k no longer references it
k = 1
But unbound does not mean deleted. nvidia-smi will still show the memory as used. The reason is the memory manager of cupy: instead of calling the cudaFree() function, the manager marks the memory as unused and reuses it later. This is more efficient than calling cudaFree() and cudaMalloc() again. The following line of code forces the memory manager to free all unused memory:
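import cupy
# release all cached blocks held by cupy's default memory pool back to the driver
cupy.get_default_memory_pool().free_all_blocks()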
But in practice, it is not necessary.
I think your next step should be making your example multi-GPU capable, i.e. allocating memory and executing a kernel on a specific GPU selected by its id. In the meantime, I will check the requirements for native cupy array support.
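As a starting point for the multi-GPU part, cupy already lets you select the device from Python with a context manager; a minimal sketch:

import cupy as cp

# allocate the array on GPU 1 instead of the default device 0
with cp.cuda.Device(1):
    x = cp.zeros((5, 5), dtype=cp.float32)

# the array remembers which device it lives on
print(x.device)  # <CUDA Device 1>

The C++ side would then have to call cudaSetDevice() with the matching id before launching the kernel.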
Parts of this PR became part of PR #22. Therefore, it is no longer necessary to merge this PR.
There are problems with using cupy and pycuda.