ComputationalRadiationPhysics / student_project_python_bindings

The student project investigates the performance and memory handling of Python bindings for CUDA C++ code created with pybind11.
GNU General Public License v3.0

Implementation of a prototype GPU memory manager for sharing GPU memory between Python and C++ #13

Closed SimeonEhrig closed 3 years ago

SimeonEhrig commented 3 years ago

Short description

Implement a GPU memory manager that allows allocating memory on the GPU and copying memory to and from the GPU via a Python API, and that allows passing the GPU memory as an argument to C++ CUDA code via a pybind11 Python binding.

Task

You should implement the marked functions in the following code skeleton. The example is a mixture of real code and pseudocode and is not entirely correct. The application starts with main.py. The pybind11 boilerplate code is omitted.

main.py

import cupy as cp
import numpy as np

def alloc_gpu_memory(size):
  # TODO: implement

def copy_to_gpu(src, dest, size):
  # TODO: implement

def copy_from_gpu(src, dest, size):
  # TODO: implement

def get_the_images():
  # each image is represented by a single value
  return np.arange(40)

def stitching(update, partial_updates):
  for i in range(4):
    update[i] = partial_updates[i][0] + partial_updates[i][1] + partial_updates[i][2] - partial_updates[i][3]

def main():
  # input data on CPU
  images = get_the_images()
  number_of_images = 40
  # generate array of size [4][10]
  parted_images = split_images(images, 4)
  # is a numpy array of 4 random values
  update = random()

  # allocate memory for the GPU, size [4]
  gpu_update_memory = alloc_gpu_memory(4)

  # CPU array of size [4][4] for the partial results
  partial_update[4][4]

  # distribute the data over 4 GPUs
  for i in range(4):
    # size [4][10]
    gpu_image_memory[i] = alloc_gpu_memory(number_of_images/4)
    # size [4][4]
    gpu_partial_update_memory[i] = alloc_gpu_memory(4)

  # src, dest, size
  copy_to_gpu(update, gpu_update_memory, 4)

  for i in range(4):
    copy_to_gpu(parted_images[i], gpu_image_memory[i], 10)

  for k in range(iteration):
    for i in range(4):
      algorithm_binding_of_fn_algo(gpu_image_memory[i], gpu_update_memory,
                                   gpu_partial_update_memory[i], number_of_images/4)
      copy_from_gpu(gpu_partial_update_memory[i], partial_update[i], 4)

    stitching(update, partial_update)
    copy_to_gpu(update, gpu_update_memory, 4)
__global__ void kernel(float *image, float *update, float *partial_update, int number_of_images){
  // some calculation to simulate an algorithm
  for(int k = 0; k < 4; ++k){
    for(int i = 0; i < number_of_images; ++i){
      partial_update[k] = image[i] + update[k];
    }
  }
}

py::array<float> algo(py::array<float> images, py::array<float> complete_update,
                      py::array<float> partial_update, int number_of_images){
  kernel<<<1,1>>>(get_ptr(images), get_ptr(complete_update), get_ptr(partial_update), number_of_images);
}
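To make the skeleton above concrete, here is a CPU-only sketch of the intended allocation and copy flow. numpy arrays stand in for the GPU buffers; with cupy, `cp.zeros` would allocate device memory and the copies would go through `cp.asarray` / `cp.asnumpy`. The helper `split_images`, which the skeleton leaves undefined, is filled in here as an assumption.

```python
import numpy as np

# CPU-only sketch: numpy arrays stand in for GPU buffers.
# With cupy, alloc_gpu_memory would call cp.zeros(size, dtype=cp.float32)
# and the copy helpers would transfer between host and device.

def alloc_gpu_memory(size):
    return np.zeros(int(size), dtype=np.float32)

def copy_to_gpu(src, dest, size):
    dest[:size] = src[:size]

def copy_from_gpu(src, dest, size):
    dest[:size] = src[:size]

def split_images(images, parts):
    # assumed implementation: split the flat image array into equal chunks
    return np.split(np.asarray(images, dtype=np.float32), parts)

images = np.arange(40, dtype=np.float32)
parted_images = split_images(images, 4)            # 4 chunks of 10 values
update = np.random.rand(4).astype(np.float32)

gpu_update_memory = alloc_gpu_memory(4)
gpu_image_memory = [alloc_gpu_memory(10) for _ in range(4)]
gpu_partial_update_memory = [alloc_gpu_memory(4) for _ in range(4)]

copy_to_gpu(update, gpu_update_memory, 4)
for i in range(4):
    copy_to_gpu(parted_images[i], gpu_image_memory[i], 10)
```

This only mirrors the data movement; the kernel call itself would go through the pybind11 binding.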

I suggest trying cupy first. However, feel free to use a different library, implement your own Python bindings, or change the function APIs. Only the following properties must be met:

afif-ishamsyah commented 3 years ago

Do I need to create a new repository?

afif-ishamsyah commented 3 years ago

Hi, I don't understand what this part does:

partial_update[4][4]

# distribute the data over 4 GPUs
for i in range(4):
  # size [4][10]
  gpu_image_memory[i] = alloc_gpu_memory(number_of_images/4)
  # size [4][4]
  gpu_partial_update_memory[i] = alloc_gpu_memory(4)
SimeonEhrig commented 3 years ago

Do I need to create a new repository?

No, a new folder is enough.

SimeonEhrig commented 3 years ago

Hi, I don't understand what this part does:

partial_update[4][4]

# distribute the data over 4 GPUs
for i in range(4):
  # size [4][10]
  gpu_image_memory[i] = alloc_gpu_memory(number_of_images/4)
  # size [4][4]
  gpu_partial_update_memory[i] = alloc_gpu_memory(4)

This part allocates GPU memory 4 times, each time for another GPU. That means the first iteration allocates buffers of size number_of_images/4 and 4 on the first GPU, the second iteration allocates the same sizes on the second GPU, and so on.
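In Python, the same per-GPU allocation can be sketched as follows (CPU-only, with numpy arrays standing in for device buffers; with cupy one would wrap each iteration in `with cp.cuda.Device(i):` so that buffer i lands on device i):

```python
import numpy as np

number_of_images = 40

# one entry per GPU
gpu_image_memory = {}
gpu_partial_update_memory = {}
for i in range(4):
    # with cupy: `with cp.cuda.Device(i): gpu_image_memory[i] = cp.empty(...)`
    gpu_image_memory[i] = np.empty(number_of_images // 4, dtype=np.float32)  # 10 values
    gpu_partial_update_memory[i] = np.empty(4, dtype=np.float32)             # 4 values
```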

afif-ishamsyah commented 3 years ago

I still don't understand. What is the difference between these 3 parts?

# allocate memory for the GPU, size [4]
  gpu_update_memory = alloc_gpu_memory(4)

# size [4][10]
  gpu_image_memory[i] = alloc_gpu_memory(number_of_images/4)

# size [4][4]
gpu_partial_update_memory[i] = alloc_gpu_memory(4)
SimeonEhrig commented 3 years ago

I still don't understand. What is the difference between these 3 parts?

# allocate memory for the GPU, size [4]
  gpu_update_memory = alloc_gpu_memory(4)

# size [4][10]
  gpu_image_memory[i] = alloc_gpu_memory(number_of_images/4)

# size [4][4]
gpu_partial_update_memory[i] = alloc_gpu_memory(4)

Those are three different memory variables. In CUDA C++, it would be something like

float* gpu_image_memory[4];
float* gpu_partial_update_memory[4];

for(int i = 0; i < 4; ++i){
  cudaMalloc(&gpu_image_memory[i], (number_of_images/4)*sizeof(float));
  cudaMalloc(&gpu_partial_update_memory[i], 4*sizeof(float));
}

and gpu_update_memory is a little bit special: if we wanted to be fully consistent, we would have to write gpu_update_memory[4][1]. It is simply one scalar value per GPU, so we can skip the extra dimension.
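The shape difference can be seen directly with numpy as a stand-in:

```python
import numpy as np

# one buffer of size number_of_images/4 per GPU -> conceptually [4][10]
gpu_image_memory = [np.empty(10, dtype=np.float32) for _ in range(4)]

# one scalar per GPU: a flat [4] array instead of the "consistent" [4][1]
gpu_update_memory = np.zeros(4, dtype=np.float32)
consistent_view = gpu_update_memory.reshape(4, 1)  # same data, extra dimension
```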

afif-ishamsyah commented 3 years ago

I am not sure cupy is capable of that. I will look into this and other solutions

SimeonEhrig commented 3 years ago

That's possible. Please have a look at whether we need to change the behavior. It is also possible to allocate a single block of memory and work with offsets:

float* gpu_image_memory;
cudaMalloc(&gpu_image_memory, 4*(number_of_images/4)*sizeof(float));
//...

for(int i = 0; i < 4; ++i){
  // arguments: start address, end address
  do_stuff(gpu_image_memory + (i * (number_of_images/4)),
           gpu_image_memory + ((i+1) * (number_of_images/4)));
}
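The same offset scheme translates to Python slicing, where each slice is a view into the single allocation rather than a copy (cupy slices behave the same way on device memory):

```python
import numpy as np

number_of_images = 40
chunk = number_of_images // 4

# one contiguous allocation (with cupy: cp.zeros on the GPU)
gpu_image_memory = np.zeros(4 * chunk, dtype=np.float32)

for i in range(4):
    part = gpu_image_memory[i * chunk:(i + 1) * chunk]  # view, no copy
    part[:] = i  # writing through the view updates the shared buffer
```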
SimeonEhrig commented 3 years ago

Done in #22