GPU: Asynchronous data transfers and kernels - Heterogeneous computation

cguzman95 commented 4 years ago

I will work on this in the branch for #129. This issue documents ideas for asynchronous GPU execution, which would let GPU and CPU computation run simultaneously.

cguzman95 commented 4 years ago

Adding a model_data_id variable to the ModelData struct. This variable will identify each ModelData object created, at least when the GPU flag is ON.

Reason: For asynchronous memcpy between GPU and CPU, it's necessary to define a stream. Example:

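cudaStream_t stream[nStreams];
for (int i = 0; i < nStreams; ++i) {
  checkCuda( cudaStreamCreate(&stream[i]) );
  // N is the copy size in bytes; `a` must be pinned host memory for the copy to be truly asynchronous
  checkCuda( cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream[i]) );
  checkCuda( cudaStreamDestroy(stream[i]) );
}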

So the stream needs to be created in new_solver and destroyed alongside the deallocation of the other structures. This means it has to be declared at the top of the camp_gpu_solver.cu file (or inside ModelData, but for the moment it is global, since I get an error when I try to declare cudaStream_t in a .h file).

But what happens when we have multiple ModelData objects in one execution (for example, the new unit tests, which use both a multi-cell and a one-cell solver)? Both will try to create streams with the same id, which means they will share the same stream.

This might not seem like a problem, since in principle they won't call solve at the same time. But it's possible that in the future we (or the user) will want to split multiple solvers across individual CPU threads (with MPI, for example). If all threads use the same GPU, GPU execution will slow down, since all threads will try to use the same stream.

On top of that, if for some reason the user destroys a solver object, it will also destroy the global streams, so any solver that is still running will crash.

In conclusion: A specific id of ModelData objects is necessary in order to assign different streams for each possible ModelData object.

mattldawson commented 4 years ago

Hi @cguzman95 - I agree it's important to allow multiple instances of the CAMP core to run simultaneously. I've been talking to people at NCAR who are interested in possibly using CAMP once it's ready, and they will require multiple cores to run at the same time on different threads using OpenMP. So we have to make sure there are no global variables (I don't think there are currently) and no Fortran module variables (I think there are only constants right now). I think your original idea of including the cudaStream_t in ModelData is the best design. Trying to internally manage externally generated instances of the CAMP core (and thus ModelData) using ids is going to get too complicated. What is the error you're getting when you try to include the cudaStream_t in ModelData?
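
To make the ownership idea concrete, here is a minimal sketch in plain C of a stream handle that lives inside each ModelData instance. It is only an illustration under stated assumptions: mock_stream_t, mock_stream_create and mock_stream_destroy are hypothetical stand-ins for cudaStream_t, cudaStreamCreate and cudaStreamDestroy so that it builds without the CUDA toolkit, and the ModelData shown here is reduced to the single field under discussion (the real struct and the function names model_data_init/model_data_free are not taken from the CAMP source).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for cudaStream_t, so the sketch builds without CUDA. */
typedef struct { int alive; } mock_stream_t;

/* Heavily reduced, hypothetical ModelData: only the stream field is shown. */
typedef struct {
  mock_stream_t *stream_gpu; /* owned by this instance, never shared */
} ModelData;

/* Stand-in for cudaStreamCreate: returns a fresh handle, or NULL on failure. */
static mock_stream_t *mock_stream_create(void) {
  mock_stream_t *s = malloc(sizeof *s);
  if (s) s->alive = 1;
  return s;
}

/* Stand-in for cudaStreamDestroy. */
static void mock_stream_destroy(mock_stream_t *s) { free(s); }

/* Would be called where the solver is created; returns 0 on success. */
static int model_data_init(ModelData *md) {
  md->stream_gpu = mock_stream_create();
  return md->stream_gpu ? 0 : -1;
}

/* Would be called where the solver's other structures are freed. */
static void model_data_free(ModelData *md) {
  mock_stream_destroy(md->stream_gpu);
  md->stream_gpu = NULL;
}

/* Two solvers side by side: freeing one must not touch the other's stream.
   Returns 0 if the per-instance ownership holds. */
static int two_solvers_demo(void) {
  ModelData multi_cell, one_cell;
  if (model_data_init(&multi_cell) != 0) return -1;
  if (model_data_init(&one_cell) != 0) { model_data_free(&multi_cell); return -1; }

  int distinct = (multi_cell.stream_gpu != one_cell.stream_gpu);
  model_data_free(&multi_cell);
  int survivor_ok = (one_cell.stream_gpu != NULL && one_cell.stream_gpu->alive == 1);

  model_data_free(&one_cell);
  return (distinct && survivor_ok) ? 0 : -1;
}
```

Because no state lives outside the struct, there is nothing for two threads to collide on and no id bookkeeping; each instance creates and destroys exactly the stream it owns.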

cguzman95 commented 4 years ago

Hi,

Yep, I agree with putting cudaStream_t in ModelData (I'm only using a global variable as a temporary "patch" to continue developing). As for the error, just adding these two lines to any .h file:

#include <cuda.h>
cudaStream_t *stream_gpu;

raises the error:

/gpfs/scratch/bsc32/bsc32815/gpupartmc/partmc/src/camp_common.h:198:3: error: unknown type name ‘cudaStream_t’
   cudaStream_t *stream_gpu;

I'm not sure of the reason; maybe we are missing some configuration in CMake?

mattldawson commented 4 years ago

Do you also need #include <cuda_runtime.h>?

cguzman95 commented 4 years ago

Yep, it's compiling fine now
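
For reference, a rough sketch of how the header could end up looking once the runtime header is included. The error came from cudaStream_t being a runtime API type, which <cuda.h> on its own does not declare. Everything named here is an assumption for illustration: CAMP_USE_GPU is a hypothetical macro standing for "the GPU flag is ON", and camp_stream_t and ModelDataHeaderSketch are made-up names, not the actual contents of camp_common.h.

```c
#include <assert.h>
#include <stddef.h>

/* CAMP_USE_GPU is a hypothetical macro name meaning "the GPU flag is ON". */
#ifdef CAMP_USE_GPU
#include <cuda_runtime.h> /* declares cudaStream_t; <cuda.h> alone does not */
typedef cudaStream_t camp_stream_t;
#else
typedef void *camp_stream_t; /* placeholder so CPU-only builds still compile */
#endif

/* Hypothetical excerpt of the struct: only the new field is shown. */
typedef struct {
  camp_stream_t stream_gpu; /* per-instance stream handle */
} ModelDataHeaderSketch;
```

With the typedef behind the flag, translation units built without GPU support still see a complete struct definition and never need the CUDA headers.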

compdyn / partmc

GPU: Asynchronous data transfers and kernels - Heterogeneous computation #132