NCAR / micm

A model-independent chemistry module for atmosphere models
https://ncar.github.io/micm/
Apache License 2.0

Rewrite the assign function in the Rosenbrock solve function for CPU/GPU #471

Closed: sjsprecious closed this issue 3 months ago

sjsprecious commented 4 months ago

According to a discussion with @mattldawson, the assign method used in the Solve function of the Rosenbrock class (e.g., https://github.com/NCAR/micm/blob/main/include/micm/solver/rosenbrock.inl#L257 and https://github.com/NCAR/micm/blob/main/include/micm/solver/rosenbrock.inl#L363) probably does not work on the GPU.
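For illustration, here is a minimal sketch of the suspected failure mode (the struct and member names below are hypothetical placeholders, not micm's actual internals): a host-side std::vector::assign rewrites only host storage, so a device mirror of the same data is left stale.

#include <cuda_runtime.h>
#include <vector>

// Hypothetical container that mirrors its host data in a device buffer.
struct DeviceMirroredVector {
    std::vector<double> host_;
    double* d_data_ = nullptr;  // device-side copy of host_
};

int main() {
    const std::size_t n = 1000;
    DeviceMirroredVector a, b;
    a.host_.assign(n, 1.0);
    b.host_.assign(n, 0.0);
    cudaMalloc(&a.d_data_, n * sizeof(double));
    cudaMalloc(&b.d_data_, n * sizeof(double));
    cudaMemcpy(a.d_data_, a.host_.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(b.d_data_, b.host_.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    // Host-only copy: b.host_ now matches a.host_, but b.d_data_ on the
    // device still holds zeros until it is explicitly re-synchronized.
    b.host_.assign(a.host_.begin(), a.host_.end());

    cudaFree(a.d_data_);
    cudaFree(b.d_data_);
    return 0;
}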

Potential solutions:

Potential questions:

Final decision:

sjsprecious commented 4 months ago

@mattldawson @mwaxmonsky Instead of overloading operator=, we could use cublasDcopy (https://docs.nvidia.com/cuda/archive/10.2/cublas/index.html#cublas-lt-t-gt-copy) for the CUDA code and dcopy (https://netlib.org/lapack/explore-html-3.6.1/de/da4/group__double__blas__level1_ga21cdaae1732dea58194c279fca30126d.html) for the CPU code. I think they will be more performant than our manual implementations here.
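A rough sketch of what those library-based copies could look like (the wrapper functions and parameter layout are hypothetical; only cublasDcopy and cblas_dcopy are the real library calls):

#include <cublas_v2.h>
#include <mkl.h>  // provides cblas_dcopy when linking against MKL

// Device-to-device copy of n doubles through cuBLAS.
void CopyDevice(cublasHandle_t handle, int n, const double* from, double* to)
{
    cublasDcopy(handle, n, from, 1, to, 1);
}

// Host-side copy of n doubles through the BLAS level-1 dcopy.
void CopyHost(MKL_INT n, const double* from, double* to)
{
    cblas_dcopy(n, from, 1, to, 1);
}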

mwaxmonsky commented 4 months ago

@sjsprecious I might be misunderstanding, so my apologies in advance, but I think we would use these as part of the operator= overload? Meaning we would need something like:

CudaVectorMatrix& operator=(const CudaVectorMatrix& other)
{
    // Copy the other matrix's device data into this one.
    cublasDcopy(this->cublas_handle, this->AsVector().size(),
                other.param.d_data, 1, this->param.d_data, 1);
    return *this;
}

or

void CudaVectorMatrix::assign(const CudaVectorMatrix& from)
{
    cublasDcopy(this->cublas_handle, this->AsVector().size(),
                from.param.d_data, 1, this->param.d_data, 1);
}

since the generic API would be mat_b = mat_a or mat_b.assign(mat_a), respectively.

We can definitely look into using a BLAS library for the host-side copy, but I would recommend a C++ implementation like Eigen or MAGMA if we're bringing in a linear algebra package.
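If Eigen were adopted for the host side, the copy could be expressed as a no-copy Map over the existing buffers plus an assignment; a minimal sketch (the Eigen calls are real, the surrounding buffers are placeholders):

#include <Eigen/Dense>
#include <vector>

int main()
{
    const int n = 1000;
    std::vector<double> from(n, 1.0), to(n, 0.0);

    // Map wraps existing memory without copying; the assignment below is
    // then a vectorized copy into to's storage.
    Eigen::Map<const Eigen::VectorXd> src(from.data(), n);
    Eigen::Map<Eigen::VectorXd> dst(to.data(), n);
    dst = src;
    return 0;
}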

sjsprecious commented 4 months ago

Thanks @mwaxmonsky. I think you are right here, and the implementation you provided makes sense to me.

Actually, the BLAS version of copy seems to run much slower than the current assign implementation on the CPU (see the example below; I compiled the code with the Intel compiler via icpc example.cpp -lmkl_rt on Derecho).

#include <iostream>
#include <vector>
#include <chrono>
#include <numeric>
#include <mkl.h>

int main() {
    const std::size_t N = 10000000;  // 1e7 elements
    std::vector<double> x(N), y(N);
    std::iota(x.begin(), x.end(), 1);
    std::iota(y.begin(), y.end(), 1);

    // Time the current approach: std::vector::assign.
    auto start = std::chrono::high_resolution_clock::now();
    y.assign(x.begin(), x.end());
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "y.assign(x.begin(), x.end()): "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;

    // Time the MKL BLAS level-1 copy.
    start = std::chrono::high_resolution_clock::now();
    cblas_dcopy(static_cast<MKL_INT>(x.size()), x.data(), 1, y.data(), 1);
    end = std::chrono::high_resolution_clock::now();
    std::cout << "cblas_dcopy(x.size(), x.data(), 1, y.data(), 1): "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;

    return 0;
}

I got something like the following:

y.assign(x.begin(), x.end()): 4 ms
cblas_dcopy(x.size(), x.data(), 1, y.data(), 1): 50 ms
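Note that MKL may perform one-time initialization on its first call, so a single-shot timing could overstate cblas_dcopy's steady-state cost. A minimal sketch (assuming the same MKL setup as above) that warms up once and averages over repetitions:

#include <iostream>
#include <vector>
#include <chrono>
#include <numeric>
#include <mkl.h>

// Average wall-clock milliseconds of f over iters repetitions.
template <typename F>
double AverageMs(F&& f, int iters)
{
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) f();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count() / iters;
}

int main()
{
    const std::size_t N = 10000000;
    std::vector<double> x(N), y(N);
    std::iota(x.begin(), x.end(), 1);

    // Untimed warm-up so MKL's one-time setup is excluded.
    cblas_dcopy(static_cast<MKL_INT>(N), x.data(), 1, y.data(), 1);

    std::cout << "assign (avg): "
              << AverageMs([&] { y.assign(x.begin(), x.end()); }, 10) << " ms\n";
    std::cout << "cblas_dcopy (avg): "
              << AverageMs([&] { cblas_dcopy(static_cast<MKL_INT>(N), x.data(), 1, y.data(), 1); }, 10)
              << " ms\n";
    return 0;
}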

Regarding the performance of cuBLAS versus cudaMemcpy (device to device), I have the following example (compiled with the NVHPC compiler via nvcc example.cpp -lcublas and run on an A100 GPU on Derecho).

#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 1000000;
    double *x, *y;
    cudaMalloc(&x, N * sizeof(double));
    cudaMalloc(&y, N * sizeof(double));

    // Initialize on the host and copy the data to the device.
    double *h_x = new double[N];
    double *h_y = new double[N];
    for (int i = 0; i < N; i++) {
        h_x[i] = 1.0;
        h_y[i] = 0.0;
    }
    cudaMemcpy(x, h_x, N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(y, h_y, N * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Time the cuBLAS device-to-device copy. cublasDcopy is launched
    // asynchronously, so synchronize before stopping the timer.
    auto start = std::chrono::high_resolution_clock::now();
    cublasDcopy(handle, N, x, 1, y, 1);
    cudaDeviceSynchronize();
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "cublas dcopy time: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us" << std::endl;

    // Time cudaMemcpy (device to device); it can also return before the
    // copy completes, so synchronize here as well.
    start = std::chrono::high_resolution_clock::now();
    cudaMemcpy(y, x, N * sizeof(double), cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();
    end = std::chrono::high_resolution_clock::now();
    std::cout << "cudaMemcpy time: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us" << std::endl;

    cudaFree(x);
    cudaFree(y);
    cublasDestroy(handle);

    delete[] h_x;
    delete[] h_y;
    return 0;
}

And I got the following performance numbers:

cublas dcopy time: 244 us
cudaMemcpy time: 15 us
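Note that the first cublasDcopy call may include one-time cuBLAS initialization, so a fairer comparison would warm up first and time the device work with CUDA events; a minimal sketch under those assumptions:

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int N = 1000000;
    double *x, *y;
    cudaMalloc(&x, N * sizeof(double));
    cudaMalloc(&y, N * sizeof(double));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Untimed warm-up so one-time cuBLAS setup is excluded.
    cublasDcopy(handle, N, x, 1, y, 1);
    cudaDeviceSynchronize();

    // CUDA events measure elapsed time on the device itself.
    cudaEvent_t begin, stop;
    cudaEventCreate(&begin);
    cudaEventCreate(&stop);

    cudaEventRecord(begin);
    cublasDcopy(handle, N, x, 1, y, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, begin, stop);
    printf("warmed-up cublasDcopy: %f ms\n", ms);

    cudaEventDestroy(begin);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(x);
    cudaFree(y);
    return 0;
}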

That is to say, if these two examples are implemented correctly, what we are doing right now is already the most performant approach on both CPU and GPU, and the suggestions I made above do not make sense. Sorry about the confusion.