NCAR / micm

A model-independent chemistry module for atmosphere models
https://ncar.github.io/micm/
Apache License 2.0

Rewrite the assign function in the Rosenbrock solve function for CPU/GPU #471

Closed: sjsprecious closed this issue 3 months ago

sjsprecious commented 4 months ago

According to a discussion with @mattldawson, the assign method used in the Solve function of the Rosenbrock class (e.g., https://github.com/NCAR/micm/blob/main/include/micm/solver/rosenbrock.inl#L257 and https://github.com/NCAR/micm/blob/main/include/micm/solver/rosenbrock.inl#L363) probably does not work on the GPU.
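For illustration, here is a minimal sketch of the suspected failure mode (the struct and member names below are hypothetical placeholders, not micm's actual internals): a host-side std::vector::assign rewrites only host storage, so a device mirror of the same data is left stale.

#include <cuda_runtime.h>
#include <vector>

// Hypothetical container that mirrors its host data in a device buffer.
struct DeviceMirroredVector {
    std::vector<double> host_;
    double* d_data_ = nullptr;  // device-side copy of host_
};

int main() {
    const std::size_t n = 1000;
    DeviceMirroredVector a, b;
    a.host_.assign(n, 1.0);
    b.host_.assign(n, 0.0);
    cudaMalloc(&a.d_data_, n * sizeof(double));
    cudaMalloc(&b.d_data_, n * sizeof(double));
    cudaMemcpy(a.d_data_, a.host_.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(b.d_data_, b.host_.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    // Host-only copy: b.host_ now matches a.host_, but b.d_data_ on the
    // device still holds zeros until it is explicitly re-synchronized.
    b.host_.assign(a.host_.begin(), a.host_.end());

    cudaFree(a.d_data_);
    cudaFree(b.d_data_);
    return 0;
}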

Potential solutions:

Potential questions:

Final decision:

sjsprecious commented 4 months ago

@mattldawson @mwaxmonsky Instead of overloading operator=, we could use cublasDcopy (https://docs.nvidia.com/cuda/archive/10.2/cublas/index.html#cublas-lt-t-gt-copy) for the CUDA code and dcopy (https://netlib.org/lapack/explore-html-3.6.1/de/da4/group__double__blas__level1_ga21cdaae1732dea58194c279fca30126d.html) for the CPU code. I think they will be more performant than our manual implementations here.
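A rough sketch of what those library-based copies could look like (the wrapper functions and parameter layout are hypothetical; only cublasDcopy and cblas_dcopy are the real library calls):

#include <cublas_v2.h>
#include <mkl.h>  // provides cblas_dcopy when linking against MKL

// Device-to-device copy of n doubles through cuBLAS.
void CopyDevice(cublasHandle_t handle, int n, const double* from, double* to)
{
    cublasDcopy(handle, n, from, 1, to, 1);
}

// Host-side copy of n doubles through the BLAS level-1 dcopy.
void CopyHost(MKL_INT n, const double* from, double* to)
{
    cblas_dcopy(n, from, 1, to, 1);
}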

mwaxmonsky commented 4 months ago

@sjsprecious I might be misunderstanding, so my apologies in advance, but I think we would use these as part of the operator= overload? Meaning we would need something like:

CudaVectorMatrix& operator=(const CudaVectorMatrix& other)
{
    // Copy the other matrix's device data into this one.
    cublasDcopy(this->cublas_handle, this->AsVector().size(),
                other.param.d_data, 1, this->param.d_data, 1);
    return *this;
}

or

void CudaVectorMatrix::assign(const CudaVectorMatrix& from)
{
    cublasDcopy(this->cublas_handle, this->AsVector().size(),
                from.param.d_data, 1, this->param.d_data, 1);
}

since the generic API would be mat_b = mat_a or mat_b.assign(mat_a), respectively.

We can definitely look into using a BLAS library for the host-side copy, but I would recommend a C++ implementation like Eigen or MAGMA if we're bringing in a linear algebra package.
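If Eigen were adopted for the host side, the copy could be expressed as a no-copy Map over the existing buffers plus an assignment; a minimal sketch (the Eigen calls are real, the surrounding buffers are placeholders):

#include <Eigen/Dense>
#include <vector>

int main()
{
    const int n = 1000;
    std::vector<double> from(n, 1.0), to(n, 0.0);

    // Map wraps existing memory without copying; the assignment below is
    // then a vectorized copy into to's storage.
    Eigen::Map<const Eigen::VectorXd> src(from.data(), n);
    Eigen::Map<Eigen::VectorXd> dst(to.data(), n);
    dst = src;
    return 0;
}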

sjsprecious commented 4 months ago

Thanks @mwaxmonsky. I think you are right here, and the implementation you provided makes sense to me.

Actually, the BLAS version of copy seems to run much slower than the current assign implementation on the CPU (see the example below; I compiled the code with the Intel compiler via icpc example.cpp -lmkl_rt on Derecho).

#include <iostream>
#include <vector>
#include <chrono>
#include <numeric>
#include <mkl.h>

int main() {
    const std::size_t N = 10000000;  // 1e7 elements
    std::vector<double> x(N), y(N);
    std::iota(x.begin(), x.end(), 1);
    std::iota(y.begin(), y.end(), 1);

    // Time the current approach: std::vector::assign.
    auto start = std::chrono::high_resolution_clock::now();
    y.assign(x.begin(), x.end());
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "y.assign(x.begin(), x.end()): "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;

    // Time the MKL BLAS level-1 copy.
    start = std::chrono::high_resolution_clock::now();
    cblas_dcopy(static_cast<MKL_INT>(x.size()), x.data(), 1, y.data(), 1);
    end = std::chrono::high_resolution_clock::now();
    std::cout << "cblas_dcopy(x.size(), x.data(), 1, y.data(), 1): "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;

    return 0;
}

I got something like the following:

y.assign(x.begin(), x.end()): 4 ms
cblas_dcopy(x.size(), x.data(), 1, y.data(), 1): 50 ms
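Note that MKL may perform one-time initialization on its first call, so a single-shot timing could overstate cblas_dcopy's steady-state cost. A minimal sketch (assuming the same MKL setup as above) that warms up once and averages over repetitions:

#include <iostream>
#include <vector>
#include <chrono>
#include <numeric>
#include <mkl.h>

// Average wall-clock milliseconds of f over iters repetitions.
template <typename F>
double AverageMs(F&& f, int iters)
{
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) f();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count() / iters;
}

int main()
{
    const std::size_t N = 10000000;
    std::vector<double> x(N), y(N);
    std::iota(x.begin(), x.end(), 1);

    // Untimed warm-up so MKL's one-time setup is excluded.
    cblas_dcopy(static_cast<MKL_INT>(N), x.data(), 1, y.data(), 1);

    std::cout << "assign (avg): "
              << AverageMs([&] { y.assign(x.begin(), x.end()); }, 10) << " ms\n";
    std::cout << "cblas_dcopy (avg): "
              << AverageMs([&] { cblas_dcopy(static_cast<MKL_INT>(N), x.data(), 1, y.data(), 1); }, 10)
              << " ms\n";
    return 0;
}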

Regarding the performance of cuBLAS versus cudaMemcpy (device to device), I have the following example (compiled with the NVHPC compiler via nvcc example.cpp -lcublas and run on an A100 GPU on Derecho).

#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 1000000;
    double *x, *y;
    cudaMalloc(&x, N * sizeof(double));
    cudaMalloc(&y, N * sizeof(double));

    // Initialize on the host and copy the data to the device.
    double *h_x = new double[N];
    double *h_y = new double[N];
    for (int i = 0; i < N; i++) {
        h_x[i] = 1.0;
        h_y[i] = 0.0;
    }
    cudaMemcpy(x, h_x, N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(y, h_y, N * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Time the cuBLAS device-to-device copy. cublasDcopy is launched
    // asynchronously, so synchronize before stopping the timer.
    auto start = std::chrono::high_resolution_clock::now();
    cublasDcopy(handle, N, x, 1, y, 1);
    cudaDeviceSynchronize();
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "cublas dcopy time: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us" << std::endl;

    // Time cudaMemcpy (device to device); it can also return before the
    // copy completes, so synchronize here as well.
    start = std::chrono::high_resolution_clock::now();
    cudaMemcpy(y, x, N * sizeof(double), cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();
    end = std::chrono::high_resolution_clock::now();
    std::cout << "cudaMemcpy time: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us" << std::endl;

    cudaFree(x);
    cudaFree(y);
    cublasDestroy(handle);

    delete[] h_x;
    delete[] h_y;
    return 0;
}

And I got the following performance numbers:

cublas dcopy time: 244 us
cudaMemcpy time: 15 us
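Note that the first cublasDcopy call may include one-time cuBLAS initialization, so a fairer comparison would warm up first and time the device work with CUDA events; a minimal sketch under those assumptions:

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int N = 1000000;
    double *x, *y;
    cudaMalloc(&x, N * sizeof(double));
    cudaMalloc(&y, N * sizeof(double));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Untimed warm-up so one-time cuBLAS setup is excluded.
    cublasDcopy(handle, N, x, 1, y, 1);
    cudaDeviceSynchronize();

    // CUDA events measure elapsed time on the device itself.
    cudaEvent_t begin, stop;
    cudaEventCreate(&begin);
    cudaEventCreate(&stop);

    cudaEventRecord(begin);
    cublasDcopy(handle, N, x, 1, y, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, begin, stop);
    printf("warmed-up cublasDcopy: %f ms\n", ms);

    cudaEventDestroy(begin);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(x);
    cudaFree(y);
    return 0;
}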

That is to say, if these two examples are implemented correctly, what we are doing right now is already the most performant approach on both CPU and GPU, and the suggestions I made above do not make sense. Sorry about the confusion.