NLESC-JCER / EigenCuda

Offload Eigen operations to GPUs
Apache License 2.0

Performance optimization #3

Closed felipeZ closed 5 years ago

felipeZ commented 5 years ago

There are several possible optimizations for matrix-tensor multiplication:

felipeZ commented 5 years ago

Stacking the matrices and then sending the resulting array to the device does not seem to be very efficient, mainly due to the stacking step. The asynchronous streaming approach seems more promising.
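
For illustration, a minimal sketch of the asynchronous-streaming idea, assuming one stream per matrix; the helper name and layout are illustrative rather than the actual EigenCuda code, and true copy/compute overlap would also require pinned host memory:

```cpp
#include <cuda_runtime.h>
#include <Eigen/Dense>
#include <vector>

// Sketch: issue one host-to-device copy per matrix on its own stream,
// instead of stacking everything into a single buffer first.
void copy_async(const std::vector<Eigen::MatrixXd>& mats,
                std::vector<double*>& dev_ptrs) {
  std::vector<cudaStream_t> streams(mats.size());
  for (size_t i = 0; i < mats.size(); ++i) {
    cudaStreamCreate(&streams[i]);
    const size_t bytes = static_cast<size_t>(mats[i].size()) * sizeof(double);
    cudaMalloc(reinterpret_cast<void**>(&dev_ptrs[i]), bytes);
    // Each copy is issued on its own stream; with pinned host memory the
    // transfers can overlap with kernels running on other streams.
    cudaMemcpyAsync(dev_ptrs[i], mats[i].data(), bytes,
                    cudaMemcpyHostToDevice, streams[i]);
  }
  for (auto& s : streams) {
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
  }
}
```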

felipeZ commented 5 years ago

It seems that the tensor-matrix operations that we need can be performed with the gemmBatched operations.
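
A hedged sketch of how such a call could look with cublasDgemmBatched; the helper name, the column-major layout, and the "same B for every slice" setup are assumptions for illustration, not the EigenCuda API:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Sketch: multiply every device slice A_i (m x k) by the same device
// matrix B (k x n), producing C_i (m x n), with one batched call.
void batched_product(cublasHandle_t handle,
                     std::vector<double*>& dA,  // device slices A_i
                     double* dB,                // single device matrix B
                     std::vector<double*>& dC,  // device results C_i
                     int m, int n, int k) {
  const int batch = static_cast<int>(dA.size());
  std::vector<double*> hB(batch, dB);  // the same B is reused for every slice

  // cuBLAS expects the arrays of pointers themselves to live on the device.
  double **dAarr = nullptr, **dBarr = nullptr, **dCarr = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&dAarr), batch * sizeof(double*));
  cudaMalloc(reinterpret_cast<void**>(&dBarr), batch * sizeof(double*));
  cudaMalloc(reinterpret_cast<void**>(&dCarr), batch * sizeof(double*));
  cudaMemcpy(dAarr, dA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
  cudaMemcpy(dBarr, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
  cudaMemcpy(dCarr, dC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

  const double alpha = 1.0, beta = 0.0;
  // One call performs C_i = A_i * B for every slice in the batch.
  cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     &alpha, dAarr, m, dBarr, k, &beta, dCarr, m, batch);

  cudaFree(dAarr); cudaFree(dBarr); cudaFree(dCarr);
}
```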

felipeZ commented 5 years ago

Instead of using gemmBatched, it would be great if we could use gemmStridedBatched.
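
A sketch of the same product with cublasDgemmStridedBatched, assuming the tensor slices sit in one contiguous device buffer so that no arrays of pointers are needed; the stride of 0 for B reuses the same matrix for every slice. Again purely illustrative:

```cpp
#include <cublas_v2.h>

// Sketch: slice i of the tensor starts at dA + i * m * k, so a single
// strided call replaces the pointer-array bookkeeping of gemmBatched.
void strided_batched_product(cublasHandle_t handle,
                             const double* dA,  // batch slices of m x k, contiguous
                             const double* dB,  // single k x n matrix
                             double* dC,        // batch slices of m x n, contiguous
                             int m, int n, int k, int batch) {
  const double alpha = 1.0, beta = 0.0;
  cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                            &alpha,
                            dA, m, static_cast<long long>(m) * k,  // stride to next A slice
                            dB, k, 0,                              // stride 0: reuse the same B
                            &beta,
                            dC, m, static_cast<long long>(m) * n,  // stride to next C slice
                            batch);
}
```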

felipeZ commented 5 years ago

Several memory optimization techniques have been tried.

The winning strategy so far seems to be sending the Eigen matrices stored in an std::vector one by one to the device, computing the matrix multiplication with gemmBatched, and finally copying the resulting tensor back to the host.
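
A hypothetical end-to-end sketch of that workflow, reusing the batched_product helper sketched above; the names, dimensions, and absence of error checking are all simplifications:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <Eigen/Dense>
#include <vector>

std::vector<Eigen::MatrixXd> tensor_times_matrix(
    cublasHandle_t handle,
    const std::vector<Eigen::MatrixXd>& A,  // slices, each m x k
    const Eigen::MatrixXd& B) {             // k x n
  const int m = static_cast<int>(A[0].rows());
  const int k = static_cast<int>(A[0].cols());
  const int n = static_cast<int>(B.cols());
  const size_t bytesA = static_cast<size_t>(m) * k * sizeof(double);
  const size_t bytesB = static_cast<size_t>(k) * n * sizeof(double);
  const size_t bytesC = static_cast<size_t>(m) * n * sizeof(double);

  // 1. Send B once, then each slice of the tensor one by one.
  std::vector<double*> dA(A.size()), dC(A.size());
  double* dB = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&dB), bytesB);
  cudaMemcpy(dB, B.data(), bytesB, cudaMemcpyHostToDevice);
  for (size_t i = 0; i < A.size(); ++i) {
    cudaMalloc(reinterpret_cast<void**>(&dA[i]), bytesA);
    cudaMalloc(reinterpret_cast<void**>(&dC[i]), bytesC);
    cudaMemcpy(dA[i], A[i].data(), bytesA, cudaMemcpyHostToDevice);
  }

  // 2. One batched multiplication for all slices.
  batched_product(handle, dA, dB, dC, m, n, k);

  // 3. Copy the resulting tensor back slice by slice and free the buffers.
  std::vector<Eigen::MatrixXd> C(A.size(), Eigen::MatrixXd(m, n));
  for (size_t i = 0; i < A.size(); ++i) {
    cudaMemcpy(C[i].data(), dC[i], bytesC, cudaMemcpyDeviceToHost);
    cudaFree(dA[i]); cudaFree(dC[i]);
  }
  cudaFree(dB);
  return C;
}
```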

felipeZ commented 5 years ago

The following is the nvprof output for the best strategy described above:

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   48.91%  63.362ms         4  15.840ms  110.98us  55.202ms  void fermiPlusDgemmLDS128_batched<bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(double**, double**, double**, double*, double const *, double const *, int, int, int, int, int, int, __int64, __int64, __int64, double const *, double const *, double, double, int)
                   27.71%  35.895ms        60  598.25us     864ns  4.4987ms  [CUDA memcpy HtoD]
                   23.39%  30.299ms        44  688.62us  1.9520us  2.5379ms  [CUDA memcpy DtoH]
      API calls:   84.33%  568.94ms       112  5.0799ms  3.6430us  540.59ms  cudaFree
                   12.88%  86.884ms       100  868.84us  6.3090us  5.1049ms  cudaMemcpyAsync
                    1.87%  12.615ms       108  116.80us  5.1790us  615.07us  cudaMalloc

Still, roughly half of the GPU time is spent transferring data between host and device.