There are several possible optimizations for matrix-tensor multiplication:
- Stacking the matrices and then sending the whole array to the device does not seem to be efficient, mainly because of the stacking step. The asynchronous streaming approach seems more promising.
- It seems that the tensor-matrix operations that we need can be expressed as batched GEMM operations.
- Instead of using gemmBatched, it would be great if we could use gemmStridedBatched (see the sketch after this list).
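To make the contrast concrete: gemmStridedBatched takes a single base pointer and a stride per operand instead of an array of per-slice pointers, which is why it requires the slices of the tensor to be contiguous in device memory. A minimal sketch, assuming double precision, column-major m x k slices and one shared k x n matrix (the names and dimensions are illustrative, not taken from this code base):

```cpp
#include <cublas_v2.h>

// Illustrative sketch (not the actual code): multiply every slice of a
// contiguous 3D tensor A (batch slices of size m x k, column-major) by the
// same k x n matrix B, writing into a contiguous tensor C (batch x m x n).
// A single call replaces the per-slice pointer bookkeeping of gemmBatched.
void tensor_times_matrix(cublasHandle_t handle,
                         const double *d_A, const double *d_B, double *d_C,
                         int m, int n, int k, int batch) {
  const double alpha = 1.0, beta = 0.0;
  // strideB = 0 reuses the same B matrix for every slice of the batch.
  cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                            &alpha,
                            d_A, m, static_cast<long long>(m) * k,
                            d_B, k, 0,
                            &beta,
                            d_C, m, static_cast<long long>(m) * n,
                            batch);
}
```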
The following memory optimization techniques have been explored: the `std::vector` of matrices representing the 3D tensor could be stacked into a temporary buffer and sent to the device in a single transfer, instead of sending one matrix at a time. However, the overhead of copying into the intermediate temporary array turned out to be more expensive than sending one matrix at a time. If the tensors were contiguous in memory, the cuBLAS gemmStridedBatched routine could be used for the matrix multiplications.

The winning strategy so far seems to be sending the Eigen matrices stored in an `std::vector` one by one to the device, then computing the matrix multiplications using gemmBatched, and finally copying the resulting tensor back to the host.
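A rough sketch of that winning strategy, with all names assumed for illustration and error checking left out:

```cpp
#include <vector>
#include <Eigen/Dense>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hedged sketch of the strategy described above: copy each Eigen matrix of
// the std::vector to the device one by one, run a single cublasDgemmBatched
// call, and copy the resulting slices back to the host.
std::vector<Eigen::MatrixXd> batched_product(cublasHandle_t handle,
                                              const std::vector<Eigen::MatrixXd> &tensor,
                                              const Eigen::MatrixXd &B) {
  const int batch = static_cast<int>(tensor.size());
  const int m = tensor[0].rows(), k = tensor[0].cols(), n = B.cols();

  // B is shared by every slice of the batch, so it is uploaded only once.
  double *d_B;
  cudaMalloc(&d_B, sizeof(double) * k * n);
  cudaMemcpyAsync(d_B, B.data(), sizeof(double) * k * n, cudaMemcpyHostToDevice);

  // Send the slices one by one and collect their device addresses.
  std::vector<double *> aPtrs(batch), bPtrs(batch, d_B), cPtrs(batch);
  for (int i = 0; i < batch; ++i) {
    cudaMalloc(&aPtrs[i], sizeof(double) * m * k);
    cudaMalloc(&cPtrs[i], sizeof(double) * m * n);
    cudaMemcpyAsync(aPtrs[i], tensor[i].data(), sizeof(double) * m * k,
                    cudaMemcpyHostToDevice);
  }

  // gemmBatched expects the pointer arrays themselves to live on the device.
  double **d_A, **d_Bp, **d_C;
  cudaMalloc(&d_A, sizeof(double *) * batch);
  cudaMalloc(&d_Bp, sizeof(double *) * batch);
  cudaMalloc(&d_C, sizeof(double *) * batch);
  cudaMemcpyAsync(d_A, aPtrs.data(), sizeof(double *) * batch, cudaMemcpyHostToDevice);
  cudaMemcpyAsync(d_Bp, bPtrs.data(), sizeof(double *) * batch, cudaMemcpyHostToDevice);
  cudaMemcpyAsync(d_C, cPtrs.data(), sizeof(double *) * batch, cudaMemcpyHostToDevice);

  const double alpha = 1.0, beta = 0.0;
  cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                     (const double **)d_A, m, (const double **)d_Bp, k,
                     &beta, d_C, m, batch);

  // Copy the resulting tensor back slice by slice and release the buffers.
  std::vector<Eigen::MatrixXd> result(batch, Eigen::MatrixXd(m, n));
  for (int i = 0; i < batch; ++i) {
    cudaMemcpy(result[i].data(), cPtrs[i], sizeof(double) * m * n, cudaMemcpyDeviceToHost);
    cudaFree(aPtrs[i]);
    cudaFree(cPtrs[i]);
  }
  cudaFree(d_B); cudaFree(d_A); cudaFree(d_Bp); cudaFree(d_C);
  return result;
}
```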
The following is the `nvprof` output for the best strategy described above:
| Type | Time(%) | Time | Calls | Avg | Min | Max | Name |
|---|---|---|---|---|---|---|---|
| GPU activities | 48.91% | 63.362ms | 4 | 15.840ms | 110.98us | 55.202ms | `void fermiPlusDgemmLDS128_batched<bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(double**, double**, double**, double*, double const *, double const *, int, int, int, int, int, int, __int64, __int64, __int64, double const *, double const *, double, double, int)` |
| | 27.71% | 35.895ms | 60 | 598.25us | 864ns | 4.4987ms | `[CUDA memcpy HtoD]` |
| | 23.39% | 30.299ms | 44 | 688.62us | 1.9520us | 2.5379ms | `[CUDA memcpy DtoH]` |
| API calls | 84.33% | 568.94ms | 112 | 5.0799ms | 3.6430us | 540.59ms | `cudaFree` |
| | 12.88% | 86.884ms | 100 | 868.84us | 6.3090us | 5.1049ms | `cudaMemcpyAsync` |
| | 1.87% | 12.615ms | 108 | 116.80us | 5.1790us | 615.07us | `cudaMalloc` |
Still, roughly half of the GPU time (HtoD plus DtoH memcpy) is spent transferring data.
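One hedged follow-up on the asynchronous-streaming idea from the list above: overlapping the per-slice host-to-device copies with the GEMMs on alternating CUDA streams could hide part of that transfer time, provided the host buffers are pinned. A minimal sketch under those assumptions (all names and shapes are illustrative, and it uses plain cublasDgemm per slice rather than the batched call):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Illustrative sketch of asynchronous streaming: while slice i is being
// multiplied on one stream, slice i+1 is copied to the device on the other.
// h_A must be pinned (cudaMallocHost) for the copies to actually overlap.
void streamed_tensor_matrix(cublasHandle_t handle,
                            const double *h_A,  // pinned, batch * m * k doubles
                            const double *d_B,  // k * n, already on the device
                            double *d_A, double *d_C,
                            int m, int n, int k, int batch) {
  cudaStream_t streams[2];
  cudaStreamCreate(&streams[0]);
  cudaStreamCreate(&streams[1]);
  const double alpha = 1.0, beta = 0.0;
  const size_t sliceA = static_cast<size_t>(m) * k;
  const size_t sliceC = static_cast<size_t>(m) * n;

  for (int i = 0; i < batch; ++i) {
    cudaStream_t s = streams[i % 2];
    // Copy and compute slice i on the same stream; consecutive slices land on
    // different streams, so their copies and GEMMs can overlap.
    cudaMemcpyAsync(d_A + i * sliceA, h_A + i * sliceA,
                    sliceA * sizeof(double), cudaMemcpyHostToDevice, s);
    cublasSetStream(handle, s);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                d_A + i * sliceA, m, d_B, k, &beta, d_C + i * sliceC, m);
  }
  cudaStreamSynchronize(streams[0]);
  cudaStreamSynchronize(streams[1]);
  cudaStreamDestroy(streams[0]);
  cudaStreamDestroy(streams[1]);
}
```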