There are several possible optimizations for matrix-tensor multiplication:
- Stacking the matrices and then sending the whole array to the device does not seem to be efficient, mainly because of the stacking step. The asynchronous streaming approach seems more promising.
- It seems that the tensor-matrix operations that we need can be expressed as batched GEMM operations.
- Instead of using gemmBatched, it would be great if we could use gemmStridedBatched (see the sketch after this list).
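To make the contrast concrete: gemmStridedBatched takes a single base pointer and a stride per operand instead of an array of per-slice pointers, which is why it requires the slices of the tensor to be contiguous in device memory. A minimal sketch, assuming double precision, column-major m x k slices and one shared k x n matrix (the names and dimensions are illustrative, not taken from this code base):

```cpp
#include <cublas_v2.h>

// Illustrative sketch (not the actual code): multiply every slice of a
// contiguous 3D tensor A (batch slices of size m x k, column-major) by the
// same k x n matrix B, writing into a contiguous tensor C (batch x m x n).
// A single call replaces the per-slice pointer bookkeeping of gemmBatched.
void tensor_times_matrix(cublasHandle_t handle,
                         const double *d_A, const double *d_B, double *d_C,
                         int m, int n, int k, int batch) {
  const double alpha = 1.0, beta = 0.0;
  // strideB = 0 reuses the same B matrix for every slice of the batch.
  cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                            &alpha,
                            d_A, m, static_cast<long long>(m) * k,
                            d_B, k, 0,
                            &beta,
                            d_C, m, static_cast<long long>(m) * n,
                            batch);
}
```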
The following memory optimization techniques have been explored: the `std::vector` of matrices representing the 3D tensor could be stacked into a temporary buffer and sent to the device in a single transfer, instead of sending one matrix at a time. However, the overhead of copying into the intermediate temporary array turned out to be more expensive than sending one matrix at a time. If the tensors were contiguous in memory, the cuBLAS gemmStridedBatched routine could be used for the matrix multiplications.

The winning strategy so far seems to be sending the Eigen matrices stored in an `std::vector` one by one to the device, then computing the matrix multiplications using gemmBatched, and finally copying the resulting tensor back to the host.
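A rough sketch of that winning strategy, with all names assumed for illustration and error checking left out:

```cpp
#include <vector>
#include <Eigen/Dense>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hedged sketch of the strategy described above: copy each Eigen matrix of
// the std::vector to the device one by one, run a single cublasDgemmBatched
// call, and copy the resulting slices back to the host.
std::vector<Eigen::MatrixXd> batched_product(cublasHandle_t handle,
                                              const std::vector<Eigen::MatrixXd> &tensor,
                                              const Eigen::MatrixXd &B) {
  const int batch = static_cast<int>(tensor.size());
  const int m = tensor[0].rows(), k = tensor[0].cols(), n = B.cols();

  // B is shared by every slice of the batch, so it is uploaded only once.
  double *d_B;
  cudaMalloc(&d_B, sizeof(double) * k * n);
  cudaMemcpyAsync(d_B, B.data(), sizeof(double) * k * n, cudaMemcpyHostToDevice);

  // Send the slices one by one and collect their device addresses.
  std::vector<double *> aPtrs(batch), bPtrs(batch, d_B), cPtrs(batch);
  for (int i = 0; i < batch; ++i) {
    cudaMalloc(&aPtrs[i], sizeof(double) * m * k);
    cudaMalloc(&cPtrs[i], sizeof(double) * m * n);
    cudaMemcpyAsync(aPtrs[i], tensor[i].data(), sizeof(double) * m * k,
                    cudaMemcpyHostToDevice);
  }

  // gemmBatched expects the pointer arrays themselves to live on the device.
  double **d_A, **d_Bp, **d_C;
  cudaMalloc(&d_A, sizeof(double *) * batch);
  cudaMalloc(&d_Bp, sizeof(double *) * batch);
  cudaMalloc(&d_C, sizeof(double *) * batch);
  cudaMemcpyAsync(d_A, aPtrs.data(), sizeof(double *) * batch, cudaMemcpyHostToDevice);
  cudaMemcpyAsync(d_Bp, bPtrs.data(), sizeof(double *) * batch, cudaMemcpyHostToDevice);
  cudaMemcpyAsync(d_C, cPtrs.data(), sizeof(double *) * batch, cudaMemcpyHostToDevice);

  const double alpha = 1.0, beta = 0.0;
  cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                     (const double **)d_A, m, (const double **)d_Bp, k,
                     &beta, d_C, m, batch);

  // Copy the resulting tensor back slice by slice and release the buffers.
  std::vector<Eigen::MatrixXd> result(batch, Eigen::MatrixXd(m, n));
  for (int i = 0; i < batch; ++i) {
    cudaMemcpy(result[i].data(), cPtrs[i], sizeof(double) * m * n, cudaMemcpyDeviceToHost);
    cudaFree(aPtrs[i]);
    cudaFree(cPtrs[i]);
  }
  cudaFree(d_B); cudaFree(d_A); cudaFree(d_Bp); cudaFree(d_C);
  return result;
}
```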
The following is the `nvprof` output for the best strategy described above:
| Type | Time(%) | Time | Calls | Avg | Min | Max | Name |
|---|---|---|---|---|---|---|---|
| GPU activities | 48.91% | 63.362ms | 4 | 15.840ms | 110.98us | 55.202ms | `void fermiPlusDgemmLDS128_batched<bool=0, bool=0, int=4, int=4, int=4, int=3, int=3, bool=1, bool=0>(double**, double**, double**, double*, double const *, double const *, int, int, int, int, int, int, __int64, __int64, __int64, double const *, double const *, double, double, int)` |
| | 27.71% | 35.895ms | 60 | 598.25us | 864ns | 4.4987ms | `[CUDA memcpy HtoD]` |
| | 23.39% | 30.299ms | 44 | 688.62us | 1.9520us | 2.5379ms | `[CUDA memcpy DtoH]` |
| API calls | 84.33% | 568.94ms | 112 | 5.0799ms | 3.6430us | 540.59ms | `cudaFree` |
| | 12.88% | 86.884ms | 100 | 868.84us | 6.3090us | 5.1049ms | `cudaMemcpyAsync` |
| | 1.87% | 12.615ms | 108 | 116.80us | 5.1790us | 615.07us | `cudaMalloc` |
Still, roughly half of the GPU time (HtoD plus DtoH memcpy) is spent transferring data.
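One hedged follow-up on the asynchronous-streaming idea from the list above: overlapping the per-slice host-to-device copies with the GEMMs on alternating CUDA streams could hide part of that transfer time, provided the host buffers are pinned. A minimal sketch under those assumptions (all names and shapes are illustrative, and it uses plain cublasDgemm per slice rather than the batched call):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Illustrative sketch of asynchronous streaming: while slice i is being
// multiplied on one stream, slice i+1 is copied to the device on the other.
// h_A must be pinned (cudaMallocHost) for the copies to actually overlap.
void streamed_tensor_matrix(cublasHandle_t handle,
                            const double *h_A,  // pinned, batch * m * k doubles
                            const double *d_B,  // k * n, already on the device
                            double *d_A, double *d_C,
                            int m, int n, int k, int batch) {
  cudaStream_t streams[2];
  cudaStreamCreate(&streams[0]);
  cudaStreamCreate(&streams[1]);
  const double alpha = 1.0, beta = 0.0;
  const size_t sliceA = static_cast<size_t>(m) * k;
  const size_t sliceC = static_cast<size_t>(m) * n;

  for (int i = 0; i < batch; ++i) {
    cudaStream_t s = streams[i % 2];
    // Copy and compute slice i on the same stream; consecutive slices land on
    // different streams, so their copies and GEMMs can overlap.
    cudaMemcpyAsync(d_A + i * sliceA, h_A + i * sliceA,
                    sliceA * sizeof(double), cudaMemcpyHostToDevice, s);
    cublasSetStream(handle, s);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                d_A + i * sliceA, m, d_B, k, &beta, d_C + i * sliceC, m);
  }
  cudaStreamSynchronize(streams[0]);
  cudaStreamSynchronize(streams[1]);
  cudaStreamDestroy(streams[0]);
  cudaStreamDestroy(streams[1]);
}
```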