I'm not an expert in C or CUDA, but if you replace your NDArray_FMatmul function with the following implementation, the problem will be solved.
NDArray*
NDArray_FMatmul(NDArray *a, NDArray *b) {
    // The output shape is (rows of a) x (cols of b)
    int* output_shape = emalloc(sizeof(int) * 2);
    output_shape[0] = NDArray_SHAPE(a)[0];
    output_shape[1] = NDArray_SHAPE(b)[1];
    NDArray* result = NDArray_Zeros(output_shape, 2, NDARRAY_TYPE_FLOAT32, NDArray_DEVICE(a));

    if (NDArray_DEVICE(a) == NDARRAY_DEVICE_GPU) {
#ifdef HAVE_CUBLAS
        cublasHandle_t handle;
        cublasCreate(&handle);

        // Allocate temporary device buffers and copy the operands into them
        float* d_A;
        float* d_B;
        float* d_C;
        size_t size_A = NDArray_NUMELEMENTS(a) * sizeof(float);
        size_t size_B = NDArray_NUMELEMENTS(b) * sizeof(float);
        size_t size_C = NDArray_NUMELEMENTS(result) * sizeof(float);
        cudaMalloc((void**)&d_A, size_A);
        cudaMalloc((void**)&d_B, size_B);
        cudaMalloc((void**)&d_C, size_C);
        cudaMemcpy(d_A, NDArray_FDATA(a), size_A, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, NDArray_FDATA(b), size_B, cudaMemcpyHostToDevice);

        int m = NDArray_SHAPE(a)[0];  // rows of a
        int n = NDArray_SHAPE(b)[1];  // cols of b
        int k = NDArray_SHAPE(a)[1];  // cols of a == rows of b
        float alpha = 1.0f;
        float beta = 0.0f;

        // cuBLAS expects column-major storage, so compute C^T = B^T * A^T by
        // swapping the operands; the row-major product ends up directly in d_C.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k,
                    &alpha, d_B, n, d_A, k, &beta, d_C, n);

        // Copy the product back into the result buffer and release GPU resources
        cudaMemcpy(NDArray_FDATA(result), d_C, size_C, cudaMemcpyDeviceToHost);
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        cublasDestroy(handle);
#endif
    } else {
        // CPU path: plain row-major single-precision GEMM via CBLAS
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    NDArray_SHAPE(a)[0], NDArray_SHAPE(b)[1], NDArray_SHAPE(a)[1],
                    1.0f, NDArray_FDATA(a), NDArray_SHAPE(a)[1],
                    NDArray_FDATA(b), NDArray_SHAPE(b)[1],
                    0.0f, NDArray_FDATA(result), NDArray_SHAPE(b)[1]);
    }
    return result;
}
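One thing the snippet above leaves out is error handling: every cudaMalloc, cudaMemcpy, and cuBLAS call returns a status code that is silently ignored. A minimal sketch of what checking those codes could look like is below; the CHECK_CUDA and CHECK_CUBLAS macro names are just illustrative, not part of the extension.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Illustrative helper macros -- not part of the NDArray codebase.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err__ = (call);                                   \
        if (err__ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err__), __FILE__, __LINE__);   \
        }                                                             \
    } while (0)

#define CHECK_CUBLAS(call)                                            \
    do {                                                              \
        cublasStatus_t st__ = (call);                                 \
        if (st__ != CUBLAS_STATUS_SUCCESS) {                          \
            fprintf(stderr, "cuBLAS error %d at %s:%d\n",             \
                    (int)st__, __FILE__, __LINE__);                   \
        }                                                             \
    } while (0)

// Usage inside the GPU branch would then look like:
//   CHECK_CUDA(cudaMalloc((void**)&d_A, size_A));
//   CHECK_CUBLAS(cublasSgemm(handle, ...));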
Benchmark with these changes:
------ CPU Benchmark ------
0 iteration duration: 0.56329798698425 seconds.
1 iteration duration: 0.60432100296021 seconds.
2 iteration duration: 0.60670709609985 seconds.
------ GPU Benchmark ------
0 iteration duration: 0.10707297325134 seconds.
1 iteration duration: 0.080431222915649 seconds.
2 iteration duration: 0.075145959854126 seconds.
Thank you very much @SkibidiProduction
I should certainly focus on an update to optimize the GPU methods for the next version. Many of the manually implemented CUDA kernels work, but they are definitely unable to use the GPU as efficiently as possible, since some were implemented in the simplest way.
In some cases, like this one, I should simply remove the kernel and perform the operation using cuBLAS. I will make an effort to identify the points where this replacement can be made for the next update.
Feel free to submit a PR for this solution and we can take it from there, or, if you prefer, I can make the change myself.
I just remembered that I used a kernel for the operation because for very small arrays the overhead of cuBLAS ends up making the operation slower than using the CUDA kernel.
In this case, I believe we will use both calls (cuBLAS, CUDA) depending on the size of the array.
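If we go that route, the dispatch could be as simple as comparing the number of output elements against a tuned threshold. A rough sketch under that assumption (NDArray_FMatmul_Kernel, NDArray_FMatmul_Cublas, and the threshold value are placeholder names, not the extension's actual API):

// Hypothetical dispatch sketch: small matrices keep the hand-written CUDA
// kernel (no cuBLAS handle/launch overhead); larger ones go through cuBLAS,
// where that overhead is amortized by the faster GEMM.
#define FMATMUL_CUBLAS_THRESHOLD (64 * 64)  // to be tuned by benchmarking

static void
NDArray_FMatmul_GPU_Dispatch(NDArray *a, NDArray *b, NDArray *result) {
    if (NDArray_NUMELEMENTS(result) < FMATMUL_CUBLAS_THRESHOLD) {
        NDArray_FMatmul_Kernel(a, b, result);   // existing custom kernel path
    } else {
        NDArray_FMatmul_Cublas(a, b, result);   // cuBLAS path shown above
    }
}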
Thanks for the answer. I look forward to your update. I will not make a PR, since I do not yet have enough competence to write code for this extension with sufficient quality.
Thanks @henrique-borba!
Thanks @andrewdalpino, I was thinking of you when I launched autograd. Andrew had told me about it and I didn't even know what it was at the time, haha.
Fixed in 0.5.1, closing this issue.
Describe the bug
When conducting tests, I found that the speed of calculations on the GPU is extremely low (even lower than when calculating on the CPU).
Here are my results:
To Reproduce
Execute the following script:
Expected behavior
Computations on the GPU should be faster than on the CPU. Here are my results for PyTorch:
Environment:
Optional: PHP Information
PHP Version Info (php -v):
PHP Modules: