NumPower / numpower

PHP extension for efficient scientific computing and array manipulation with GPU support
https://numpower.org

Low computing speed #51

Closed: SkibidiProduction closed this issue 2 months ago

SkibidiProduction commented 2 months ago

Describe the bug

When conducting tests, I found that the speed of calculations on the GPU is extremely low (even lower than when calculating on the CPU).

Here are my results:

------ CPU Benchmark ------

0 iteration duration: 0.55710983276367 seconds.
1 iteration duration: 0.56731915473938 seconds.
2 iteration duration: 0.60753011703491 seconds.

------ GPU Benchmark ------

0 iteration duration: 0.95671892166138 seconds.
1 iteration duration: 0.94768095016479 seconds.
2 iteration duration: 0.94867181777954 seconds.

To Reproduce

Execute the following script:

<?php

use \NDArray as nd;

$matrix = [];

for ($i = 0; $i < 4096; $i++) {
   for ($j = 0; $j < 4096; $j++) {
      $matrix[$i][] = random_int(0, 10);
   }
}

$x_cpu = nd::array($matrix)->cpu();
$y_cpu = nd::array($matrix)->cpu();

echo PHP_EOL;

echo "------ CPU Benchmark ------" . PHP_EOL . PHP_EOL;

for ($i = 0; $i < 3; $i++) {
   $startTime = microtime(true);
   nd::matmul($x_cpu, $y_cpu);
   echo "$i iteration duration: " . microtime(true) - $startTime . " seconds." . PHP_EOL;
}

$x_gpu = nd::array($matrix)->gpu();
$y_gpu = nd::array($matrix)->gpu();

echo PHP_EOL;

echo "------ GPU Benchmark ------" . PHP_EOL . PHP_EOL;

for ($i = 0; $i < 3; $i++) {
   $startTime = microtime(true);
   nd::matmul($x_gpu, $y_gpu);
   echo "$i iteration duration: " . microtime(true) - $startTime . " seconds." . PHP_EOL;
}
echo PHP_EOL;

Expected behavior

Computations on the GPU should be faster than on the CPU. Here are my results for PyTorch:

------ CPU Benchmark ------

0 iteration duration: 0.49809885025024414 seconds
1 iteration duration: 0.5133359432220459 seconds
2 iteration duration: 0.5611481666564941 seconds

------ GPU Benchmark ------

0 iteration duration: 0.052846670150756836 seconds
1 iteration duration: 0.05301022529602051 seconds
2 iteration duration: 0.04915332794189453 seconds

Dumps

$x_gpu->dump();
=================================================
NDArray.uuid            3
NDArray.ndim            2
NDArray.dims            [ 4096 4096 ]
NDArray.strides         [ 16384 4 ]
NDArray.device          (1) GPU
NDArray.refcount        1
NDArray.descriptor.elsize   4
NDArray.descriptor.numElements  16777216
NDArray.descriptor.type     float32
NDArray.iterator.current_index  0
=================================================
$y_gpu->dump();
=================================================
NDArray.uuid            4
NDArray.ndim            2
NDArray.dims            [ 4096 4096 ]
NDArray.strides         [ 16384 4 ]
NDArray.device          (1) GPU
NDArray.refcount        1
NDArray.descriptor.elsize   4
NDArray.descriptor.numElements  16777216
NDArray.descriptor.type     float32
NDArray.iterator.current_index  0
=================================================

Environment:

PHP Information

PHP Version Info (php -v):

PHP 8.3.0 (cli) (built: Jun 19 2024 03:38:22) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.3.0, Copyright (c) Zend Technologies

PHP Modules:

[PHP Modules]
Core
date
libxml
openssl
pcre
sqlite3
zlib
bz2
ctype
curl
dom
fileinfo
filter
gd
hash
iconv
json
mbstring
SPL
session
PDO
standard
mysqlnd
pdo_sqlite
Phar
posix
random
readline
Reflection
pdo_mysql
SimpleXML
tokenizer
xml
xmlreader
xmlwriter
zip
NumPower

[Zend Modules]
SkibidiProduction commented 2 months ago

I'm not an expert in C or CUDA, but if you replace your NDArray_FMatmul function with the following implementation, the problem is solved.

NDArray*
NDArray_FMatmul(NDArray *a, NDArray *b) {
    // Result shape: (rows of a) x (cols of b)
    int* output_shape = emalloc(sizeof(int) * 2);
    output_shape[0] = NDArray_SHAPE(a)[0];
    output_shape[1] = NDArray_SHAPE(b)[1];

    NDArray* result = NDArray_Zeros(output_shape, 2, NDARRAY_TYPE_FLOAT32, NDArray_DEVICE(a));

    if (NDArray_DEVICE(a) == NDARRAY_DEVICE_GPU) {
#ifdef HAVE_CUBLAS
        cublasHandle_t handle;
        cublasCreate(&handle);

        float* d_A;
        float* d_B;
        float* d_C;
        size_t size_A = NDArray_NUMELEMENTS(a) * sizeof(float);
        size_t size_B = NDArray_NUMELEMENTS(b) * sizeof(float);
        size_t size_C = NDArray_NUMELEMENTS(result) * sizeof(float);

        // Allocate temporary device buffers for both operands and the product
        cudaMalloc((void**)&d_A, size_A);
        cudaMalloc((void**)&d_B, size_B);
        cudaMalloc((void**)&d_C, size_C);

        // Copy operand data into the temporary device buffers
        cudaMemcpy(d_A, NDArray_FDATA(a), size_A, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, NDArray_FDATA(b), size_B, cudaMemcpyHostToDevice);

        int m = NDArray_SHAPE(a)[0];
        int n = NDArray_SHAPE(b)[1];
        int k = NDArray_SHAPE(a)[1];
        float alpha = 1.0f;
        float beta = 0.0f;

        // cuBLAS is column-major, so computing B*A with swapped dimensions
        // yields the row-major product A*B directly in d_C
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha, d_B, n, d_A, k, &beta, d_C, n);
        // Copy the product into the result buffer, then release the temporaries
        cudaMemcpy(NDArray_FDATA(result), d_C, size_C, cudaMemcpyDeviceToHost);

        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        cublasDestroy(handle);
#endif
    } else {
        // CPU path: single-precision row-major GEMM via CBLAS
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    NDArray_SHAPE(a)[0], NDArray_SHAPE(b)[1], NDArray_SHAPE(a)[1],
                    1.0f, NDArray_FDATA(a), NDArray_SHAPE(a)[1],
                    NDArray_FDATA(b), NDArray_SHAPE(b)[1],
                    0.0f, NDArray_FDATA(result), NDArray_SHAPE(b)[1]);
    }
    return result;
}

Benchmark with these changes:

------ CPU Benchmark ------

0 iteration duration: 0.56329798698425 seconds.
1 iteration duration: 0.60432100296021 seconds.
2 iteration duration: 0.60670709609985 seconds.

------ GPU Benchmark ------

0 iteration duration: 0.10707297325134 seconds.
1 iteration duration: 0.080431222915649 seconds.
2 iteration duration: 0.075145959854126 seconds.
henrique-borba commented 2 months ago

Thank you very much @SkibidiProduction

I should certainly focus on optimizing the GPU methods in the next version. Many of the manually implemented CUDA kernels work, but they are definitely unable to utilize the GPU as efficiently as possible, since some were implemented in the simplest way.

In some cases, like this one, I should simply remove the kernel and perform the operation using cuBLAS. I will make an effort to identify the points where this replacement can be made for the next update.

Feel free to submit a PR for this solution and we can take it from there, or, if you prefer, I can make the change.

henrique-borba commented 2 months ago

I just remembered that I used a kernel for the operation because for very small arrays the overhead of cuBLAS ends up making the operation slower than using the CUDA kernel.

In this case, I believe we will use both calls (cuBLAS, CUDA) depending on the size of the array.
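
Roughly, the dispatch could look something like the sketch below. The helper names (NDArray_FMatmul_Kernel, NDArray_FMatmul_Cublas, NDArray_FMatmul_Cpu) and the threshold constant are placeholders for illustration only, not functions that exist in NumPower today; the real crossover point would have to be measured.

// Sketch only: the helpers and the threshold below are hypothetical placeholders
#define MATMUL_CUBLAS_MIN_ELEMENTS (256 * 256)  // assumed crossover point, to be benchmarked

NDArray*
NDArray_FMatmul(NDArray *a, NDArray *b) {
    if (NDArray_DEVICE(a) == NDARRAY_DEVICE_GPU) {
        // Small operands: cuBLAS setup/launch overhead dominates, keep the custom kernel
        if (NDArray_NUMELEMENTS(a) < MATMUL_CUBLAS_MIN_ELEMENTS ||
            NDArray_NUMELEMENTS(b) < MATMUL_CUBLAS_MIN_ELEMENTS) {
            return NDArray_FMatmul_Kernel(a, b);   // existing custom CUDA kernel path
        }
        // Large operands: hand the GEMM off to cuBLAS, as in the snippet above
        return NDArray_FMatmul_Cublas(a, b);
    }
    // CPU path stays on cblas_sgemm
    return NDArray_FMatmul_Cpu(a, b);
}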

SkibidiProduction commented 2 months ago

Thanks for the answer. I look forward to your update. I won't make a PR, since I don't yet have enough competence to write code for this extension with sufficient quality.

andrewdalpino commented 2 months ago

Thanks @henrique-borba!

henrique-borba commented 2 months ago

Thanks @andrewdalpino. I was thinking of you when I launched autograd; Andrew had told me about it, and I didn't even know what it was at the time, haha.

Fixed in 0.5.1, closing this issue.