arrayfire / arrayfire

ArrayFire: a general purpose GPU library.
https://arrayfire.com
BSD 3-Clause "New" or "Revised" License

Sparse-dense matmul with AF_MAT_TRANS very slow in OpenCL on Nvidia card #2294

Open villekf opened 6 years ago

villekf commented 6 years ago

I compute the column indices and non-zero values for the sparse matrix in an OpenCL kernel iteratively (new vectors at each iteration). The code for creating the sparse matrix is as follows:

array s_elements = afcl::array(summa[osa_iter], d_elements, f32, true);
array s_indices = afcl::array(summa[osa_iter], d_indices, s32, true);
array H = sparse(row_csr.dims(0) - 1, im_dim, s_elements, row_csr, s_indices);

Using this sparse matrix in matmul with the AF_MAT_TRANS option causes the operation to take over a minute on a Tesla P100, while on a CPU (also using the OpenCL backend) it takes only a few seconds. Without AF_MAT_TRANS everything runs fast on the GPU as well. The matmul code is:

array Summ = matmul(H, constant(1, H.dims(0), 1, f32), AF_MAT_TRANS, AF_MAT_NONE);

AF info:

ArrayFire v3.7.0 (OpenCL, 64-bit Linux, build d25ab30)
[0] NVIDIA: Tesla P100-PCIE-16GB, 16280 MB
-1- NVIDIA: Quadro K620, 2001 MB
-2- INTEL: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 257831 MB

9prady9 commented 6 years ago

How is the performance on the CUDA backend? Is it identical to the OpenCL backend on that device? How is the performance on the Quadro K620 (OpenCL backend)?

Also, it would be nice if you could post a complete example that can help me reproduce this on our end.

villekf commented 6 years ago

It seems that this slowdown happens when the sparse matrix components are computed in a separate function. The following code reproduces this consistently on my end (yes, using random numbers is not a good idea, but my own code is too large and relies on precomputed Matlab MAT-files):

#include <cstdlib>
#include <vector>
#include <random>
#include "arrayfire.h"

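// Generates a random sparse matrix in CSR form: 'rows' receives the row
// pointers, 'columns' the column indices, and 'values' the non-zero values.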
void ind(std::vector<int>& rows, std::vector<int>& columns, std::vector<float>& values, int maksimi, int Nx, int Ny) {
    rows.emplace_back(0);
    std::default_random_engine generator;
    std::normal_distribution<float> distribution(0.f, 1.0f);

    for (int kk = 0; kk < Nx; kk++) {
        int N_row = std::rand() % maksimi + 1;
        int temp = 0;
        for (int ll = 0; ll < N_row; ll++) {
            int apu = std::rand() % (Ny - N_row + ll - 1);
            if (apu <= temp)
                temp++;
            else
                temp = apu;
            columns.emplace_back(temp);
            values.emplace_back(distribution(generator));
        }
        rows.emplace_back(rows[kk] + N_row);
    }
}

int main() {
    af::setDevice(0);

    int Nx = 128 * 128 * 109 * 2;
    int maksimi = 128;
    int Ny = maksimi * maksimi * 109;

    std::vector<int> columns;
    std::vector<int> rows;
    std::vector<float> values;

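    // Build the sparse matrix and run the transposed matmul three times.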
    for (int ii = 0; ii < 3; ii++) {

        ind(rows, columns, values, maksimi, Nx, Ny);

        af::array element_ar(values.size(), &values.front());
        af::array indices_ar(columns.size(), &columns.front());
        af::array lor_ar(rows.size(), &rows.front());
        af::array H = af::sparse(lor_ar.dims(0) - 1, Ny, element_ar, lor_ar, indices_ar);

        af::array Summ = af::matmul(H, af::constant(1, H.dims(0), 1), AF_MAT_TRANS);

        rows.clear();
        columns.clear();
        values.clear();
    }
    return 0;
}

The problem appears on the OpenCL backend with both Nvidia cards. It is absent on the CUDA backend and when using the CPU with the OpenCL backend. Also, as mentioned above, there is no slowdown if the ind function is removed and its body is inlined in main.
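For reference, this is roughly how I time the matmul in isolation (the helper name timeTransMatmul is illustrative and not part of the code above; ArrayFire evaluates lazily, so eval() and af::sync() are needed before stopping the timer):

#include "arrayfire.h"

// Illustrative helper: times one transposed sparse-dense matmul,
// excluding JIT/kernel compilation by doing a warm-up run first.
double timeTransMatmul(const af::array& H) {
    af::array ones = af::constant(1, H.dims(0), 1);

    // Warm-up run so kernel compilation is not measured.
    af::array warm = af::matmul(H, ones, AF_MAT_TRANS);
    warm.eval();
    af::sync();

    af::timer t = af::timer::start();
    af::array Summ = af::matmul(H, ones, AF_MAT_TRANS);
    Summ.eval();
    af::sync();                 // wait for the asynchronous kernel to finish
    return af::timer::stop(t);  // elapsed seconds
}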

9prady9 commented 6 years ago

The slowdown on the OpenCL backend for a CUDA device appears to start at a certain array size.

For sizes of 48 * 48 * 109 * 2 and below, the OpenCL backend is better than or on par with the CUDA backend. For any size greater than that, the performance drop begins. I believe this performance drop trickles down from the upstream BLAS library used by ArrayFire, i.e. clBLAS. clBLAS hasn't had any commits in almost a year and I doubt any fixes are going to be pushed into it. @umar456 Any suggestions?

Updated: Had a typo earlier, was referring to clBLAS, not clFFT.

villekf commented 6 years ago

Is there any update on this issue? Or is there some workaround that doesn't involve reducing the dimensions?

9prady9 commented 6 years ago

@villekf My original conclusion based on clBLAS turned out to be wrong. The sparse-dense multiply is done by our own custom kernel. We will look into it and post here once we have an update.

christophe-murphy commented 1 month ago

Looks like a majority of the time is being spent performing the binary search in the cscmv_block kernel.

christophe-murphy commented 1 month ago

This is a tricky issue because of the way the transpose of the sparse matrix is carried out in the OpenCL backend: the CSR matrix is treated as a CSC matrix. In CSC format, the compressed data is ordered by column rather than by row. The OpenCL kernel performs a block matrix-vector multiply. This is straightforward with the CSR format but more complicated with the CSC format: to determine which compressed indices fall in which block for a given column, a binary search is performed. As the matrix dimensions grow, the binary search quickly comes to dominate the kernel run time.
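A rough serial sketch of the two access patterns (names here are illustrative, not the actual kernel code):

// Serial reference implementations. H is stored in CSR form as
// (rowPtr, colIdx, vals) with nRows rows.

// y = H * x: each output row reads its own contiguous slice of the
// compressed data -- a gather, easy to split into independent blocks.
void csrMatvec(const int* rowPtr, const int* colIdx, const float* vals,
               const float* x, float* y, int nRows) {
    for (int r = 0; r < nRows; ++r)
        for (int k = rowPtr[r]; k < rowPtr[r + 1]; ++k)
            y[r] += vals[k] * x[colIdx[k]];
}

// y = transpose(H) * x using the same storage: every row of H now
// scatters into arbitrary entries of y. Parallel work-groups writing
// y[colIdx[k]] concurrently must either binary-search the compressed
// structure to partition the work by output block, or use atomic adds.
void cscMatvec(const int* rowPtr, const int* colIdx, const float* vals,
               const float* x, float* y, int nRows) {
    for (int r = 0; r < nRows; ++r)
        for (int k = rowPtr[r]; k < rowPtr[r + 1]; ++k)
            y[colIdx[k]] += vals[k] * x[r];
}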

As far as I can see, the only way to avoid the need for a binary search is to not perform the multiply in blocks. I have created a new OpenCL kernel for the CSC matrix-vector multiply which makes use of atomic addition operations instead. While this is not ideal and may run slowly on certain devices, I have found the runtime of this kernel to be on par with the CUDA backend on my Nvidia GPU using the example given by @villekf above. I will open a pull request soon.
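For context, OpenCL 1.x has no native atomic addition for floats, so the usual workaround is a compare-and-swap loop on the float's bit pattern. Below is a host-side C++ sketch of that pattern (names are mine; this is not the actual kernel code, which would use atomic_cmpxchg on a __global unsigned int reinterpretation of the float):

#include <atomic>
#include <cstring>

// Emulated atomic float addition via compare-and-swap on the bit pattern.
void atomicAddFloat(std::atomic<unsigned int>& slot, float val) {
    unsigned int oldBits = slot.load();
    unsigned int newBits;
    do {
        float f;
        std::memcpy(&f, &oldBits, sizeof f);    // reinterpret bits as float
        f += val;
        std::memcpy(&newBits, &f, sizeof f);    // back to bits
        // On failure, compare_exchange_weak reloads oldBits and we retry.
    } while (!slot.compare_exchange_weak(oldBits, newBits));
}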