ddemidov / vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
http://vexcl.readthedocs.org
MIT License

Support for Block SpMV #274


daaugusto commented 4 years ago

Hi, first of all, thanks for the development and maintenance of this great library! I would like to know if there is any plan to support block sparse matrices and block SpMV with constant-size blocks. This would speed up the kernel for this class of matrices by reducing the indexing overhead.

Thank you.

ddemidov commented 4 years ago

It is possible, but it is not directly provided by the library.

There is an example of block-valued SpMV in the VexCL backend for amgcl here: https://github.com/ddemidov/amgcl/blob/master/amgcl/backend/vexcl_static_matrix.hpp

The code extends VexCL sparse matrix operations to support amgcl::static_matrix<T,N,N> as the value type for VexCL matrices.
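
Roughly, the user side looks like this (a minimal sketch following amgcl's examples; nrows, ncols, and the block-compressed CSR arrays ptr/col/val are placeholders):

   #include <vexcl/vexcl.hpp>
   #include <amgcl/backend/vexcl_static_matrix.hpp>

   typedef amgcl::static_matrix<double, 4, 4> mat4; // 4x4 value blocks
   typedef amgcl::static_matrix<double, 4, 1> vec4; // block-valued vector entries

   vex::Context ctx(vex::Filter::Env);

   // Teach the VexCL kernel generator about the static_matrix types:
   amgcl::backend::enable_static_matrix_for_vexcl(ctx);

   // val holds one mat4 per (block) nonzero:
   vex::sparse::matrix<mat4, int, int> A(ctx, nrows, ncols, ptr, col, val);
   vex::vector<vec4> x(ctx, ncols), y(ctx, nrows);

   y = A * x; // block-valued SpMV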

daaugusto commented 4 years ago

Thank you for your prompt reply; I will try to follow that example. Do you have an estimate of the SpMV performance gain over the scalar version?

ddemidov commented 4 years ago

It should be faster, but not dramatically so. Here is a quick experiment using amgcl to solve an elasticity problem that has a 4x4 block structure.

In the first case the matrix is treated as a scalar one, and pointwise aggregation coarsening (block_size=4) is employed:

./solver_vexcl_cl -B -A problems/block4/A.bin -f problems/block4/b.bin -p \
  solver.{type=lgmres,tol=1e-4} \
  precond.relax.type=damped_jacobi \
  precond.coarsening.type=aggregation \
  precond.coarsening.aggr.eps_strong=0 \
  precond.coarsening.aggr.block_size=4
1. Tesla K40c (NVIDIA CUDA)

Solver
======
Type:             LGMRES(30,3)
Unknowns:         746944
Memory footprint: 216.57 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.28
Grid complexity:     1.26
Memory footprint:    890.87 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0       746944       41921568    697.99 M (78.34%)
    1       173784       10569056    174.68 M (19.75%)
    2        19776         972896     16.37 M ( 1.82%)
    3         1712          51360      1.83 M ( 0.10%)

Iterations: 43
Error:      8.50538e-05

[Profile:       4.458 s] (100.00%)
[ self:         0.200 s] (  4.48%)
[  reading:     0.550 s] ( 12.33%)
[  setup:       2.289 s] ( 51.34%)
[  solve:       1.420 s] ( 31.85%)

In the second case the matrix is treated as block-valued with 4x4 blocks (-b4):

./solver_vexcl_cl -B -A problems/block4/A.bin -f problems/block4/b.bin -p \
  solver.{type=lgmres,tol=1e-4} \
  precond.relax.type=damped_jacobi \
  precond.coarsening.type=aggregation \
  precond.coarsening.aggr.eps_strong=0 \
  -b4
1. Tesla K40c (NVIDIA CUDA)

Solver
======
Type:             LGMRES(30,3)
Unknowns:         186736
Memory footprint: 216.57 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.05
Grid complexity:     1.05
Memory footprint:    451.76 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0       186736        2620098    431.07 M (95.63%)
    1         8469         116745     19.28 M ( 4.26%)
    2          397           2977      1.41 M ( 0.11%)

Iterations: 47
Error:      9.0758e-05

[Profile:       3.016 s] (100.00%)
[ self:         0.208 s] (  6.89%)
[  reading:     0.557 s] ( 18.45%)
[  setup:       1.072 s] ( 35.56%)
[  solve:       1.179 s] ( 39.10%)

As you can see, the problem solution requires a similar number of iterations in both cases (where each iteration consists of a number of SpMV operations), but the block-valued one is about 20% faster (the 'solve' line in the profile: 1.179 s vs 1.420 s). Also, the method is constructed ('setup') much faster in the block-valued case, but that part is done on the CPU side.

daaugusto commented 4 years ago

Hi Denis, with the help of amgcl I was able to add support for block matrices in my application, thanks! Now I'm trying to inline a sequence of operations, which I can do with scalar matrices but not with block matrices.

This sequence (scalar):

   vex::vector<double> tmp( DeviceManager::defaultQueueAsVector(), n, 0, vex::backend::MEM_READ_WRITE );
   vex::vector<double> u( DeviceManager::defaultQueueAsVector(), n, 0, vex::backend::MEM_READ_WRITE );
   tmp = Z * x;
   u = D * tmp;
   y = W * u;

is inlined as:

   vex::vector<double> tmp( DeviceManager::defaultQueueAsVector(), n, 0, vex::backend::MEM_READ_WRITE );
   tmp = (D * vex::make_inline(Z * x));
   y = W * tmp;

where Z and W are of type vex::SpMat<double, int, cl_long>, x, y, and D are vex::vector<double>, and n is the number of entries of the scalar matrix. (I wish I could inline it even further, but I wasn't able to; still, the current form already delivers about a 60% speedup.)

However, with the block version, if I try to inline that sequence I get the error:

error: no matching function for call to ‘make_inline(std::enable_if<true, vex::sparse::matrix_vector_product<vex::sparse::ell<amgcl::static_matrix<double, 16, 16>, int, long int>, vex::vector<amgcl::static_matrix<double, 16, 1> > > >::type)’
                tmp = (D * vex::make_inline(Z * x));
                           ~~~~~~~~~~~~~~~~~^~~~~~

where:

- Z and W are of type vex::sparse::ell<value_type_matrix, int, cl_long>
- D is vex::vector<value_type_matrix>
- x and y are vex::vector<value_type_vector>
- tmp is declared as vex::vector<value_type_vector> tmp( DeviceManager::defaultQueueAsVector(), n, 0, vex::backend::MEM_READ_WRITE );
- u is declared as vex::vector<value_type_vector> u( DeviceManager::defaultQueueAsVector(), n, 0, vex::backend::MEM_READ_WRITE );

with

   value_type_vector = amgcl::static_matrix<double,BS,1>; // BS is the block size
   value_type_matrix = amgcl::static_matrix<double,BS,BS>;

Also, in order to perform D * tmp with block matrices I have to do:

amgcl::backend::vmul( 1.0, D, tmp, 0.0, u );
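
(For reference, amgcl::backend::vmul(alpha, x, y, beta, z) computes the elementwise product z = alpha * x .* y + beta * z, so the call above is just the pointwise block scaling u = D .* tmp.)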

Do you think it is possible to inline the block matrix version? If so, do you have any suggestions on how to accomplish that?

Thank you very much.

ddemidov commented 4 years ago

I think you should use vex::sparse::matrix instead of vex::SpMat to get everything inlined (it does not require a make_inline call; operations with vex::sparse::matrix are inlined by default).

The block matrices implemented in amgcl in amgcl/backend/vexcl_static_matrix.hpp do support inlining. For example, the residual operation is completely inlined here:

https://github.com/ddemidov/amgcl/blob/d6afcf7236801fc1726639b4fbad9f66a36ea61b/amgcl/backend/vexcl.hpp#L313
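
For comparison, a minimal scalar sketch of the default inlining (ctx, n, ptr, col, and val are placeholders):

   vex::sparse::matrix<double> Z(ctx, n, n, ptr, col, val);
   vex::vector<double> d(ctx, n), x(ctx, n), y(ctx, n);

   // The product Z * x participates in the expression directly, and the
   // whole right-hand side is fused into a single kernel:
   y = d * (Z * x);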

daaugusto commented 4 years ago

Hi Denis, I converted all the declarations to vex::sparse::matrix but I'm still getting an error with the block matrices. I have this:

std::unique_ptr<vex::sparse::matrix<value_type_matrix, int, int>> m_Z; // value_type_matrix = amgcl::static_matrix<double,3,3>
std::unique_ptr<vex::sparse::matrix<value_type_matrix, int, int>> m_W; // value_type_matrix = amgcl::static_matrix<double,3,3>
std::unique_ptr<vex::vector<value_type_matrix>> m_D; // Diagonal "matrix" stored as a vector of 3x3 blocks

Then, I can compile the following:

 tmp = *m_D * (*m_Z * x); // tmp, x and y are vex::vector<value_type_vector>, with value_type_vector = amgcl::static_matrix<double,3,1>
 y = *m_W * tmp;

But when I run it I receive this error:

typedef struct { double data[3][3]; } amgcl_matrix_double_3x3;
typedef struct { double data[3][1]; } amgcl_matrix_double_3x1;

kernel void vexcl_vector_kernel
(
  ulong n,
  global amgcl_matrix_double_3x1 * prm_1,
  global amgcl_matrix_double_3x3 * prm_2,
  ulong prm_3_ell_width,
  ulong prm_3_ell_pitch,
  global int * prm_3_ell_col,
  global double * prm_3_ell_val,
  global int * prm_3_csr_ptr,
  global int * prm_3_csr_col,
  global double * prm_3_csr_val,
  global amgcl_matrix_double_3x1 * prm_3_x_1
)
{
  for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0))
  {
    amgcl_matrix_double_3x1 prm_3_sum;
    prm_3_sum.data[0][0] = 0;
    prm_3_sum.data[1][0] = 0;
    prm_3_sum.data[2][0] = 0;
    {
      ulong prm_3_ell_size = prm_3_ell_width * prm_3_ell_pitch;
      for(size_t j = 0; j < prm_3_ell_width; ++j)
      {
        int nnz_idx = idx + j * prm_3_ell_pitch;
        int c = prm_3_ell_col[nnz_idx];
        if (c != (int)(-1))
        {
          int idx = c;
          {
            amgcl_matrix_double_3x1 v = prm_3_x_1[idx];
            prm_3_sum.data[0][0] += prm_3_ell_val[0 * prm_3_ell_size + nnz_idx] * v.data[0][0] + prm_3_ell_val[1 * prm_3_ell_size + nnz_idx] * v.data[1][0] + prm_3_ell_val[2 * prm_3_ell_size + nnz_idx] * v.data[2][0];                    
            prm_3_sum.data[1][0] += prm_3_ell_val[3 * prm_3_ell_size + nnz_idx] * v.data[0][0] + prm_3_ell_val[4 * prm_3_ell_size + nnz_idx] * v.data[1][0] + prm_3_ell_val[5 * prm_3_ell_size + nnz_idx] * v.data[2][0];                    
            prm_3_sum.data[2][0] += prm_3_ell_val[6 * prm_3_ell_size + nnz_idx] * v.data[0][0] + prm_3_ell_val[7 * prm_3_ell_size + nnz_idx] * v.data[1][0] + prm_3_ell_val[8 * prm_3_ell_size + nnz_idx] * v.data[2][0];                    
          }
        } else break;
      }
      if (prm_3_csr_ptr)
      {
        ulong prm_3_csr_size = prm_3_csr_ptr[n];
        int csr_beg = prm_3_csr_ptr[idx];
        int csr_end = prm_3_csr_ptr[idx+1];
        for(int j = csr_beg; j < csr_end; ++j)
        {
          int idx = prm_3_csr_col[j];
          {
            amgcl_matrix_double_3x1 v = prm_3_x_1[idx];
            prm_3_sum.data[0][0] += prm_3_csr_val[0 * prm_3_csr_size + j] * v.data[0][0] + prm_3_csr_val[1 * prm_3_csr_size + j] * v.data[1][0] + prm_3_csr_val[2 * prm_3_csr_size + j] * v.data[2][0];                                      
            prm_3_sum.data[1][0] += prm_3_csr_val[3 * prm_3_csr_size + j] * v.data[0][0] + prm_3_csr_val[4 * prm_3_csr_size + j] * v.data[1][0] + prm_3_csr_val[5 * prm_3_csr_size + j] * v.data[2][0];                                      
            prm_3_sum.data[2][0] += prm_3_csr_val[6 * prm_3_csr_size + j] * v.data[0][0] + prm_3_csr_val[7 * prm_3_csr_size + j] * v.data[1][0] + prm_3_csr_val[8 * prm_3_csr_size + j] * v.data[2][0];                                      
          }
        }
      }
    }
    prm_1[idx] = ( prm_2[idx] * prm_3_sum );
  }
}
<kernel>:94:31: error: invalid operands to binary expression ('amgcl_matrix_double_3x3' and 'amgcl_matrix_double_3x1')
    prm_1[idx] = ( prm_2[idx] * prm_3_sum );
                   ~~~~~~~~~~ ^ ~~~~~~~~~

No problem running that with the scalar version, though.

ddemidov commented 4 years ago

The kernel looks like it is almost working for you. The problem is that OpenCL does not know how to multiply amgcl_matrix_double_3x3 and amgcl_matrix_double_3x1. Since OpenCL is C-based, you cannot overload arithmetic operators, so you will need to provide a custom function that does the multiplication. The resulting expression should look like

 tmp = mul(*m_D, (*m_Z * x));

The example of such a function implemented in a generic way can be found here:

https://github.com/ddemidov/amgcl/blob/461a66ce6d197a3816218bf94ffd114a367c1ef1/amgcl/backend/vexcl_static_matrix.hpp#L733-L765

This is used in

https://github.com/ddemidov/amgcl/blob/461a66ce6d197a3816218bf94ffd114a367c1ef1/amgcl/backend/vexcl_static_matrix.hpp#L851-L869

If you only need the 3x3 case, the function definition may be as simple as

VEX_FUNCTION(value_type_vector, mul, (value_type_matrix, a)(value_type_vector, x),
    // The function body is OpenCL C operating on the generated structs,
    // so the elements are accessed through the 'data' member:
    amgcl_matrix_double_3x1 r;
    r.data[0][0] = a.data[0][0] * x.data[0][0] + a.data[0][1] * x.data[1][0] + a.data[0][2] * x.data[2][0];
    r.data[1][0] = a.data[1][0] * x.data[0][0] + a.data[1][1] * x.data[1][0] + a.data[1][2] * x.data[2][0];
    r.data[2][0] = a.data[2][0] * x.data[0][0] + a.data[2][1] * x.data[1][0] + a.data[2][2] * x.data[2][0];
    return r;
);

daaugusto commented 4 years ago

Hi Denis, thank you for your answer. Following the proposed approach, would the following also work (and get rid of tmp)?

   y = *m_W * mul(*m_D,  (*m_Z * x));

And, if so, would it be more efficient than doing the calculation in two steps:

   tmp = mul(*m_D, (*m_Z * x));
   y = *m_W * tmp;

?

Thanks again.

ddemidov commented 4 years ago

m_W is a sparse matrix, right? This should work, but I think it would be less efficient: the inlined right-hand side would be re-evaluated for every nonzero of m_W, which means a lot of duplicated and badly cached memory reads. But it never hurts to check ;).

daaugusto commented 4 years ago

Yes, it is a sparse block matrix. I was hoping VexCL would magically fuse them into a single kernel... ;-) I'll check which is best. Thank you.

ddemidov commented 4 years ago

VexCL will fuse them into a single kernel, but I am afraid the kernel will be quite inefficient.

daaugusto commented 4 years ago

Hi Denis, I ended up using vex_mul directly instead of defining a custom VexCL function:

tmp = amgcl::backend::vex_mul<double, double, BS>().apply( *m_D, ( *m_Z * x ) );
y = *m_W * tmp;

which leads to a speedup of 1.2x over doing the three separate operations (tmp = *m_Z * x; u = *m_D * tmp; y = *m_W * u). Also, your prediction was correct: the following "one-line" version works but is more than 10 times slower than the "two-line" version:

y = *m_W * amgcl::backend::vex_mul<double, double, BS>().apply( *m_D, ( *m_Z * x ) );

Thank you.