cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0

What levels of sparsity is this useful for? #752

Open SimLeek opened 8 months ago

SimLeek commented 8 months ago

I'm currently working with matrices where >99% of the values are 0, and it's very unlikely most of the blocks will be larger than 1x1 in size. However, the libsmm_acc code makes me think it could still be useful there, since SMM is used for the smaller blocks and there's some CUDA optimization.

I was able to create large matrices by modifying the example dbcsr_example_3.cpp and got 1 teraflop performance, but with occupation of 1. It looks like when I assign values with c_dbcsr_iterator_next_2d_block_d, it treats the blocks like full 2D matrices. I could still calculate the 1x1 blocks and their resulting locations, but I feel like that wouldn't be very fast.

So, is this library useful when all matrices have >99% sparsity? If so, how would I set things up for that? If not, what specific types of sparsity or sparsity percentages does it help with?


Here was my main modification to the example, with density set to 0.01:

while (c_dbcsr_iterator_blocks_left(iter)) {
  //...
  // Fetch the next local block: its block indices (i, j), data pointer blk,
  // and dimensions rsize x csize.
  c_dbcsr_iterator_next_2d_block_d(iter, &i, &j, &blk, &tr, &nblk, &rsize, &csize, &roff, &coff);
  for (int g = 0; g < rsize * csize; g++) {
    double r = static_cast<double>(std::rand()) / RAND_MAX;
    // Fill each element with probability `density` (always, if density >= 1).
    if ((density < 1.0 && r < density) || density >= 1.0) {
      blk[g] = static_cast<double>(std::rand()) / RAND_MAX;
    }
  }
}
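
(For comparison, I could also flip one coin per block rather than per element, as in the untested sketch below, though I'm not sure the all-zero blocks would actually stop being stored, since that depends on how the example reserves and filters blocks:)

while (c_dbcsr_iterator_blocks_left(iter)) {
  //...
  c_dbcsr_iterator_next_2d_block_d(iter, &i, &j, &blk, &tr, &nblk, &rsize, &csize, &roff, &coff);
  // One random draw per block: either fill the whole block or leave it at zero.
  double r = static_cast<double>(std::rand()) / RAND_MAX;
  if (density >= 1.0 || r < density) {
    for (int g = 0; g < rsize * csize; g++) {
      blk[g] = static_cast<double>(std::rand()) / RAND_MAX;
    }
  }
}
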
alazzaro commented 8 months ago

I'm currently working with matrices where >99% of the values are 0, and it's very unlikely most of the blocks will be larger than 1x1 in size. However, the libsmm_acc code makes me think it could still be useful there, since SMM is used for the smaller blocks and there's some CUDA optimization.

The <1% occupancy is fine for the library. However, as you pointed out, the library is really optimized for block-sparsity (DBCSR = distributed block CSR), so you will not get any special optimization for the multiplication of such blocks (they will simply run on the CPU, i.e. no GPU involved).

I was able to create large matrices by modifying the example dbcsr_example_3.cpp and got 1 teraflop performance, but with occupation of 1.

For this particular case of dense matrices, we run a "densification" algorithm.

It looks like when I assign values with c_dbcsr_iterator_next_2d_block_d, it treats the blocks like full 2D matrices. I could still calculate the 1x1 blocks and their resulting locations, but I feel like that wouldn't be very fast.

This is correct, we always assume blocks, so single elements are treated as 1x1 blocks. We don't run any special optimization for that (it is a bit out of scope for DBCSR). Still, the library will work, but I would assume the performance would be low.

In conclusion, sparsity is fine (we used to run tests with 0.01% occupancy), but we do expect block-sparse matrices to get the best performance out of the library.
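
Just to put a rough number on that overhead argument (purely illustrative: the 16 bytes of per-block metadata below is an assumed figure for the sake of the comparison, not DBCSR's actual bookkeeping):

#include <cstdio>

int main() {
  // Assumed fixed bookkeeping cost per stored block (indices, offset into the
  // data area), amortized over the block's payload of doubles.
  const double meta_bytes_per_block = 16.0;  // hypothetical, for comparison only
  const double elem_bytes = 8.0;             // one double per element

  for (int b : {1, 4, 16, 32}) {
    double payload = elem_bytes * b * b;     // the data an SMM kernel actually multiplies
    std::printf("block %2dx%-2d: metadata/payload = %.3f\n",
                b, b, meta_bytes_per_block / payload);
  }
  // 1x1 blocks spend twice as many bytes on bookkeeping as on data;
  // at 16x16 the same bookkeeping is well under 1% of the payload.
  return 0;
}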

SimLeek commented 8 months ago

Fair. I guess 'maximum local sparsity' should be something like 75% for a block size. It might be useful to do a Game of Life-style convolution ('if 3 values in a square of 4 involving this location are non-zero, then count this zero as non-zero') and then build an R-tree of 100% dense rectangles, if you know there's going to be grouping but you don't know exactly where.
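
Something like this untested sketch for the 'count this zero as non-zero' step (the 2x2 window and the 3-of-4 threshold are the ones from above; everything else is just for illustration):

#include <vector>

// Mark zeros as occupied when some 2x2 square containing them already holds
// >= 3 non-zeros, so nearby clusters can later be covered by dense rectangles.
std::vector<char> promote_zeros(const std::vector<double>& a, int rows, int cols) {
  std::vector<char> occ(rows * cols, 0);
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < cols; ++j)
      occ[i * cols + j] = (a[i * cols + j] != 0.0);

  std::vector<char> out = occ;
  for (int i = 0; i + 1 < rows; ++i) {
    for (int j = 0; j + 1 < cols; ++j) {
      int count = occ[i * cols + j] + occ[i * cols + j + 1] +
                  occ[(i + 1) * cols + j] + occ[(i + 1) * cols + j + 1];
      if (count >= 3) {
        // Count every cell of this nearly-full square as non-zero.
        out[i * cols + j] = out[i * cols + j + 1] = 1;
        out[(i + 1) * cols + j] = out[(i + 1) * cols + j + 1] = 1;
      }
    }
  }
  return out;  // an R-tree of dense rectangles could then be built over this mask
}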

One of my matrices might be like that in a best-case scenario, with most non-zero elements along the diagonal, and I might be able to just set some grouping on the diagonal ahead of time, but the other matrices will actually avoid clumping.

Anyway, I think that answers the question in this issue: libsmm_acc does not optimize 1x1 blocks, nor does any other part of the library, since that's outside the scope of DBCSR. If that's correct, I think this can be closed.

alazzaro commented 8 months ago

Anyway, I think that answers the question in this issue: libsmm_acc does not optimize 1x1 blocks, nor does any other part of the library, since that's outside the scope of DBCSR. If that's correct, I think this can be closed.

This is correct. Note that the library will work for 1x1 elements, but there will be a massive overhead for assuming "blocks" (lookup of the dimensions and calculation of the position) that is useless for single elements... On the other hand, we could think of building the matrices with blocks of a given dimension, where the blocks are sparse inside. Then what we need is a way to group 1x1 blocks into larger blocks...
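
As a rough illustration of that grouping idea (plain C++, not the DBCSR API): bucket the scattered non-zeros into fixed B x B buffers keyed by block coordinates, so only the blocks that actually receive entries exist, and each one is a dense buffer that could then be handed over as an ordinary block:

#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Group scattered non-zero entries (i, j, value) into dense B x B block buffers,
// keyed by block coordinates; blocks that receive no entries are never created.
std::map<std::pair<int, int>, std::vector<double>>
group_into_blocks(const std::vector<std::tuple<int, int, double>>& entries, int B) {
  std::map<std::pair<int, int>, std::vector<double>> blocks;
  for (const auto& [i, j, v] : entries) {
    auto& buf = blocks[{i / B, j / B}];
    if (buf.empty()) buf.assign(B * B, 0.0);  // allocate the block on first use
    buf[(i % B) * B + (j % B)] = v;           // row-major position inside the block
  }
  return blocks;  // each stored block is dense storage-wise, but sparse inside
}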

Out of curiosity, what's your use case?

SimLeek commented 8 months ago

I'm trying to get optimized sparsity working for linear layers in PyTorch, preferably with each layer being variational to maintain sparsity, and then conv layers. I'm now looking at some smaller SpGEMM and SpMSpV libraries that look promising, but I'll have to test them.

Unfortunately, that variational rule is likely what will make most of the blocks, even 2x2 ones, sparse. And now that I think about it, that might apply to the synapses as well as the input, since having meaningfully different input connections would help keep one output from activating at the same time as another.