UDC-GAC / venom

A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores

Regarding the error when N is not divisible by V #4

Open BNAadministrator3 opened 5 months ago

BNAadministrator3 commented 5 months ago

Thanks for the terrific work and codes!

  1. When I run ./benchmark/run_spmm_spatha.sh, I find that some shapes do not work. For example, (M=192, K=192, N=3168), which can occur for linear layers in DeiT-tiny with certain batch sizes. For this shape, the following error occurs:

    [error screenshot]

    I suppose this is because N is not divisible by bm, where bm refers to the "v" in a "v:n:m" pattern. However, I have no idea how to fix it. Would you please check the problem?

  2. Some questions about ./end2end/run_inference.sh: what is the difference between the v64 and non-v64 versions? Besides, I noticed that when a single SpMM operation is tested, many configurations are run, as in the script "run_spmm_spatha.sh". In an end-to-end speed test scenario like Figure 15, do we also need to try various configurations to obtain optimal end-to-end speedup? If yes, where is the corresponding code?

Thanks again.

LopezCastroRoberto commented 5 months ago
  1. N has to be divisible by bn, not bm. Note that bn is the tile size you choose for the N dimension (see the padding sketch after this list).
  2. Just one configuration is selected in that scenario. You can check the selected configuration, for example for v=64, at https://github.com/UDC-GAC/venom/blob/8ddaf38ef918d4aedfea12696fed60348c5ccd10/end2end/spatha_mod/block_sparse/spmm/blockwise_library_v64.cu#L323
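For anyone who just needs a failing shape to run, a minimal workaround sketch (my assumptions, not part of the released kernels): zero-pad the N dimension of the dense operand up to the next multiple of the chosen bn, run the SpMM, and slice the padding off the output. `spatha_spmm` below is only a placeholder name for whichever kernel entry point you call, and the K x N layout of the dense operand is assumed purely for illustration.

```python
import torch
import torch.nn.functional as F

def pad_dim_to_multiple(t: torch.Tensor, dim: int, multiple: int) -> torch.Tensor:
    """Zero-pad dimension `dim` of `t` up to the next multiple of `multiple`."""
    pad = (-t.shape[dim]) % multiple
    if pad == 0:
        return t
    # F.pad expects pad amounts from the last dimension backwards: (last_lo, last_hi, ...)
    pads = [0] * (2 * t.dim())
    pads[2 * (t.dim() - 1 - dim) + 1] = pad
    return F.pad(t, pads)

# Example for the failing shape: N = 3168 is not divisible by bn = 64,
# so pad N up to 3200 before the SpMM and slice the extra columns off afterwards.
N, K = 3168, 192
b_dense = torch.randn(K, N, dtype=torch.half, device="cuda")   # dense operand, K x N (assumed layout)
b_padded = pad_dim_to_multiple(b_dense, dim=1, multiple=64)    # K x 3200
# c_padded = spatha_spmm(a_sparse, b_padded, ...)  # hypothetical kernel call
# c = c_padded[:, :N]                              # discard padded output columns
```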
BNAadministrator3 commented 5 months ago

Thank you very much for your reply. One more question: since the optimal configuration for each layer's sparse matrix-matrix (SpMM) operation varies with the shapes of the matrices involved, I intend to configure different settings for different layers. How could I achieve this?
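For context, what I have in mind is roughly the sketch below: benchmark a few candidate tile configurations per layer shape offline and keep the fastest one in a per-layer table. The `run_spmm(cfg)` callable is just a placeholder for whatever per-layer kernel launch is exposed; it is not an existing API of this repository, and the candidate configurations are only examples.

```python
import time
import torch

# Candidate tile configurations (bm, bn, bk) to try per layer; the exact
# meaning and legal values depend on the spatha kernel build being used.
CANDIDATE_CONFIGS = [(32, 32, 32), (32, 64, 32), (64, 64, 32)]

def pick_config(run_spmm, warmup: int = 5, iters: int = 20):
    """Return the fastest configuration for one layer.

    `run_spmm(cfg)` is a user-supplied placeholder that launches the sparse
    kernel for this layer's shapes with the given tile configuration.
    """
    best_cfg, best_ms = None, float("inf")
    for cfg in CANDIDATE_CONFIGS:
        for _ in range(warmup):
            run_spmm(cfg)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            run_spmm(cfg)
        torch.cuda.synchronize()
        ms = (time.perf_counter() - start) * 1e3 / iters
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg

# Usage sketch: tune once per layer shape, then key the chosen config by the
# layer name so each sparse linear layer launches with its own tiles.
# layer_configs = {name: pick_config(lambda cfg: sparse_forward(layer, cfg))
#                  for name, layer in sparse_layers.items()}
```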

BNAadministrator3 commented 4 months ago

Two questions:

  1. How can I improve the end-to-end speedups of DeiT-series networks? I find that for DeiT networks, 2:8:128 sparse DeiTs are slower than their dense counterparts. In the figure below, the numbers 0, 1, and 2 in the "algo" column represent DeiT-tiny, -small, and -base, respectively, and "bs" denotes the batch size. It can be seen that in most settings the inference time of the 2:8:128 sparse networks is longer than that of the dense networks. The scripts used for this experiment are exactly those in the "end2end" folder of the venom project: [benchmark table screenshot]

I found that the reason v:n:m sparse DeiT models are slower than their dense equivalents is that the "contiguous" operator takes up most of the time when the hidden dimensions of DeiT are relatively small compared to BERT or GPT-2.

[profiler screenshot]

To be specific, I created a single-layer model consisting of only a linear layer whose shape matches DeiT-tiny's hidden dimension. Then I profiled the model's inference time in its dense and 2:8:128 sparse forms, respectively. The results shown below demonstrate that the contiguous operator is time-consuming: [profiler screenshot]

Is there any way to reduce this overhead? For instance, could the transposition of the input be made implicit? Or, besides $Y=WX^T$, is a VENOM sparse version of $Y=XW^T$ available? The transposition of W could then be done offline, which might save time (see the sketch below).
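A dense PyTorch sketch of the two formulations, only to show the data-movement difference (it does not call the VENOM kernels; the sizes roughly follow DeiT-tiny and are my assumptions): with $Y=WX^T$ the activation must be transposed and made contiguous on every forward pass, whereas with $Y=XW^T$ the transpose of W can be materialized once, offline.

```python
import torch

B, D = 128 * 197, 192          # tokens x hidden dim, roughly DeiT-tiny sizes
x = torch.randn(B, D, dtype=torch.half, device="cuda")   # activations
w = torch.randn(D, D, dtype=torch.half, device="cuda")   # weight

# Current formulation, Y = W X^T: the activation has to be transposed and
# made contiguous on every forward pass before the kernel can run.
xt = x.t().contiguous()        # per-iteration transpose + copy of X
y1 = (w @ xt).t()              # result transposed back to (B, D)

# Alternative formulation, Y = X W^T: W^T can be materialized once, offline,
# so the per-iteration contiguous copy of the (large) activation disappears.
wt = w.t().contiguous()        # one-time, offline
y2 = x @ wt

# Same math, different data movement; the difference should be near zero.
print((y1 - y2).abs().max())
```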

  2. Does the v:n:m SpMM kernel have the potential to be used for activation sparsity? If so, how should I adapt the kernel?

Looking forward to your reply.