dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

[Sparse Attention][Performance] Accelerate the performance of sparse attention + Benchmark #1397

Open sxjscience opened 3 years ago

sxjscience commented 3 years ago

We have an ongoing effort to support sparse attention in GluonNLP: https://github.com/dmlc/gluon-nlp/pull/1395. To better accelerate the related kernels, we can compare the performance of several potential solutions, including:
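As a point of reference for what these kernels compute, here is a minimal NumPy sketch (not GluonNLP code) contrasting dense scaled dot-product attention with a local-window sparse variant, one of the common sparsity patterns; all names here are illustrative:

```python
import numpy as np

def dense_attention(q, k, v):
    # Standard scaled dot-product attention: O(n^2) time and memory.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def local_window_attention(q, k, v, window=2):
    # Sparse variant: each query attends only to keys inside a fixed
    # local window, cutting the work from O(n^2) to O(n * window).
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        s = q[i] @ k[lo:hi].T / np.sqrt(d)
        s -= s.max()
        w = np.exp(s)
        w /= w.sum()
        out[i] = w @ v[lo:hi]
    return out
```

When the window covers the full sequence, the sparse variant reduces to the dense one, which gives an easy correctness check before benchmarking.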

sxjscience commented 3 years ago

@ZiyueHuang I created this issue to discuss how we may use TVM to speed up these kernels.
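Whatever backend is chosen (TVM or otherwise), the candidate kernels need a common timing harness. A minimal sketch of one, using only the standard library and NumPy; the `bench` helper and the toy masked workload are illustrative, not part of GluonNLP:

```python
import time
import numpy as np

def bench(fn, *args, repeat=10):
    # Best-of-`repeat` wall-clock time, after one warm-up call.
    fn(*args)  # warm up caches / lazy initialization
    times = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return min(times)

# Example: compare a dense matmul against a toy masked variant
# (stand-ins for the dense and sparse attention kernels).
n = 128
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
b = rng.standard_normal((n, n))
mask = np.tril(np.ones((n, n), dtype=bool))  # causal-style sparsity pattern

t_dense = bench(lambda x, y: x @ y, a, b)
t_masked = bench(lambda x, y: np.where(mask, x @ y, 0.0), a, b)
```

Reporting best-of-N rather than the mean filters out scheduler noise, which matters when the kernels under comparison differ by small constant factors.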