AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code, specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
[Feature Request] Compressed-tile Matrix Multiply for autoregressive LLM inference #699
Is your feature request related to a problem? Please describe.
I would like to request the implementation of a compressed tiled matrix multiply operator for use in large language model inference. This feature would open a path to faster inference for autoregressive LLMs that have undergone unstructured pruning. LLM autoregressive inference is notoriously memory bound, with GPU compute utilization below 1% being commonplace. The main cause of this bottleneck is that the weight tensors run to hundreds of gigabytes for the largest models, while very few computations are performed on them, especially in single-batch inference, which is unfortunately very common when low latency is required.
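To make the memory-bound claim concrete (illustrative numbers, not measurements): a 70B-parameter model in FP16 occupies roughly 140 GB, and decoding a single token at batch size 1 reads every weight once while performing about 2 FLOPs per weight. On a GPU with roughly 2 TB/s of HBM bandwidth and 300 TFLOPS of FP16 tensor-core throughput,

$$
t_{\text{mem}} \approx \frac{140\ \text{GB}}{2\ \text{TB/s}} = 70\ \text{ms}
\qquad
t_{\text{compute}} \approx \frac{2 \times 70 \times 10^{9}\ \text{FLOP}}{300 \times 10^{12}\ \text{FLOP/s}} \approx 0.5\ \text{ms},
$$

i.e. an arithmetic intensity of about 1 FLOP per byte, so the compute units sit idle more than 99% of the time and any reduction in bytes read from DRAM translates almost directly into lower latency.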
The simplest solution to this problem is to keep compressed tensors in DRAM and decompress them on the fly while computing the matrix multiplication. As matrix multiplication is already performed on tiles to maximize memory locality, I do not think compressing the tiles beforehand would be difficult to integrate.
Describe the solution you'd like
I would like to propose a new primitive where the inputs are large matrices that have been chunked and compressed beforehand. The compression algorithm would be implemented by first applying byte shuffling and bit shuffling filters on the tile. See https://earthscience.stackexchange.com/questions/12527/regarding-compression-shuffle-filter-of-netcdf4 for an explanation of shuffle filtering and https://github.com/kiyo-masui/bitshuffle for the implementation of bit shuffling. An alternative shuffling method may be more appropriate for floating point data.
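For concreteness, here is a minimal host-side sketch of the byte-shuffle filter for an FP16 tile; bitshuffle (linked above) applies the same idea at bit granularity and with SIMD. The function names are illustrative only, not part of AITemplate:

```cpp
// Byte-shuffle filter (the "shuffle" step used by HDF5/Blosc/bitshuffle):
// for n fp16 values, emit all low bytes first, then all high bytes, so a
// general-purpose compressor (LZ4, zstd, ...) sees long runs of similar
// sign/exponent bytes.
#include <cstdint>
#include <cstddef>
#include <vector>

std::vector<uint8_t> byte_shuffle_fp16(const uint16_t* tile, size_t n) {
    std::vector<uint8_t> out(2 * n);
    for (size_t i = 0; i < n; ++i) {
        out[i]     = static_cast<uint8_t>(tile[i] & 0xFF);  // low (mantissa) bytes
        out[n + i] = static_cast<uint8_t>(tile[i] >> 8);    // high (sign/exponent) bytes
    }
    return out;
}

// Inverse filter, applied after decompression and before the tile is used.
void byte_unshuffle_fp16(const uint8_t* in, uint16_t* tile, size_t n) {
    for (size_t i = 0; i < n; ++i)
        tile[i] = static_cast<uint16_t>(in[i]) |
                  (static_cast<uint16_t>(in[n + i]) << 8);
}
```

After this filter, a byte-oriented compressor tends to find much longer matches because the sign/exponent bytes of neighboring weights are nearly identical.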
Decompression of the tiles would then occur on shared memory, from where they could be fed into the tiled matrix multiplication. Also, L2 cache utilization would likely be higher due to the smaller data size.
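As a rough illustration of where decompression would slot into a classic shared-memory tiled GEMM, here is a sketch that assumes a deliberately simple per-tile encoding (a presence bitmask plus packed non-zero FP16 values) instead of LZ4 + shuffle; the format, the `CompressedTile` struct and the kernel name are hypothetical, not an existing AITemplate op:

```cuda
// Sketch only: B is stored as (K/TILE) x (N/TILE) compressed tiles, each a
// row-major presence bitmask plus its packed non-zero fp16 values.
// M, N, K are assumed to be multiples of TILE; launch with
// dim3 block(TILE, TILE), dim3 grid(N/TILE, M/TILE).
#include <cuda_fp16.h>
#include <cstdint>

constexpr int TILE = 32;  // one thread per tile element

struct CompressedTile {
    const uint32_t* mask;    // TILE*TILE bits, row-major, 1 = non-zero
    const __half*   values;  // packed non-zeros in row-major order
};

__global__ void gemm_compressed_b(const __half* A, const CompressedTile* Btiles,
                                  float* C, int M, int N, int K) {
    __shared__ __half As[TILE][TILE];
    __shared__ __half Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C
    float acc = 0.f;

    for (int kt = 0; kt < K / TILE; ++kt) {
        // Stage the dense A tile as usual.
        As[threadIdx.y][threadIdx.x] = A[row * K + kt * TILE + threadIdx.x];

        // Decompress this block's B tile straight into shared memory.
        const CompressedTile t = Btiles[kt * (N / TILE) + blockIdx.x];
        int elem = threadIdx.y * TILE + threadIdx.x;  // element index within the tile
        uint32_t word = t.mask[elem / 32];
        bool nonzero = (word >> (elem % 32)) & 1u;
        // Rank of this element among the tile's non-zeros = popcount of all
        // preceding mask bits (full words, then the partial word).
        int rank = 0;
        for (int w = 0; w < elem / 32; ++w) rank += __popc(t.mask[w]);
        rank += __popc(word & ((1u << (elem % 32)) - 1u));
        Bs[threadIdx.y][threadIdx.x] = nonzero ? t.values[rank] : __float2half(0.f);
        __syncthreads();

        // Ordinary shared-memory inner loop on the now-dense tiles.
        for (int k = 0; k < TILE; ++k)
            acc += __half2float(As[threadIdx.y][k]) * __half2float(Bs[k][threadIdx.x]);
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Each compressed tile is decompressed once into shared memory and then reused TILE times by the inner-product loop, so the decompression cost is amortized the same way the global-memory loads already are.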
This repository may be useful for cases where the weights have been quantized to integers. https://github.com/powturbo/TurboPFor-Integer-Compression
See https://github.com/Blosc/c-blosc2 for algorithms and design patterns on compression.
Describe alternatives you've considered
In-memory compression is available in data center GPUs such as the A100 or the H100. However, compressible memory cannot be requested through the CUDA runtime API and must be allocated via the driver API, making it difficult to integrate with existing libraries. Not everyone has access to data center GPUs, and a software implementation would make this feature available even on consumer GPUs. Also, hardware in-memory compression does not reduce the memory assigned in HBM or the L2 cache, making it ineffective at reducing the memory footprint.
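For reference, compressible memory on these GPUs is requested through the driver's virtual memory management API rather than cudaMalloc; a rough sketch (error handling, cuInit, and context setup omitted) looks like the following, which is what makes it awkward to combine with libraries built around runtime-API allocations:

```cpp
// Rough sketch of allocating compressible device memory with the CUDA driver
// VMM API. cudaMalloc has no equivalent flag.
#include <cuda.h>

CUdeviceptr alloc_compressible(size_t bytes, int device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t padded = (bytes + gran - 1) / gran * gran;

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, padded, &prop, 0);

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, padded, 0, 0, 0);
    cuMemMap(ptr, padded, 0, handle, 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, padded, &access, 1);
    return ptr;
}
```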
The nvCOMP library (https://github.com/NVIDIA/nvcomp) provides implementations of LZ4 and other compression algorithms. However, it is no longer open source and does not implement shuffling filters. Moreover, it cannot be integrated with tiled matrix multiplication.
Additional context
Unstructured pruning is the easiest kind of model compression to apply but also the least useful because no calculations can be skipped. However, in the highly memory-constrained case of LLM inference, the main bottleneck is DRAM/HBM memory size and read speed, both of which can be alleviated via tensor compression. Assuming that the models have been sparsified sufficiently, even ten-fold memory reduction and acceleration are feasible. Also, it may become possible to perform LLM inference on consumer GPUs at reasonable latency.