Pull Request: LLM.int8() Hardware Design Implementation for Linear Layer
Overview
This PR introduces the LLM.int8() method implementation, focusing on hardware optimizations for large language models (LLMs) through dynamic 8-bit quantization of non-outlier features in a linear layer. Our goal is to reduce area and power while maintaining accuracy by using dual-precision multiplication.
Added Features
Scatter Module: A new module that allocates outlier activations to a high precision matrix and other activations to a low precision matrix. Compares activation absolute values against a threshold to identify outliers, directing these to a high precision matrix with an associated mask. Uses priority encoders to handle cases with more outliers than high precision slots, prioritizing outliers based on their index positions.
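The scatter behaviour can be sketched as a short software model (a minimal sketch only; the function name and the NumPy array representation are illustrative, not the RTL interface):

```python
import numpy as np

def scatter(activations, threshold, num_high_slots):
    """Split activations into high- and low-precision matrices.

    Outliers (|x| > threshold) go to the high-precision matrix; the rest
    stay in the low-precision matrix. A mask records outlier positions.
    When there are more outliers than high-precision slots, lower indices
    win, mirroring the index-based priority described above.
    """
    activations = np.asarray(activations, dtype=np.float32)
    is_outlier = np.abs(activations) > threshold
    # Lowest indices take priority when outliers exceed available slots.
    outlier_idx = np.flatnonzero(is_outlier)[:num_high_slots]

    high = np.zeros(num_high_slots, dtype=np.float32)
    high[: len(outlier_idx)] = activations[outlier_idx]

    low = activations.copy()
    low[outlier_idx] = 0.0  # removed from the low-precision path

    mask = np.zeros_like(is_outlier)
    mask[outlier_idx] = True
    return high, low, mask
```

With two high-precision slots and three outliers, the lowest-indexed two are promoted and the remaining outlier falls back to the low-precision path.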
Priority Encoder: A new priority encoder design that supports an arbitrary number of inputs and an arbitrary number of prioritized outputs.
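Functionally, the encoder selects the lowest-indexed asserted inputs, one per output. A behavioural sketch (using -1 as an assumed "invalid" marker; the RTL likely signals validity differently):

```python
def priority_encode(bits, num_outputs):
    """Return the indices of the first num_outputs asserted bits,
    lowest index first; unused output slots are marked invalid (-1)."""
    hits = [i for i, b in enumerate(bits) if b][:num_outputs]
    return hits + [-1] * (num_outputs - len(hits))
```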
Matrix Multiplication Units: Separate units for high and low precision calculations allow for parallel processing. The units operate concurrently on the high and low precision matrices, balancing the processing workload according to precision requirements.
Gather Module: Updated to combine results from both precision units.
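End to end, the scatter / dual matrix-multiply / gather pipeline computes the equivalent of the following behavioural sketch (the function name, per-row scaling, and int8 range handling are assumptions, not the RTL datapath):

```python
import numpy as np

def dual_precision_linear(x, weight, threshold):
    """Dual-precision linear layer sketch: the high-precision unit handles
    outlier features in full precision, the low-precision unit handles the
    int8-quantized remainder, and gather sums the partial products."""
    x = np.asarray(x, dtype=np.float64)
    mask = np.abs(x) > threshold

    # High-precision unit: outlier activations, full precision.
    high_out = (x * mask) @ weight

    # Low-precision unit: absmax-quantize non-outliers to int8,
    # multiply, then dequantize.
    rest = x * ~mask
    scale = 127.0 / max(np.max(np.abs(rest)), 1e-12)
    q = np.round(rest * scale)
    low_out = (q @ weight) / scale

    # Gather: combine the two partial results.
    return high_out + low_out
```

Because only the small non-outlier values pass through quantization, the combined result stays close to the full-precision product.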
LLM.int8() Software Model: A software model of LLM.int8() is integrated into the hardware testbench to compare the MSE of the LLM.int8() quantization scheme against existing schemes.
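A minimal version of such an MSE comparison could look like this (a sketch under assumed symmetric absmax int8 quantization; function names are hypothetical, not the testbench API):

```python
import numpy as np

def absmax_quantize_int8(x):
    """Symmetric absmax int8 quantize, then dequantize for error measurement."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.clip(np.round(x * scale), -127, 127)
    return q / scale

def llm_int8_mse(x, threshold):
    """MSE of an LLM.int8()-style scheme: outliers kept in full precision,
    remaining values absmax-quantized to int8."""
    x = np.asarray(x, dtype=np.float64)
    mask = np.abs(x) > threshold
    out = x.copy()
    out[~mask] = absmax_quantize_int8(x[~mask])
    return np.mean((x - out) ** 2)
```

When activations contain a large outlier, keeping it in full precision avoids inflating the quantization scale, so the LLM.int8()-style MSE should come out well below naive whole-tensor absmax quantization.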
Configurability
Parameters such as precision levels, input dimensions, and threshold values can be adjusted to suit different models and requirements.
The current design is optimized for 1D arrays, with potential for 2D array support by unpacking the feature matrix.
Impact
The integration of LLM.int8() hardware aims to reduce area and power usage for neural network and LLM inference, without compromising output accuracy.
Running Testbenches
Additions
mase_components/
priority_encoder/
rtl/
priority_encoder.c
scatter/
rtl/
test/
gather/
rtl/
test/
llmint/
rtl/
test/
README.md