DeepWok / mase

Machine-Learning Accelerator System Exploration Tools

Group 14: Hardware support for LLM.int8() #73

Closed alexfrater closed 6 months ago

alexfrater commented 6 months ago

Pull Request: LLM.int8() Hardware Design Implementation for Linear Layer

Overview

This PR introduces a hardware implementation of the LLM.int8() method, targeting large language model (LLM) inference through dynamic 8-bit quantization of non-outlier features in a linear layer. Our goal is to reduce area and power while maintaining accuracy by using dual-precision multiplication.
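As a minimal NumPy sketch of the decomposition (per-tensor scales and a hypothetical threshold of 6.0 are used for brevity; this models the idea, not the hardware datapath):

```python
import numpy as np

def llm_int8_matmul(x, w, threshold=6.0):
    """Sketch of the LLM.int8() decomposition for a 1D activation vector x
    and weight matrix w: outlier features go through a high-precision
    multiply, the rest are quantised to int8, multiplied, and dequantised."""
    outlier = np.abs(x) > threshold                 # outlier feature mask

    # High-precision path: outlier features only.
    y_hi = (x * outlier) @ w

    # Low-precision path: int8-quantise the non-outlier features and weights.
    lo = x * ~outlier
    sx = max(np.abs(lo).max() / 127, 1e-8)          # activation scale
    sw = max(np.abs(w).max() / 127, 1e-8)           # weight scale
    xq = np.round(lo / sx).astype(np.int8)
    wq = np.round(w / sw).astype(np.int8)
    y_lo = (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)

    # Gather: recombine the two partial products.
    return y_hi + y_lo
```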


Added Features

Scatter Module: A new module that routes outlier activations to a high-precision matrix and the remaining activations to a low-precision matrix. It compares each activation's absolute value against a threshold to identify outliers, directing these to the high-precision matrix and recording them in an associated mask. Priority encoders handle cases with more outliers than high-precision slots, prioritizing outliers based on their index positions.
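A rough Python reference of this routing behaviour (the slot and parameter names here are illustrative, not the module's actual ports):

```python
import numpy as np

def scatter(x, threshold, num_hi_slots):
    """Split activations into high- and low-precision buffers plus a mask.

    |x[i]| > threshold marks an outlier. If there are more outliers than
    high-precision slots, the lowest indices win (the priority-encoder
    behaviour); overflow outliers fall back to the low-precision path."""
    hi = np.zeros_like(x)
    lo = np.zeros_like(x)
    mask = np.zeros(len(x), dtype=np.uint8)

    slots_used = 0
    for i, v in enumerate(x):
        if abs(v) > threshold and slots_used < num_hi_slots:
            hi[i] = v
            mask[i] = 1
            slots_used += 1
        else:
            lo[i] = v
    return hi, lo, mask
```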

Priority Encoder: A new parameterisable priority encoder that supports an arbitrary number of inputs and priority outputs.
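For reference, a behavioural sketch of such an encoder (pure Python; the interface is illustrative):

```python
def priority_encode(request_bits, num_outputs):
    """Return the indices of up to `num_outputs` asserted bits, with the
    lowest index having the highest priority; unused outputs are -1."""
    granted = [i for i, bit in enumerate(request_bits) if bit][:num_outputs]
    return granted + [-1] * (num_outputs - len(granted))

# Example: 6 request lines, 2 priority outputs.
assert priority_encode([0, 1, 0, 1, 1, 0], num_outputs=2) == [1, 3]
```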

Matrix Multiplication Units: Separate units for high- and low-precision calculations allow for parallel processing; they operate concurrently on the high- and low-precision matrices to balance the processing workload according to precision requirements.

Gather Module: Updated to combine results from both precision units.
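Behaviourally, the two multiplication units and the gather step can be modelled together as below (thread-level concurrency stands in for the parallel hardware units; all names are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def dual_precision_linear(x_hi, x_lo_q, w_fp, w_q, scale):
    """x_hi: outlier activations (float); x_lo_q: int8 non-outlier activations;
    w_fp / w_q: float and int8 copies of the weights; scale: dequantisation
    factor for the int8 path. The two products read disjoint inputs, so they
    are evaluated concurrently before the gather step combines them."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        hi_part = pool.submit(lambda: x_hi @ w_fp)
        lo_part = pool.submit(
            lambda: x_lo_q.astype(np.int32) @ w_q.astype(np.int32))
    # Gather: dequantise the int8 accumulator and sum the partial products.
    return hi_part.result() + lo_part.result() * scale
```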

LLM.int8() Software Model: A software model of LLM.int8() is integrated into the hardware testbench to compare the MSE of the LLM.int8() quantisation scheme against existing schemes.
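A minimal sketch of that comparison, reusing the `llm_int8_matmul` helper from the overview sketch above and a plain per-tensor int8 baseline (synthetic data, illustrative only):

```python
import numpy as np

def int8_matmul(x, w):
    """Baseline: quantise everything, outliers included, with one int8 scale."""
    sx = max(np.abs(x).max() / 127, 1e-8)
    sw = max(np.abs(w).max() / 127, 1e-8)
    xq = np.round(x / sx).astype(np.int8)
    wq = np.round(w / sw).astype(np.int8)
    return (xq.astype(np.int32) @ wq.astype(np.int32)) * (sx * sw)

rng = np.random.default_rng(0)
x = rng.normal(size=256)
x[rng.choice(256, size=4, replace=False)] *= 20       # inject a few outliers
w = rng.normal(size=(256, 64))

y_ref = x @ w                                          # full-precision reference
mse_llm_int8 = np.mean((llm_int8_matmul(x, w) - y_ref) ** 2)
mse_plain_int8 = np.mean((int8_matmul(x, w) - y_ref) ** 2)
print(f"LLM.int8() MSE: {mse_llm_int8:.3e}   plain int8 MSE: {mse_plain_int8:.3e}")
```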

Configurability

Parameters such as precision levels, input dimensions, and threshold values can be adjusted to suit different models and requirements. The current design is optimized for 1D arrays, with potential for 2D array support by unpacking the feature matrix.
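For illustration only, such a parameter set might look like the following in the software model (hypothetical names and fixed-point formats; the actual RTL parameters may differ):

```python
# Hypothetical configuration for the software-model sketches above;
# these names are illustrative and not the actual RTL parameters.
config = {
    "HIGH_PRECISION": (16, 8),   # (total bits, fractional bits), outlier path
    "LOW_PRECISION": (8, 4),     # (total bits, fractional bits), int8 path
    "IN_FEATURES": 16,           # length of the 1D activation array
    "OUT_FEATURES": 4,           # output dimension of the linear layer
    "THRESHOLD": 6.0,            # |activation| above this is an outlier
    "HIGH_SLOTS": 2,             # high-precision slots available in the scatter
}
```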

Impact

The integration of LLM.int8() hardware aims to reduce area and power usage for neural network and LLM inference without compromising output accuracy.

Running Testbenches

python3 machop/mase_components/llm_int/test/LLMint_tb.py

python3 machop/mase_components/scatter/test/scatter_threshold_tb.py

python3 machop/mase_components/scatter/test/gather_tb.py

Additions