DeepWok / mase

Machine-Learning Accelerator System Exploration Tools

ADLS Group 7 LLM int #84

Closed Zixian-Jin closed 3 months ago

Zixian-Jin commented 3 months ago

Group 7: LLM.int8() Hardware Integration for MASE

Overview

LLM.int8() is a state-of-the-art GPU implementation for large language model inference. It scatters a matrix into two groups, a low-precision matrix and a high-precision matrix, and computes them separately on efficient hardware. This project implements the LLM.int8() algorithm on FPGA using the existing linear-layer components in MASE. (Figure: data_stream_no_csr)

File Hierarchy

Most of the design files and testbenches are under the folder MASE_RTL/llm, where MASE_RTL = ROOT/machop/mase_components/ and ROOT is the root directory of the MASE project.


Testbench Settings

1. How to Run

Please ensure that you have installed the full environment required for MASE. Specifically, make sure Cocotb and Verilator have been successfully installed.

To test a .sv design file example.sv, navigate to the testbench folder ROOT/machop/mase_components/llm/test and run the corresponding .py file:

python ./example_tb.py

2. Generating Input Data for DUT

The testbench file (e.g., llm_int8_top_tb.py) uses RandomSource, defined in ROOT/machop/mase_cocotb/random_test.py, to generate random input tensors for the SV module llm_int8_top.sv. It can generate two patterns of data, depending on the arithmetic argument passed to it.

You can of course tune the parameters of the random number generator to produce tensors with different distributions. However, please keep in mind that the magnitude of the generated data must be carefully controlled to avoid overflow in the DUT computation.
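For illustration, here is a minimal sketch of bounded stimulus generation in plain NumPy (not the actual RandomSource API); the magnitude bound max_abs, the outlier parameters, and the tensor shape are assumptions for this example only.

```python
import numpy as np

def bounded_random_tensor(rows, cols, max_abs=16, outlier_ratio=0.02, outlier_abs=200, seed=0):
    """Generate a mostly small-magnitude integer tensor with a few large 'outlier' entries,
    mimicking the activation distribution that LLM.int8() targets. max_abs bounds the bulk
    of the values so the Int8 matmul path does not overflow its accumulator."""
    rng = np.random.default_rng(seed)
    x = rng.integers(-max_abs, max_abs + 1, size=(rows, cols))
    # Sprinkle a small fraction of large-magnitude outliers (these take the FP16 path).
    mask = rng.random((rows, cols)) < outlier_ratio
    outliers = rng.integers(-outlier_abs, outlier_abs + 1, size=(rows, cols))
    return np.where(mask, outliers, x)

if __name__ == "__main__":
    print(bounded_random_tensor(4, 8))
```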

3. Sample Simulation Result

Here is an example result from running the testbench llm_int8_top_tb.py.

--------------------- Error Analysis --------------------
Sample Num=100
No. Samples above Error Thres(10)=0
Absolute Error: max=10, avg=4
Relative Error: max=1.12%, avg=0.35%
--------------------- End of Error Analysis --------------------

  3718.00ns INFO     cocotb.regression                  test_llm_int8_quant_tb passed
  3718.00ns INFO     cocotb.regression                          
    ************************************************************************************************
    ** TEST                                    STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
    ************************************************************************************************
    ** llm_int8_top_tb.test_llm_int8_quant_tb   PASS        3718.00           0.24      15345.27  **
    ************************************************************************************************
    ** TESTS=1 PASS=1 FAIL=0 SKIP=0                         3718.00           1.10       3366.91  **
    ************************************************************************************************                                          
- :0: Verilog $finish
INFO: Results file: /home/ic/MYWORKSPACE/Mase-DeepWok/machop/mase_components/llm/test/build/llm_int8_top/test_0/results.xml
TEST RESULTS
    PASSED: 1
    FAILED: 0
    NUM TESTS: 1

The test report contains two sections: a custom error analysis that compares the DUT outputs against a software reference, and the standard cocotb regression summary.
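The following is a minimal sketch (NumPy, with hypothetical function and argument names) of how the reported error metrics could be computed from the DUT outputs and a software reference; the threshold of 10 matches the "Error Thres" value shown above.

```python
import numpy as np

def error_analysis(dut_out, ref_out, abs_threshold=10):
    """Summarise absolute and relative error between DUT outputs and a software reference."""
    dut = np.asarray(dut_out, dtype=np.float64)
    ref = np.asarray(ref_out, dtype=np.float64)
    abs_err = np.abs(dut - ref)
    # Guard against division by zero for exact-zero reference values.
    rel_err = abs_err / np.maximum(np.abs(ref), 1e-12)
    print("Sample Num=%d" % dut.size)
    print("No. Samples above Error Thres(%d)=%d" % (abs_threshold, int((abs_err > abs_threshold).sum())))
    print("Absolute Error: max=%.2f, avg=%.2f" % (abs_err.max(), abs_err.mean()))
    print("Relative Error: max=%.2f%%, avg=%.2f%%" % (100 * rel_err.max(), 100 * rel_err.mean()))
```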

Implementation Details

Since the FPGA has limited I/O ports and memory units, the LLM matrix is partitioned into small sub-matrices of identical size N*M. Each sub-matrix is flattened into a 1-D vector and streamed into the FPGA for computation. (Figure: mat_mul_top_level)
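As an illustration only (plain NumPy, with hypothetical tile sizes), a large matrix might be tiled and flattened for streaming roughly as follows:

```python
import numpy as np

def tile_and_flatten(x, n, m):
    """Partition a matrix into n-by-m tiles and flatten each tile into a 1-D vector,
    in the order it would be streamed into the accelerator (row-major over tiles)."""
    rows, cols = x.shape
    assert rows % n == 0 and cols % m == 0, "matrix must divide evenly into n x m tiles"
    streams = []
    for i in range(0, rows, n):
        for j in range(0, cols, m):
            streams.append(x[i:i + n, j:j + m].reshape(-1))
    return streams

if __name__ == "__main__":
    x = np.arange(16).reshape(4, 4)
    for vec in tile_and_flatten(x, 2, 2):
        print(vec)
```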

Inside llm_int8_top, the FP16 input activation matrix $X_{f16}$ is passed through a scatter module.

The scatter module detects large-magnitude elements (the emergent outliers described in the LLM.int8() paper [1]) in $X_{f16}$ and places them into the high-precision matrix $X_{HP,f16}$. This matrix is then passed to the module fixed_matmul_core for FP16 matrix multiplication.

The remaining elements of $X_{f16}$ have small magnitudes, so they go into the low-precision matrix $X_{LP,f16}$, which is passed to the module quantized_matmul for Int8 quantization, Int8 matrix multiplication, and de-quantization.

The outputs of the two matmul components are both in FP16 precision and are gathered into the final output matrix. A software sketch of this scatter/gather flow is given below.
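To make the dataflow concrete, here is a minimal software reference sketch of the scatter / dual-matmul / gather flow (NumPy; the column-wise outlier detection, the threshold value, and the per-tensor quantization scheme are assumptions for this sketch and do not necessarily match the RTL):

```python
import numpy as np

def llm_int8_matmul_ref(x_f16, w_f16, outlier_threshold=6.0):
    """Software sketch of an LLM.int8()-style decomposition: columns of x containing an
    outlier take the high-precision path, the remaining columns take the Int8 path."""
    x = x_f16.astype(np.float32)
    w = w_f16.astype(np.float32)

    # Scatter: columns of x with any large-magnitude element form the high-precision matrix.
    outlier_cols = np.abs(x).max(axis=0) > outlier_threshold
    x_hp, w_hp = x[:, outlier_cols], w[outlier_cols, :]
    x_lp, w_lp = x[:, ~outlier_cols], w[~outlier_cols, :]

    # High-precision path: plain floating-point matmul on the outlier columns.
    y_hp = x_hp @ w_hp

    # Low-precision path: symmetric per-tensor Int8 quantization, Int8 matmul, de-quantization.
    sx = np.abs(x_lp).max() / 127.0 if x_lp.size else 1.0
    sw = np.abs(w_lp).max() / 127.0 if w_lp.size else 1.0
    xq = np.round(x_lp / sx).astype(np.int32)
    wq = np.round(w_lp / sw).astype(np.int32)
    y_lp = (xq @ wq).astype(np.float32) * sx * sw

    # Gather: the two partial products sum to the final FP16 output.
    return (y_hp + y_lp).astype(np.float16)
```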

The top-level module llm_int8_top adopts a dataflow architecture and is deeply pipelined. Each stage in the pipeline communicates with its upstream and downstream stages through handshake protocols.
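For reference, here is a hedged cocotb-style sketch of driving one such valid/ready handshake from a testbench; the port names clk, data_in, data_in_valid, and data_in_ready are hypothetical and would need to match the actual RTL interface.

```python
from cocotb.triggers import RisingEdge

async def send_beat(dut, word):
    """Drive one data beat through a valid/ready handshake: assert valid, hold the data
    until the consumer asserts ready on a rising clock edge, then deassert valid."""
    dut.data_in.value = word
    dut.data_in_valid.value = 1
    while True:
        await RisingEdge(dut.clk)
        if dut.data_in_ready.value:
            break
    dut.data_in_valid.value = 0
```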

Design details of the sub-modules can be found here.

References

  1. Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022.
  2. Jianyi Cheng, Cheng Zhang, Zhewen Yu, Alex Montgomerie-Corcoran, Can Xiao, Christos-Savvas Bouganis, and Yiren Zhao. Fast Prototyping Next-Generation Accelerators for New ML Models Using MASE: ML Accelerator System Exploration, 2023.
  3. Machine-Learning Accelerator System Exploration Tools (the DeepWok/mase repository).