ReaLLMASIC / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Quantization and Binarization Implementations for Linear Layers #184

Closed: mmoffatt2 closed this 1 week ago

mmoffatt2 commented 2 months ago

Following the code from the Quantization Repo, I integrated quantization into the nanoGPT repo.

Added a new QuantizedLinear layer that replaces the linear layers and simulates quantization-aware training. Also added a binarization layer.
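
For readers unfamiliar with the approach, here is a minimal, self-contained sketch of what fake-quantized and binarized linear layers of this kind typically look like. The symmetric 8-bit scheme, the BinarizedLinear name, and the straight-through-estimator details are illustrative, not necessarily the PR's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    """Drop-in replacement for nn.Linear that fake-quantizes its weights each forward pass."""
    def __init__(self, in_features, out_features, bias=True, num_bits=8):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=bias)
        self.num_bits = num_bits

    def forward(self, x):
        w = self.linear.weight
        # Symmetric per-tensor quantization: round onto a signed integer grid,
        # then dequantize so the matmul still runs in floating point.
        qmax = 2 ** (self.num_bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        # Straight-through estimator: gradients flow to w as if no rounding happened.
        w_q = w + (w_q - w).detach()
        return F.linear(x, w_q, self.linear.bias)

class BinarizedLinear(nn.Module):
    """Sign-binarized variant (BinaryConnect-style), also with a straight-through estimator."""
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=bias)

    def forward(self, x):
        w = self.linear.weight
        w_b = torch.sign(w) * w.abs().mean()  # scale the signs by the mean magnitude
        w_b = w + (w_b - w).detach()
        return F.linear(x, w_b, self.linear.bias)
```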

gkielian commented 2 months ago

Awesome! This trains and converges!

We should check what the output from quantized inference looks like with sample.py. When running sample.py on the created checkpoints, it encounters unexpected keys in the state_dict (see the attached screenshot of the error).

Could you take a look into how to test with the sample.py (feel free to modify if needed)?
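
One quick way to diagnose the mismatch, sketched here under the assumption that the fork keeps nanoGPT's usual ckpt.pt layout and the GPT/GPTConfig entry points in model.py, is to rebuild the model the way sample.py does and load the state_dict non-strictly so the offending keys are printed instead of raising:

```python
import torch
from model import GPTConfig, GPT  # assumes the fork keeps nanoGPT's model.py entry points

# Load the training checkpoint and rebuild the model, then load non-strictly so
# mismatched keys are reported instead of raising an error.
ckpt = torch.load('out_shakespeare_char/ckpt.pt', map_location='cpu')
state_dict = ckpt['model']  # may also need the '_orig_mod.' prefix stripped, as sample.py does

model = GPT(GPTConfig(**ckpt['model_args']))
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print('missing keys:', missing)        # parameters the checkpoint does not provide
print('unexpected keys:', unexpected)  # e.g. quantization scales/zero-points absent from the inference model
```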

mmoffatt2 commented 2 months ago

Unfortunately, I didn't think ahead and split these changes into multiple PRs. I will try to summarize all the changes as best I can.

Updates:

- Added two options for quantization: "affine_quant" and "stochastic_quant" (see the sketch after this list)
- Added QuantizedEmbedding to position_encoding_variations.py, which can be used for word embeddings and position embeddings
- Added a quantize_helper function, which is used for quantization of the outputs of matrix multiplications and linear layers
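
For context, a rough sketch of what the two options could look like inside a quantize_helper-style function; the exact signature, per-tensor scaling, and straight-through handling in the PR may differ:

```python
import torch

def quantize_helper(x, num_bits=8, method="affine_quant"):
    """Fake-quantize a tensor and return it in floating point.

    "affine_quant": deterministic round-to-nearest on a per-tensor affine grid.
    "stochastic_quant": same grid, but rounds up or down with probability
    proportional to the fractional part, keeping the quantizer unbiased in expectation.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = x / scale + zero_point
    if method == "affine_quant":
        q = torch.round(q)
    elif method == "stochastic_quant":
        q = torch.floor(q + torch.rand_like(q))  # stochastic rounding
    else:
        raise ValueError(f"unknown quantization method: {method}")
    q = torch.clamp(q, qmin, qmax)
    x_q = (q - zero_point) * scale
    return x + (x_q - x).detach()  # straight-through estimator for training
```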

train.py arguments:

- "quantization_linear_method": quantization method used for all quantized linear layers
- "quantize_wte": whether the word embedding weights are quantized (using QuantizedEmbedding)
- "quantize_wpe": whether the position embedding weights are quantized (using QuantizedEmbedding)
- "quantization_embedding_method": method of quantization used for embeddings
- "quantize_attn_all": boolean argument that quantizes everything in the attn layer, including outputs and linear layers
- "quantize_c_attn_q": whether the query weights are quantized (using QuantizedLinear)
- "quantize_c_attn_k": whether the key weights are quantized (using QuantizedLinear)
- "quantize_c_attn_v": whether the value weights are quantized (using QuantizedLinear)
- "quantize_q": whether the query output is quantized (using quantize_helper)
- "quantize_k": whether the key output is quantized (using quantize_helper)
- "quantize_v": whether the value output is quantized (using quantize_helper)
- "quantize_q_k_mult": whether the output of the query-key matrix multiplication is quantized (using quantize_helper)
- "quantize_softmax_v_mult": whether the output of the softmax-value matrix multiplication is quantized (using quantize_helper)
- "quantize_softmax": whether the output of the softmax variant is quantized (using quantize_helper)
- "quantize_attn_proj": whether the output projection is quantized (using QuantizedLinear)
- "quantize_mlp_all": boolean argument that quantizes everything in the mlp layer, including activation outputs and linear layers
- "quantize_mlp_up": whether the mlp_up weights are quantized (using QuantizedLinear)
- "quantize_mlp_down": whether the mlp_down weights are quantized (using QuantizedLinear)
- "quantize_activation": whether the output of the activation function is quantized (using quantize_helper)
- "quantization_activation_method": method of quantization for the activation output

Example command for running train.py:
python3 train.py --device="cuda" --dataset="shakespeare_char" --out_dir="out_shakespeare_char" --linear_variant="quantized_linear" --quantize_attn_all --quantize_mlp_all --quantize_wte --quantize_wpe

Example command for running sample.py:
python3 sample.py --out_dir="out_shakespeare_char" --device="cuda" --quant_weights_file="shakespeare" --visualize_weights_dir="shakespeare_weights"
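
For longer experiments, the same train.py flags could also be collected in a nanoGPT-style config file, assuming the fork keeps configurator.py's convention of executing a Python config that overrides train.py defaults (the filename below is hypothetical):

```python
# config/quantize_shakespeare_char.py (hypothetical) -- overrides train.py defaults
device = 'cuda'
dataset = 'shakespeare_char'
out_dir = 'out_shakespeare_char'

linear_variant = 'quantized_linear'
quantization_linear_method = 'affine_quant'
quantization_embedding_method = 'affine_quant'
quantize_attn_all = True
quantize_mlp_all = True
quantize_wte = True
quantize_wpe = True
```

which would then be launched as python3 train.py config/quantize_shakespeare_char.py.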

gkielian commented 1 month ago

Will start diving into the code. One thing: we'll have to separate out the quantization for the attention output projection -- experiments with this show that it might need its own precision (possibly higher) in order to work well.
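
If that bears out, one way to expose it (hypothetical names, continuing the config sketch above; the PR does not define these yet) would be a dedicated bit-width setting for the projection:

```python
# Hypothetical config additions -- not part of the PR yet
quantize_attn_proj = True
quantize_attn_proj_bits = 8   # keep the attention output projection at higher precision
quantize_linear_bits = 4      # lower precision for the remaining quantized linears
```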