Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Finetuning using quantized models #67

Closed · wj210 closed this issue 11 months ago

wj210 commented 12 months ago

Hi, I'd like to ask whether it is possible to fine-tune already-quantized models, such as TheBloke/Llama-2-70B-chat-GPTQ from Hugging Face.

The issue is that I lack the hardware to load the full model before quantizing it: a 70B model does not fit on my 4x A6000 (48GB each), but the quantized model easily fits on a single GPU.

Thanks!

kriskrisliu commented 12 months ago

Thanks for your question. (1) We support quantization-based PEFT on checkpoints in Meta's format or our own format; you can find the example script here. In this case, you can load the quantized 70B model on GPUs like the A6000. Note that ~350GB of CPU RAM is required during model loading. More concretely, assuming we fine-tune the official llama2-70B model via NormBias tuning on 2x A6000, the running script could look like this:

#!/bin/bash

# Please fill the following configurations, 
# and remember to adjust other parameters like batch_size, accum_iter, and so on.
pretrained_path=<path-to-70b-folder>
llama_config=<path-to-70b-params.json>
tokenizer_path=<path-to-tokenizer.model>

pretrained_type=meta_ori
data_config=configs/data/finetune/sg/alpaca.yaml

data_parallel=sdp
model_parallel=1

exp_name=alpaca_llamaPeft_normBias_QF
echo "exp name: $exp_name"
mkdir -p output/"$exp_name"

torchrun --master_port=1112 --nproc_per_node=2 main_finetune.py \
--output_dir output/"$exp_name" --epochs 4 --warmup_epochs 1 \
--batch_size 1 --accum_iter 2 --num_workers 4 \
--max_words 512 \
--lr 0.00005 --min_lr 0.000005 --clip_grad 2 --weight_decay 0.02 \
--data_parallel "$data_parallel" --model_parallel_size "$model_parallel" --checkpointing \
--llama_type llama_peft --llama_config $llama_config --tokenizer_path "$tokenizer_path" \
--no_visual \
--pretrained_path "$pretrained_path" --pretrained_type="$pretrained_type" \
--data_config $data_config \
--quant --only_save_trainable \
2>&1 | tee -a output/"$exp_name"/output.log

echo "exp name: $exp_name"

I tested on 2x A100-80GB with batch_size=1; the maximum GPU memory required was 47.02GB, while the A6000 is said to have 48GB of VRAM in total, so it should just fit. Make sure you have torch 2 and flash-attention installed.

(2) Loading TheBloke/Llama-2-70B-chat-GPTQ directly is not supported, because both its data format and its quantization kernel are incompatible with ours. We currently support Meta's format and our own format, and the quantization method we use is inherited from bitsandbytes' 4-bit quantization.
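
For context, the sketch below shows roughly what that bitsandbytes-style 4-bit path looks like for a single linear layer. It is illustrative only, not code from LLaMA2-Accessory: the layer size and the NF4 setting are placeholders, and the point is simply that the weights end up in bitsandbytes' block-wise 4-bit layout, which is not interchangeable with the packed integer layout of a GPTQ checkpoint.

import torch
import torch.nn as nn
import bitsandbytes as bnb

# an fp16 linear layer standing in for one projection of the LLaMA model
fp16_linear = nn.Linear(4096, 4096, bias=False, dtype=torch.float16)

# wrap the same weights in a bitsandbytes 4-bit layer (NF4 assumed here)
quant_linear = bnb.nn.Linear4bit(
    fp16_linear.in_features, fp16_linear.out_features, bias=False,
    compute_dtype=torch.float16, quant_type="nf4",
)
quant_linear.weight = bnb.nn.Params4bit(
    fp16_linear.weight.data, requires_grad=False, quant_type="nf4"
)

# the actual packing into 4-bit blocks only happens when the layer is moved to GPU
quant_linear = quant_linear.cuda()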

wj210 commented 11 months ago

Thanks for the answer. Another question: if I quantize the model and then fine-tune it, will the saved model be quantized as well? If so, should I avoid passing the --quant flag again when loading it for inference with single_turn.py?

wj210 commented 11 months ago

Another thing I'm unclear about: why does quantization use more memory?

Not quantized: [screenshot of GPU memory usage]

Quantized: [screenshot of GPU memory usage]

kriskrisliu commented 11 months ago

> Thanks for the answer. Another question: if I quantize the model and then fine-tune it, will the saved model be quantized as well? If so, should I avoid passing the --quant flag again when loading it for inference with single_turn.py?

When quantization is activated, the --only_save_trainable flag must be set to True, as described in the docs, so only the trainable (PEFT) weights are saved. At inference time, pass the base model weights and your saved trainable weights together to --pretrained_path, for example:

torchrun --master_port=1112 --nproc_per_node=1 demos/single_turn.py \
--llama_type llama_peft \
--llama_config <path-to-70b-params.json> \
--tokenizer_path <path-to-tokenizer> \
--pretrained_path <path-to-70b-base-weight>  <path-to-trainable-weights-you-have-saved> \
--quant

kriskrisliu commented 11 months ago

> Another thing I'm unclear about: why does quantization use more memory?
>
> Not quantized: [screenshot of GPU memory usage]
>
> Quantized: [screenshot of GPU memory usage]

It appears that you are running the non-quantized model with model_parallel=8, while loading the quantized model with model_parallel=1. With model_parallel=8, the single 70B model is split into 8 slices, one per GPU, so the total VRAM usage of the non-quantized model becomes 17720 MB x 8 = 141760 MB (~141GB), whereas the quantized model uses 31815 MB (~31GB) in total.

However, this is just my assumption based on the provided information. If you could share the running script for the non-quantized model, I can provide more specific advice.

wj210 commented 11 months ago

I am only using the 7B model, so model_parallel is set to 1:

export OMP_NUM_THREADS=8
export CUDA_VISIBLE_DEVICES=0,1,2 # adjust here

pretrained_path="../../llama/llama-2-7b-chat"
pretrained_type=meta_ori
llama_config="../../llama/llama-2-7b-chat/params.json configs/model/finetune/sg/llamaPeft_normBiasLora.json"
tokenizer_path="../../llama/tokenizer.model"
data_config=configs/data/finetune/sg/fin_qa.yaml

data_parallel=sdp
model_parallel=1 # set according to model size, need to check num shards

exp_name=finetune/invest_openai_qlora_test
echo "exp name: $exp_name"
mkdir -p output/"$exp_name"

torchrun --master_port=1113 --nproc_per_node=3 main_finetune.py \
--output_dir output/"$exp_name" --epochs 10 --warmup_epochs 1 \
--batch_size 1 --accum_iter 1 --num_workers 8 \
--max_words 1024 \
--lr 0.00005 --min_lr 0.000005 --clip_grad 2 --weight_decay 0.02 \
--data_parallel "$data_parallel" --model_parallel_size "$model_parallel" --checkpointing \
--llama_type llama_peft --llama_config $llama_config --tokenizer_path "$tokenizer_path" \
--no_visual \
--pretrained_path "$pretrained_path" --pretrained_type="$pretrained_type" \
--data_config $data_config \
--do_val \
--patience 3 \
--quant --only_save_trainable \
2>&1 | tee -a output/"$exp_name"/output.log

echo "exp name: $exp_name"

kriskrisliu commented 11 months ago


I have successfully run the Alpaca fine-tuning process without encountering any bugs. I observed that the quantized model only requires 8GB of memory, while the fp16 model demands 15GB.

Your problem is likely due to some code modification you've made that influences the quantization process.

Always remember that you should use the quantize function to quantize the model on the CPU, and only then push the quantized model to the GPU with an operation like model.cuda() or model.to(device).
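
As a rough illustration of why that order matters, here is a minimal sketch using bitsandbytes directly (not the repository's own quantize utility, so treat the exact numbers as approximate): the layer is built on the CPU, and the 4-bit packing only happens during the transfer to the GPU, so the full-precision weight never needs to occupy VRAM.

import torch
import bitsandbytes as bnb

# build the 4-bit layer on CPU; its weight is not packed yet at this point
layer = bnb.nn.Linear4bit(8192, 8192, bias=False,
                          compute_dtype=torch.float16, quant_type="nf4")

# moving it to GPU triggers the 4-bit packing, so only the compressed weight
# (plus small quantization statistics) lands in VRAM -- roughly a quarter of
# the ~128 MiB an fp16 copy of this layer would occupy
layer = layer.cuda()
print(f"{torch.cuda.memory_allocated() / 2**20:.1f} MiB allocated on GPU")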