Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

How can I change the model's tensor type to float16? #82

Closed · yeonju7kim closed 10 months ago

yeonju7kim commented 10 months ago

I tried running exps/finetune/sg/alpaca.sh. I thought the model would be torch.bfloat16 if I set mixed_precision_dtype to torch.bfloat16, because of the following code: https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/9bd8b61a7df83f22d1f84aaeeb9dc2b98dd02a34/accessory/main_finetune.py#L230C1-L234C11

But it looks like the model's dtype is float32: the llama2-7b model consumes more than 28GB when it is loaded.
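For reference, one way to check the loaded dtype and parameter footprint (a minimal sketch assuming the model is a standard torch.nn.Module; `model` stands in for whatever the script builds):

```python
import torch
from collections import Counter

def param_summary(model: torch.nn.Module) -> None:
    """Print parameter dtypes and total parameter memory."""
    dtypes = Counter(str(p.dtype) for p in model.parameters())
    total = sum(p.numel() * p.element_size() for p in model.parameters())
    # 7B params: ~28 GiB at 4 bytes (float32), ~14 GiB at 2 bytes (bf16/fp16)
    print(dict(dtypes))
    print(f"total parameter memory: {total / 1024**3:.1f} GiB")
```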

https://llama2-accessory.readthedocs.io/en/latest/finetune/quantization.html

On the above page, I saw that llama2-70b can be loaded in 145GB of GPU memory, which implies the model dtype is float16. So I expected llama2-7b to take about 14GB when loaded, but it used more than 28GB, which implies the dtype is float32.

How can I change my model to float16?

linziyi96 commented 10 months ago

Thank you for your interest! The script exps/finetune/sg/alpaca.sh runs full parameter fine-tuning. For each tuned parameter, an FP32 master weight and two FP32 AdamW momentum buffers (the first and second moment estimates) have to be stored, adding up to 12 bytes per parameter. Thus, full parameter fine-tuning consumes substantially more memory than inference, where only a single FP16/BF16 value (2 bytes) is needed per parameter.
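Concretely, the accounting for a 7B model looks like this (a back-of-the-envelope sketch; the BF16 working copy is assumed for mixed-precision setups, and gradients, activations, and CUDA overhead come on top):

```python
# Back-of-the-envelope memory accounting for full-parameter fine-tuning
# of a 7B model, following the breakdown above. Gradients, activations,
# and framework overhead are extra and not counted here.
N = 7e9  # number of parameters

fp32_master   = 4 * N  # FP32 master weights
adamw_moments = 8 * N  # two FP32 AdamW moment buffers, 4 bytes each
bf16_working  = 2 * N  # BF16 working copy (assumed for mixed precision)

print(f"fine-tuning state: ~{(fp32_master + adamw_moments + bf16_working) / 1024**3:.0f} GiB")  # ~91 GiB
print(f"inference weights: ~{bf16_working / 1024**3:.0f} GiB")                                  # ~13 GiB
```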

To run full parameter fine-tuning, you will need at least 2 A100-80GB GPUs (or equivalent). Alternatively, you can try one of the parameter-efficient fine-tuning (PEFT) methods provided at exps/finetune/sg/alpaca_llamaAdapter.sh and exps/finetune/sg/alpaca_llamaPeft_*.sh. Since each frozen parameter needs only 2 bytes (no master weight and no optimizer state), PEFT methods consume significantly less memory than full parameter fine-tuning.
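The gist of why PEFT saves memory, as a minimal sketch (the `adapter` name filter below is a hypothetical naming convention for illustration, not LLaMA2-Accessory's actual scheme):

```python
import torch

def freeze_base_model(model: torch.nn.Module, trainable_keyword: str = "adapter") -> None:
    """Freeze and down-cast all parameters except the adapter ones.

    The "adapter" substring match is illustrative only, not the
    repository's actual parameter naming.
    """
    for name, param in model.named_parameters():
        if trainable_keyword in name:
            param.requires_grad = True   # tuned: pays the 12-byte optimizer cost
        else:
            param.requires_grad = False  # frozen: no gradient, no optimizer state
            param.data = param.data.to(torch.bfloat16)  # 2 bytes per parameter

# Hand only the trainable parameters to AdamW, so optimizer state is
# allocated just for the small adapter set:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```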

yeonju7kim commented 10 months ago

Thank you for the kind explanation; that clears it up. I will try the LLaMA-Adapter approach. Thanks!