Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

FusedAdam #85

Closed · yeonju7kim closed this issue 10 months ago

yeonju7kim commented 10 months ago

Hi! Thank you for developing this wonderful work.

I want to fully finetune LLaMA, but because the model has 2 optimizers, it is hard to fit the model on the GPU. I want to know why there are 2 optimizers and whether it is okay to use only 1 optimizer. Thank you.

ChrisLiu6 commented 10 months ago

Hi! What does "there are 2 optimizers" mean? By design there should only be one.

yeonju7kim commented 10 months ago

https://github.com/Alpha-VLLM/LLaMA2-Accessory/issues/82

Sorry, I meant the two FP32 AdamW momentums mentioned here. Does FusedAdam consume twice the memory? If I use AdamW instead of FusedAdam, can I reduce the memory usage?

ChrisLiu6 commented 10 months ago

These two state variables are required by the Adam algorithm itself and are irrelevant to the exact implementation (Apex FusedAdam / torch AdamW), so you will probably not be able to save GPU memory from this angle. If you could provide the number and type of GPUs you use, we might be able to offer you some more suggestions.
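
For reference, here is a minimal plain-PyTorch sketch (not Accessory-specific) showing where the two moment tensors live: whichever Adam implementation you use, both `exp_avg` and `exp_avg_sq` are allocated per parameter, which is what doubles the optimizer-state memory.

```python
import torch

# Toy model just to inspect optimizer state; any module behaves the same way.
model = torch.nn.Linear(1024, 1024)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

model(torch.randn(2, 1024)).sum().backward()
opt.step()

for p in model.parameters():
    state = opt.state[p]
    # Both moments have the same shape as the parameter itself,
    # hence roughly 2x the parameter memory for optimizer states.
    print(state["exp_avg"].shape, state["exp_avg_sq"].shape)
```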

yeonju7kim commented 10 months ago

Thank you. I have 3 A6000 GPUs, 40GB each. I trained on my task with llama_adapter, but the model didn't adapt to my task. Do you have any suggestions?

ChrisLiu6 commented 10 months ago

Full-parameter finetuning of a 7B model is pretty hard given your hardware. Specifically, it involves 28GB of memory for parameters, 28×2GB for optimizer states, and 28GB for gradients. When trained on 3 GPUs, the memory load on each GPU would be approximately 28 + 28×3/3 = 56GB when using SDP, and more than (28×4)/3 ≈ 37GB when using FSDP, without considering the cost of intermediate results.
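
If it helps, the arithmetic behind these estimates can be written out as a back-of-the-envelope sketch (illustrative only; it ignores activations and framework overhead):

```python
# Rough per-GPU memory estimate for full-parameter finetuning of a 7B model
# with fp32 parameters, Adam optimizer states, and 3 GPUs.
params_gb = 7e9 * 4 / 1e9          # ~28 GB of fp32 parameters
optim_gb = 2 * params_gb           # ~56 GB of Adam moments (exp_avg + exp_avg_sq)
grads_gb = params_gb               # ~28 GB of gradients
n_gpus = 3

# SDP: parameters replicated on every GPU; optimizer states and gradients sharded.
sdp_per_gpu = params_gb + (optim_gb + grads_gb) / n_gpus    # ~56 GB
# FSDP: parameters, optimizer states, and gradients all sharded.
fsdp_per_gpu = (params_gb + optim_gb + grads_gb) / n_gpus   # ~37 GB

print(f"SDP: {sdp_per_gpu:.0f} GB/GPU, FSDP: {fsdp_per_gpu:.0f} GB/GPU")
```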

The following are my suggestions:

  1. There do exist techniques like CPU offloading that could further lower the GPU memory cost. If you do need full-parameter finetuning, you can turn to such methods. Ideally, it should take no more than changing a few function arguments to make CPU offloading work with Accessory (you may refer to this document for details, and see the sketch after this list), but we have yet to explore this.
  2. On the other hand, PEFT settings should be more suitable for you. Do you mean that you can successfully train llama-adapter but the performance is unsatisfactory? If so, you may consider other PEFT techniques like bias-norm-lora tuning.
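
For item 1, here is roughly what CPU offloading looks like with vanilla PyTorch FSDP. This is only a hypothetical sketch, not the tested Accessory code path, and `build_model` is a placeholder for your actual model constructor:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

# Assumes the script is launched with torchrun so distributed env vars are set.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model()  # hypothetical helper; replace with the real model constructor

# offload_params=True keeps sharded parameters (and, with them, gradients and
# optimizer states) in CPU memory, moving shards to the GPU only when needed.
fsdp_model = FSDP(
    model,
    cpu_offload=CPUOffload(offload_params=True),
)
```
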
yeonju7kim commented 10 months ago

Thank you so much for your help; it really helps. I will try the PEFT techniques.