Closed: Franciscus-Carolus closed this issue 11 months ago.
Thanks for your interest!
You might consider enabling parameter offloading in conjunction with DeepSpeed ZeRO Stage 3. However, my understanding is that this can slow training down significantly because of the communication overhead between the CPU and GPU. There is also a minor concern about the stability of Stage 3, which in my experience can lead to unpredictable model performance.
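For reference, here is a minimal sketch of what such a configuration might look like in the standard DeepSpeed JSON config format. The file name `ds_config_zero3.json`, the bf16 setting, and the gradient accumulation value are assumptions for illustration, not details from this thread.

```python
import json

# Sketch of a DeepSpeed ZeRO Stage 3 config with both parameter and
# optimizer-state offloading to CPU. Values are illustrative; tune them
# for the actual hardware.
ds_config = {
    "zero_optimization": {
        "stage": 3,                    # ZeRO Stage 3: partition params, grads, optimizer states
        "offload_param": {             # offload model parameters to CPU RAM
            "device": "cpu",
            "pin_memory": True,
        },
        "offload_optimizer": {         # offload optimizer states to CPU RAM
            "device": "cpu",
            "pin_memory": True,
        },
    },
    "bf16": {"enabled": True},         # assumption: an Ampere-class GPU such as the A40
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,  # illustrative value
}

# Write the config so it can be passed to the DeepSpeed launcher or to the
# Hugging Face Trainer via TrainingArguments(deepspeed="ds_config_zero3.json").
with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The slowdown mentioned above comes from parameters being fetched from CPU memory on every forward and backward pass, so this setup trades speed for capacity.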
Thank you for your answer!
First of all, thank you for coming up with this great method. I am currently performing full-weight fine-tuning of llama-2-7b-hf on an NVIDIA A40 GPU. The GPU has about 44 GB of memory, and about 70 GB of CPU memory is available. (The GPU server in our lab is shared by several people; most of the time roughly 90 GB of CPU memory is free, but sometimes it is less. I could use two A40 GPUs if necessary, but the total CPU memory would not change.) Following your suggestions in other issues, I set "offload_optimizer_device" to cpu and set both "per_device_train_batch_size" and "per_device_eval_batch_size" to 1, but even then the CPU memory cannot meet the model's requirements. Is there any way to run full-weight fine-tuning successfully under these conditions?
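For context, a minimal sketch of how the settings described above would typically be wired together, assuming the Hugging Face Trainer is the training entry point. The output directory and the DeepSpeed config file name are hypothetical placeholders; the config file is assumed to contain an offload_optimizer section with its device set to "cpu".

```python
from transformers import TrainingArguments

# Sketch of the training arguments described in the question:
# batch size 1 per device, plus a DeepSpeed config that offloads
# optimizer states to CPU. Paths are assumptions for illustration.
training_args = TrainingArguments(
    output_dir="./llama2-7b-full-ft",  # hypothetical output directory
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    deepspeed="ds_config.json",        # config with offload_optimizer.device = "cpu"
)
```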