OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

Out Of Memory Issue LISA #801

Closed: harry7171 closed this issue 4 months ago

harry7171 commented 5 months ago

Out Of Memory Issue in LISA

Hi,

I have been trying to use LISA to finetune on my domain-specific data. However, I am not using LMFlow directly; instead I am using the DynamicLayerActivationCallback class with the HF Trainer.

I have an 80GB A100 and am finetuning on it with Mistral 7B in FP32, which occupies 29GB of memory once loaded. But when I call trainer.train(), GPU memory ramps up and I hit an OOM.
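Roughly, my setup looks like the sketch below (simplified; LisaStyleCallback is just my shorthand for the DynamicLayerActivationCallback idea of re-sampling which decoder layers are trainable every few steps, the layer path assumes a Mistral/Llama-style model, and my_dataset is a placeholder for my domain data):

```python
import random

from transformers import (AutoModelForCausalLM, Trainer, TrainerCallback,
                          TrainingArguments)


class LisaStyleCallback(TrainerCallback):
    """LISA-style switching: keep only a few decoder layers trainable and
    re-sample which ones every `interval` steps (hand-rolled, simplified)."""

    def __init__(self, model, n_active_layers=1, interval=20):
        self.layers = model.model.layers  # Mistral/Llama-style decoder stack
        self.n_active_layers = n_active_layers
        self.interval = interval

    def _switch_layers(self):
        active = set(random.sample(range(len(self.layers)), self.n_active_layers))
        for idx, layer in enumerate(self.layers):
            for p in layer.parameters():
                p.requires_grad = idx in active

    def on_step_begin(self, args, state, control, **kwargs):
        if state.global_step % self.interval == 0:
            self._switch_layers()


model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # FP32 weights

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1),
    train_dataset=my_dataset,  # placeholder: my tokenized domain dataset
    callbacks=[LisaStyleCallback(model, n_active_layers=1, interval=20)],
)
trainer.train()  # OOM is raised during this call
```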

Below is the error - OutOfMemoryError: CUDA out of memory. Tried to allocate 490.00 MiB. GPU 0 has a total capacty of 79.14 GiB of which 461.88 MiB is free. Including non-PyTorch memory, this process has 78.59 GiB memory in use. Of the allocated memory 76.86 GiB is allocated by PyTorch, and 1.28 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

While going through the traceback I didn't find any obvious clue; the error was raised inside the forward call of a softmax.
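(For reference, the allocator setting mentioned in the error message is an environment variable that has to be set before the first CUDA allocation; the value 128 below is just an example, not something taken from the error.)

```python
import os

# Must be set before the first CUDA allocation, e.g. before loading the model.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```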

I have been trying to figure this out for a few days. Please assist. Thanks.

research4pan commented 5 months ago

Thanks for your interest in LMFlow! Loading the model alone requires 7B × 4 bytes/param = 28 GB of memory, while full-parameter training requires roughly 7B × (4 + 12) bytes/param = 112 GB once gradients and Adam optimizer states are included. Using one's own DynamicLayerActivationCallback may still incur the same kind of memory consumption if the optimizer states are not reinitialized every time the active layers switch.
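As a quick back-of-the-envelope check (assuming FP32 weights plus roughly 12 bytes/param for gradients and Adam optimizer states):

```python
params = 7e9  # Mistral 7B

load_only = params * 4           # FP32 weights only
full_train = params * (4 + 12)   # + gradients and Adam optimizer states

print(f"load only : {load_only / 1e9:.0f} GB")    # ~28 GB
print(f"full train: {full_train / 1e9:.0f} GB")   # ~112 GB
```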

To enable training of such models, you may use LMFlow's implementation, or change the float type to bf16, which should not affect performance much. Please feel free to let us know if you encounter further problems regarding this issue. Hope this information can be helpful 😄
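With the HF Trainer, for example, bf16 is a single flag (illustrative snippet, not LMFlow's own configuration):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,                        # bf16 mixed precision instead of FP32
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,    # example values
)
```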

harry7171 commented 5 months ago

Hi, thanks for your quick response @research4pan.

Yes, you are right. I tried using bf16 and the finetuning worked, but only with 1 active layer for LISA. With 2 layers I faced the same issue: training did start, but it went OOM after just 32 steps.

I am a bit confused about this. Please let me know if I am missing something or going wrong somewhere.

research4pan commented 5 months ago

We recommend using DynamicLayerActivationCallback together with paged_adamw, which allows occasional OOM spikes to be handled gracefully. Hope that can be helpful 😄
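With the HF Trainer this can be selected through the optim argument (sketch; the paged optimizers are provided by bitsandbytes):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,
    optim="paged_adamw_32bit",   # "paged_adamw_8bit" is the lower-memory variant
)
```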

harry7171 commented 4 months ago

Thanks @research4pan, I tried using bf16 for finetuning and it's working well enough. I will try paged_adamw as well. Thanks a lot for your help!