EricFillion / happy-transformer

Happy Transformer makes it easy to fine-tune and perform inference with NLP Transformer models.
http://happytransformer.com
Apache License 2.0

Memory exhaustion #333

Open MrSplinterRat opened 1 year ago

MrSplinterRat commented 1 year ago

Good afternoon!

I followed all the steps described in the article https://www.vennify.ai/llama-2-fine-tuning/ and the video https://youtu.be/I4ZLlzkMRvA on the following configurations: 1xA6000, 8xA6000, and 1xA100 on the runpod.io platform, as well as a V100 32 GB in our own cloud. I tested both the Llama-2 and GPT-J models, and each time the result was similar to the following, down to the numbers:


```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 25.10 GiB (GPU 0; 79.15 GiB total capacity; 62.79 GiB already allocated; 15.26 GiB free; 62.79 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-09-07 13:55:11,220] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 15064
[2023-09-07 13:55:11,220] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-u', './ht_01.py', '--local_rank=0'] exits with return code = 1
```


(This particular example is taken from a test on the A100 on runpod.io.) In other words, the GPU memory was completely exhausted regardless of how much was available, and the program terminated with an error.
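
For what it's worth, the error message itself suggests trying max_split_size_mb. Below is a minimal sketch of how that allocator setting can be applied before training; whether it would help here is unclear, since the problem looks like genuine memory exhaustion rather than fragmentation:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
# so it is configured before any model or trainer is created.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable so the allocator picks it up
```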

Could you please tell me what might be causing this failure?
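
For reference, the training code from the article (my script is ./ht_01.py, launched through DeepSpeed, as seen in the log above) is roughly of the following form. This is only a minimal sketch: the model name and GENTrainArgs values are illustrative, not my exact settings, and parameter names may differ between happy-transformer versions.

```python
from happytransformer import HappyGeneration, GENTrainArgs

# Load the base model (model name as used in the article; illustrative here).
happy_gen = HappyGeneration("LLAMA-2", "meta-llama/Llama-2-7b-hf")

# Training arguments; values are placeholders, not my exact settings.
args = GENTrainArgs(num_train_epochs=1)

# Fine-tune on a training file (path is a placeholder).
happy_gen.train("train.csv", args=args)

# Save the fine-tuned model.
happy_gen.save("finetuned-llama-2/")
```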