DLLXW / baby-llama2-chinese

A repo for pre-training a small-parameter Chinese LLaMA2 from scratch and then SFT-ing it; a single 24 GB GPU is enough to end up with a chat-llama2 that can handle simple Chinese Q&A.

sft.py fails with "CUDA out of memory", how can I fix it? #22

Closed: qxj closed this 11 months ago

qxj commented 12 months ago

Run log:

(llama2) [llama2-chinese]$ python sft.py 
tokens per iteration will be: 16,384
breaks down as: 1 grad accum steps * 1 processes * 32 batch size * 512 max seq len
                                                   prompt                                             answer
757309  选择以下列表中的一个数学公式并解释它,“a² + b² = c²”、“y = mx + b”...  \n“a² + b² = c²” 表示勾股定理,用于计算直角三角形的斜边长度。\n“y = ...
31228            给出一句话,用另一种语言(如法语、德语等)重新表达。\n生命中最重要的事情是什么  Quelle est la chose la plus importante dans la...
227106  描述如何制作一杯拿铁咖啡,包括所需材料和步骤。 \n所需材料: \n- 2盎司浓缩咖啡 \n...  步骤:\n1. 准备好所需材料。\n2. 在咖啡杯中倒入2盎司的浓缩咖啡。\n3. 在另一个...
53255   提供两个类别,例如“A”和“B”,该为一组数据点分配这两个类别之一,并给出理由。\n类别1:...  数据点1属于产品设计类别,因为它涉及产品的安全和设计方面,需要重新设计产品形状以减少意外伤害...
752602                               提供一份食谱\n煎虾饼需要哪些材料?\n              煎虾饼的材料通常包括虾仁、豆腐、鸡蛋、淀粉、调味品(盐、胡椒粉、姜末等)。
...                                                   ...                                                ...
303642  给定一段文本,编写一个python函数,计算其中单词的数量。\n“编程是一项非常有趣的技能,...  以下是一个计算文本中单词数量的Python函数:\n```\ndef count_words...
560061  给定一段格式混乱的文本,请将其按照规定的格式进行排版,并输出排版后的结果。\n标题: 世界闻...  标题:世界闻名的科学家\n文本:爱因斯坦、牛顿和霍金都是伟大的科学家,他们所做出的贡献推动了...
642915  给定一段文本,请问其中出现最多的单词是什么?\n文本: 散步是我最喜欢的活动之一。我发现它可...                                       出现最多的单词是“我”。
227969  根据给定的文本情感,提供情感分析结果和可信度得分。\n文本:"我喜欢这个电影,演员表现得非常...                                 情感分析结果:积极\n可信度得分:高
45020   为下列一段文本生成一个简洁的标题。\n文本: 这个夏天,因为天气炎热和各种植物的成长,在我们...                                         夏日花园里的多彩花朵

[802899 rows x 2 columns]
Initializing a new model from scratch
WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0
(warning repeated 12 times)
num decayed parameter tensors: 85, with 218,129,408 parameters
num non-decayed parameter tensors: 25, with 25,600 parameters
using fused AdamW: False
/home/qxj/conda/envs/llama2/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
[2023-09-04 10:49:52,275][sft.py][INFO] Epoch:[0/2](0/25091) loss:2.822 lr:0.0000000 epoch_Time:759.0min:
Traceback (most recent call last):
  File "sft.py", line 323, in <module>
    train_epoch(epoch)
  File "sft.py", line 75, in train_epoch
    scaler.scale(loss).backward()
  File "/home/qxj/conda/envs/llama2/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/qxj/conda/envs/llama2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 3.95 GiB (GPU 0; 39.59 GiB total capacity; 33.25 GiB already allocated; 2.56 GiB free; 35.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
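
The error text itself suggests one mitigation: setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of trying that (it only reduces fragmentation, it does not lower peak usage, and it must be set before the first CUDA allocation):

```python
# Apply the allocator hint from the OOM message before CUDA is initialized.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # import after setting the env var so the allocator picks it up

# Equivalent from the shell: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python sft.py
print(torch.cuda.is_available())
```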
ysinwell commented 11 months ago

I'm hitting the same problem. Run log below:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.95 GiB (GPU 0; 23.69 GiB total capacity; 18.23 GiB already allocated; 3.33 GiB free; 18.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Pretraining completes fine, but the GPU runs out of memory during SFT.

qxj commented 11 months ago

> I'm hitting the same problem. Run log below:
>
> torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.95 GiB (GPU 0; 23.69 GiB total capacity; 18.23 GiB already allocated; 3.33 GiB free; 18.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
>
> Pretraining completes fine, but the GPU runs out of memory during SFT.

Same situation here. @DLLXW could you take a look?

ysinwell commented 11 months ago

> > I'm hitting the same problem. Run log below:
> >
> > torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.95 GiB (GPU 0; 23.69 GiB total capacity; 18.23 GiB already allocated; 3.33 GiB free; 18.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
> >
> > Pretraining completes fine, but the GPU runs out of memory during SFT.
>
> Same situation here. @DLLXW could you take a look?

I tried it afterwards; shrinking the SFT batch size fixed it.
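
Roughly the kind of change involved (the variable names below are assumptions, not necessarily the repo's exact ones; check the hyperparameter block at the top of sft.py):

```python
# Hypothetical excerpt of the hyperparameter block in sft.py (names are assumptions).
# Halving the batch size roughly halves activation memory; raising
# gradient_accumulation_steps keeps the effective tokens per optimizer step the same.
batch_size = 16                    # was 32 in the log ("32 batch size")
gradient_accumulation_steps = 2    # was 1 ("1 grad accum steps"); 16 * 2 = 32 effective
max_seq_len = 512                  # unchanged, so tokens per iteration stays 16,384
```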

Vincent-ZHQ commented 11 months ago

It's just running out of GPU memory; shrinking the batch size will fix it. Your log also has the warning that Flash Attention cannot be used, so upgrade PyTorch to 2.0: training will be faster and memory usage a little lower.
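
For context, that warning comes from the usual guard around PyTorch 2.0's fused attention kernel; a minimal sketch of the pattern (not the repo's exact code) looks like this:

```python
import math
import torch
import torch.nn.functional as F

# PyTorch >= 2.0 exposes a fused (Flash-Attention-backed) kernel; older versions fall
# back to materializing the full attention matrix, which is slower and uses more memory.
HAS_FLASH = hasattr(F, "scaled_dot_product_attention")

def attention(q, k, v, causal=True):
    if HAS_FLASH:
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    # Slow path: explicit (seq_len x seq_len) score matrix.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        seq_len = q.size(-2)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), 1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```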

qxj commented 11 months ago

> It's just running out of GPU memory; shrinking the batch size will fix it. Your log also has the warning that Flash Attention cannot be used, so upgrade PyTorch to 2.0: training will be faster and memory usage a little lower.

Thanks, shrinking the batch size did solve it. But under PyTorch 2.0 training still goes to NaN and I haven't found the cause, so I'm staying on 1.x for now: https://github.com/DLLXW/baby-llama2-chinese/issues/17#issuecomment-1706116677
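
To narrow down where the NaN first appears, a minimal, generic PyTorch debugging sketch (not tied to this repo's sft.py; the helper name is made up) could look like:

```python
import torch

# Flag the backward op that first produces NaN/Inf (slow; enable only while debugging).
torch.autograd.set_detect_anomaly(True)

def check_step(loss, scaler, optimizer, model):
    """Skip and report a training step whose loss is already non-finite."""
    if not torch.isfinite(loss):
        print("non-finite loss, skipping step:", loss.item())
        optimizer.zero_grad(set_to_none=True)
        return False
    scaler.scale(loss).backward()
    # Unscale so gradients can be inspected / clipped in fp32 before stepping.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
    return True
```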

Vincent-ZHQ commented 11 months ago

I just ran into it as well: fine-tuning yesterday's pretrained model, everything came out NaN, and looking back at yesterday's pretraining log, the later part was all NaN too. I reran it too early and the log got overwritten, so I haven't yet reproduced where the NaN first appears.