DLLXW / baby-llama2-chinese

A repo for pre-training a small-parameter Chinese LLaMA2 from scratch and then SFT-ing it; a single 24 GB GPU is enough to end up with a chat-llama2 that can handle simple Chinese Q&A.

sft.py fails with "CUDA out of memory", how can I fix it? #22

Closed: qxj closed this 11 months ago

qxj commented 12 months ago

Run log:

(llama2) [llama2-chinese]$ python sft.py 
tokens per iteration will be: 16,384
breaks down as: 1 grad accum steps * 1 processes * 32 batch size * 512 max seq len
                                                   prompt                                             answer
757309  选择以下列表中的一个数学公式并解释它,“a² + b² = c²”、“y = mx + b”...  \n“a² + b² = c²” 表示勾股定理,用于计算直角三角形的斜边长度。\n“y = ...
31228            给出一句话,用另一种语言(如法语、德语等)重新表达。\n生命中最重要的事情是什么  Quelle est la chose la plus importante dans la...
227106  描述如何制作一杯拿铁咖啡,包括所需材料和步骤。 \n所需材料: \n- 2盎司浓缩咖啡 \n...  步骤:\n1. 准备好所需材料。\n2. 在咖啡杯中倒入2盎司的浓缩咖啡。\n3. 在另一个...
53255   提供两个类别,例如“A”和“B”,该为一组数据点分配这两个类别之一,并给出理由。\n类别1:...  数据点1属于产品设计类别,因为它涉及产品的安全和设计方面,需要重新设计产品形状以减少意外伤害...
752602                               提供一份食谱\n煎虾饼需要哪些材料?\n              煎虾饼的材料通常包括虾仁、豆腐、鸡蛋、淀粉、调味品(盐、胡椒粉、姜末等)。
...                                                   ...                                                ...
303642  给定一段文本,编写一个python函数,计算其中单词的数量。\n“编程是一项非常有趣的技能,...  以下是一个计算文本中单词数量的Python函数:\n```\ndef count_words...
560061  给定一段格式混乱的文本,请将其按照规定的格式进行排版,并输出排版后的结果。\n标题: 世界闻...  标题:世界闻名的科学家\n文本:爱因斯坦、牛顿和霍金都是伟大的科学家,他们所做出的贡献推动了...
642915  给定一段文本,请问其中出现最多的单词是什么?\n文本: 散步是我最喜欢的活动之一。我发现它可...                                       出现最多的单词是“我”。
227969  根据给定的文本情感,提供情感分析结果和可信度得分。\n文本:"我喜欢这个电影,演员表现得非常...                                 情感分析结果:积极\n可信度得分:高
45020   为下列一段文本生成一个简洁的标题。\n文本: 这个夏天,因为天气炎热和各种植物的成长,在我们...                                         夏日花园里的多彩花朵

[802899 rows x 2 columns]
Initializing a new model from scratch
WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0
(warning repeated 12 times)
num decayed parameter tensors: 85, with 218,129,408 parameters
num non-decayed parameter tensors: 25, with 25,600 parameters
using fused AdamW: False
/home/qxj/conda/envs/llama2/lib/python3.8/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
  warnings.warn(warning.format(ret))
[2023-09-04 10:49:52,275][sft.py][INFO] Epoch:[0/2](0/25091) loss:2.822 lr:0.0000000 epoch_Time:759.0min:
Traceback (most recent call last):
  File "sft.py", line 323, in <module>
    train_epoch(epoch)
  File "sft.py", line 75, in train_epoch
    scaler.scale(loss).backward()
  File "/home/qxj/conda/envs/llama2/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/qxj/conda/envs/llama2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 3.95 GiB (GPU 0; 39.59 GiB total capacity; 33.25 GiB already allocated; 2.56 GiB free; 35.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
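
The error text itself suggests one mitigation: setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of trying that (it only reduces fragmentation, it does not lower peak usage, and it must be set before the first CUDA allocation):

```python
# Apply the allocator hint from the OOM message before CUDA is initialized.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # import after setting the env var so the allocator picks it up

# Equivalent from the shell: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python sft.py
print(torch.cuda.is_available())
```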
ysinwell commented 11 months ago

I'm hitting the same problem. Run log below:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.95 GiB (GPU 0; 23.69 GiB total capacity; 18.23 GiB already allocated; 3.33 GiB free; 18.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Pretraining completes fine, but the GPU runs out of memory during SFT.

qxj commented 11 months ago

> I'm hitting the same problem. Run log below:
>
> torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.95 GiB (GPU 0; 23.69 GiB total capacity; 18.23 GiB already allocated; 3.33 GiB free; 18.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
>
> Pretraining completes fine, but the GPU runs out of memory during SFT.

Same situation here. @DLLXW could you take a look?

ysinwell commented 11 months ago

> > I'm hitting the same problem. Run log below:
> >
> > torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.95 GiB (GPU 0; 23.69 GiB total capacity; 18.23 GiB already allocated; 3.33 GiB free; 18.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
> >
> > Pretraining completes fine, but the GPU runs out of memory during SFT.
>
> Same situation here. @DLLXW could you take a look?

I tried it afterwards; shrinking the SFT batch size fixed it.
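
Roughly the kind of change involved (the variable names below are assumptions, not necessarily the repo's exact ones; check the hyperparameter block at the top of sft.py):

```python
# Hypothetical excerpt of the hyperparameter block in sft.py (names are assumptions).
# Halving the batch size roughly halves activation memory; raising
# gradient_accumulation_steps keeps the effective tokens per optimizer step the same.
batch_size = 16                    # was 32 in the log ("32 batch size")
gradient_accumulation_steps = 2    # was 1 ("1 grad accum steps"); 16 * 2 = 32 effective
max_seq_len = 512                  # unchanged, so tokens per iteration stays 16,384
```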

Vincent-ZHQ commented 11 months ago

It's just running out of GPU memory; shrinking the batch size will fix it. Your log also has the warning that Flash Attention cannot be used, so upgrade PyTorch to 2.0: training will be faster and memory usage a little lower.
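
For context, that warning comes from the usual guard around PyTorch 2.0's fused attention kernel; a minimal sketch of the pattern (not the repo's exact code) looks like this:

```python
import math
import torch
import torch.nn.functional as F

# PyTorch >= 2.0 exposes a fused (Flash-Attention-backed) kernel; older versions fall
# back to materializing the full attention matrix, which is slower and uses more memory.
HAS_FLASH = hasattr(F, "scaled_dot_product_attention")

def attention(q, k, v, causal=True):
    if HAS_FLASH:
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    # Slow path: explicit (seq_len x seq_len) score matrix.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        seq_len = q.size(-2)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), 1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```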

qxj commented 11 months ago

> It's just running out of GPU memory; shrinking the batch size will fix it. Your log also has the warning that Flash Attention cannot be used, so upgrade PyTorch to 2.0: training will be faster and memory usage a little lower.

Thanks, shrinking the batch size did solve it. But under PyTorch 2.0 training still goes to NaN and I haven't found the cause, so I'm staying on 1.x for now: https://github.com/DLLXW/baby-llama2-chinese/issues/17#issuecomment-1706116677
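
To narrow down where the NaN first appears, a minimal, generic PyTorch debugging sketch (not tied to this repo's sft.py; the helper name is made up) could look like:

```python
import torch

# Flag the backward op that first produces NaN/Inf (slow; enable only while debugging).
torch.autograd.set_detect_anomaly(True)

def check_step(loss, scaler, optimizer, model):
    """Skip and report a training step whose loss is already non-finite."""
    if not torch.isfinite(loss):
        print("non-finite loss, skipping step:", loss.item())
        optimizer.zero_grad(set_to_none=True)
        return False
    scaler.scale(loss).backward()
    # Unscale so gradients can be inspected / clipped in fp32 before stepping.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
    return True
```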

Vincent-ZHQ commented 11 months ago

I just ran into it as well: fine-tuning yesterday's pretrained model, everything came out NaN, and looking back at yesterday's pretraining log, the later part was all NaN too. I reran it too early and the log got overwritten, so I haven't yet reproduced where the NaN first appears.