OpenMOSS / CoLLiE

Collaborative Training of Large Language Models in an Efficient Way
https://openlmlab-collie.readthedocs.io
Apache License 2.0

Training loss is NaN #107

Closed. fuqianya closed this issue 11 months ago.

fuqianya commented 1 year ago

I am running supervised fine-tuning of the Moss-7B base model with pipeline parallelism on 4x V100 GPUs, but the loss stays NaN throughout training. What could be the cause?

My configuration is as follows:

"""
使用CoLLie微调Moss-base模型
"""
import sys
sys.path.append('..')
import torch
from transformers import AutoTokenizer

from collie.config import CollieConfig

from collie.data import CollieDatasetForTraining
from collie.controller.trainer import Trainer
from collie.controller.evaluator import EvaluatorForPerplexity, EvaluatorForGeneration
from collie.models.moss import MossForCausalLM
from collie.utils.monitor import StepTimeMonitor, TGSMonitor, MemoryMonitor, LossMonitor, EvalMonitor
from collie.metrics import DecodeMetric, PPLMetric
from collie.module import GPTLMLoss

# 1. Set up paths
# 1.1 Path to the pretrained model
pretrained_model = "/pretrained_weights/moss-base-7b"

# 2. Set up the configuration
# 2.1 Load the config
config = CollieConfig.from_pretrained(pretrained_model, trust_remote_code=True,
                                      local_files_only=True)
config.tp_size = 1
config.dp_size = 1
config.pp_size = 4
config.use_flash = False
config.train_epochs = 1
config.eval_per_n_steps = 0
config.eval_per_n_epochs = 1 
config.train_micro_batch_size = 1
config.eval_batch_size = 1
config.gradient_accumulation_steps = 4

# 3. Set up the tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model,
                                          trust_remote_code=True,
                                          local_files_only=True)

# 4. Load the dataset
train_dataset = [
    {
        'input': 'Collie is a python package for ',
        'output': 'finetuning large language models.'
    } for _ in range(10000)
]
train_dataset = CollieDatasetForTraining(train_dataset, tokenizer)
eval_dataset = train_dataset[:32]

# 5. Load the pretrained model
model = MossForCausalLM.from_pretrained(pretrained_model, config=config)

# 6. Set up the optimizer
# optimizer = Lomo(
#     model,
#     lr = 0.001,
#     clip_grad_norm = 5.0
# )

# Lomo is not compatible with pipeline parallelism
optimizer = torch.optim.AdamW(model.parameters(), lr=9e-6)

# 7. Add monitors
monitors = [
    StepTimeMonitor(config),
    TGSMonitor(config),
    MemoryMonitor(config),
    LossMonitor(config),
    EvalMonitor(config)
]

# 8. Add evaluators
evaluator_ppl = EvaluatorForPerplexity(
    model = model,
    config = config,
    dataset = eval_dataset,
    monitors = [
        EvalMonitor(config)
    ],
    metrics = {
        'ppl': PPLMetric()
    }
)
evaluator_decode = EvaluatorForGeneration(
    model = model,
    config = config,
    tokenizer = tokenizer,
    dataset = eval_dataset,
    monitors = [
        EvalMonitor(config)
    ],
    metrics = {
        'decode': DecodeMetric()
    }
)

# 9. Instantiate the trainer
trainer = Trainer(
    model = model,
    config = config,
    loss_fn = GPTLMLoss(-100),
    optimizer = optimizer,
    train_dataset = train_dataset,
    monitors = monitors,
    evaluators = [evaluator_ppl, evaluator_decode]
)
# 10. Train / evaluate
trainer.train()

# CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 finetune_moss_base.py

The output is as follows:

[screenshot of the training log: the loss is nan at every step]
KaiLv69 commented 1 year ago

Hi, my guess is that the loss on this sentence is too large for Moss. You could try the following:

  1. Set config.ds_config={"bf16": {"enabled": True}} to train in bf16 (see the sketch below)
  2. Overfit on a different sentence, e.g. a Chinese one or a more common English one
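
A minimal sketch of option 1, appended to the configuration block in the script above (only the "bf16" key comes from this thread; any further DeepSpeed options would be assumptions):

config.ds_config = {
    "bf16": {"enabled": True},  # let DeepSpeed run training in bfloat16
}
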
fuqianya commented 1 year ago

Hi, after testing, it does not seem to be the second cause. I changed the dataset to

train_dataset = [
    {
        'input': '流浪地球的导演是',
        'output': '郭帆.'
    } for _ in range(10000)
]

The loss is still NaN.

I also tried the first suggestion: setting config.ds_config={"bf16": {"enabled": True}} or config.ds_config={"fp16": {"enabled": True}} leads to an out-of-memory error. Is config.ds_config={"fp16": {"enabled": True}} supposed to enable mixed-precision training? Why does GPU memory usage increase after enabling it?

KaiLv69 commented 1 year ago

Hi, could there be something wrong with your weights? The weights I downloaded from Hugging Face do not produce NaN. The code is as follows:

# NOTE: the original snippet omits imports; the paths below are assumed from the
# CoLLiE layout used in the script above and may need adjusting.
from transformers import AutoTokenizer
from collie.config import CollieConfig
from collie.data import CollieDatasetForTraining
from collie.controller.trainer import Trainer
from collie.models.llama import LlamaForCausalLM  # assumed module path
from collie.optim import Lomo  # assumed module path

model_name = "fnlp/moss-base-7b"
config = CollieConfig.from_pretrained(model_name, trust_remote_code=True)
config.tp_size = 1
config.dp_size = 1
config.pp_size = 1
config.train_epochs = 1
config.train_micro_batch_size = 1
config.gradient_accumulation_steps = 1
config.ds_config = {
    # "fp16": {"enabled": True},
}

model = LlamaForCausalLM.from_pretrained(model_name, config)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=False)

train_dataset = [
    {
        'input': 'Collie is a python package for ',
        'output': 'finetuning large language models.'
    } for _ in range(10000)
]
train_dataset = CollieDatasetForTraining(train_dataset, tokenizer=tokenizer)
optimizer = Lomo(
    model,
    clip_grad_norm=1.0,
    lr=1e-3,
    loss_scale_args={
        "init_scale": 2 ** 14,
    },
)

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    config=config,
    train_dataset=train_dataset,
)
trainer.train()
fuqianya commented 1 year ago

Hi, I'm sure my weights were downloaded from the Hugging Face repo. I verified that the weights are correct by running the following inference code:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("fnlp/moss-base-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("fnlp/moss-base-7b", trust_remote_code=True).cuda()
model = model.eval()
inputs = tokenizer(["流浪地球的导演是"], return_tensors="pt")
for k, v in inputs.items():
    inputs[k] = v.cuda()  # move input tensors to the GPU where the model lives
outputs = model.generate(**inputs, do_sample=True, temperature=0.8, top_p=0.8, repetition_penalty=1.1, max_new_tokens=256)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
郭帆
主演分别是吴京和屈楚萧 还有李光洁刘德华等等
这电影可以说是目前国内科幻片的天花板了
票房也是突破50亿大关啦
小编真的非常期待这部电影呀
所以呢今天就给大家整理了关于影片中的很多细节图哦~
不知道大家有没有注意到呢

However, I don't have 80GB A100s, only 4x 32GB V100s, so I have to use pipeline parallelism. In addition, because of the 1F1B schedule of pipeline parallelism, Lomo cannot be used together with it. Could you provide an example that uses pipeline parallelism and trains with a normal loss?
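
A generic PyTorch sketch for localizing where non-finite values first appear in the forward pass (not CoLLiE-specific; the helper name add_nan_hooks is made up for illustration):

import torch

def add_nan_hooks(model):
    """Print the name of each module whose forward output contains NaN or inf."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"non-finite output detected in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: call add_nan_hooks(model) before trainer.train(); under pipeline
# parallelism each rank only reports the modules of its own stage.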

iyakiii commented 11 months ago

Any updates? Does bf16 still give NaN? How about fp16? For me, llama2 gives NaN with fp16 but not with bf16.

yueg-security commented 11 months ago

llama-7b, overfitting on a single sentence: fp16 gives NaN, bf16 does not.
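
This fp16-vs-bf16 behaviour is consistent with fp16's much smaller dynamic range; a quick standalone illustration in PyTorch:

import torch

x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf    -- fp16's largest finite value is 65504
print(x.to(torch.bfloat16))  # 70144. -- bf16 keeps fp32's exponent range at lower precision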