[BUG] 对qwen-7b模型微调后，输出句子断句不正常，直接从句子中间停止

twwch commented 3 months ago

是否已有关于该错误的issue或讨论？ | Is there an existing issue / discussion for this?

[X] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions

该问题是否在FAQ中有解答？ | Is there an existing answer for this in FAQ?

[X] 我已经搜索过FAQ | I have searched FAQ

当前行为 | Current Behavior

对qwen-7b模型微调后，输出句子断句不正常，直接从句子中间停止 img_v3_029d_06cdbae5-3b79-4227-a572-4db5bdcf24cg

而且感觉他生成了几组，用\n给我隔开了

期望行为 | Expected Behavior

只输出一组句子并且正常断句

复现方法 | Steps To Reproduce

No response

运行环境 | Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

备注 | Anything else?

No response

jklj077 commented 3 months ago

please provide steps to reproduce.

twwch commented 3 months ago

训练脚本

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

# Guide:
# This script supports distributed training on multi-gpu workers (as well as single-worker training).
# Please set the options below according to the comments.
# For multi-gpu workers training, these options should be manually set for each worker.
# After setting the options, please run the script on each worker.

# Number of GPUs per GPU worker
GPUS_PER_NODE=$(python -c 'import torch; print(torch.cuda.device_count())')

# Number of GPU workers, for single-worker training, please set to 1
NNODES=${NNODES:-1}

# The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
NODE_RANK=${NODE_RANK:-0}

# The ip address of the rank-0 worker, for single-worker training, please set to localhost
MASTER_ADDR=${MASTER_ADDR:-localhost}

# The port for communication
MASTER_PORT=${MASTER_PORT:-6001}

MODEL="Qwen/Qwen-7B" # Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="path_to_data"

function usage() {
    echo '
Usage: bash finetune/finetune_ds.sh [-m MODEL_PATH] [-d DATA_PATH]
'
}

while [[ "$1" != "" ]]; do
    case $1 in
        -m | --model )
            shift
            MODEL=$1
            ;;
        -d | --data )
            shift
            DATA=$1
            ;;
        -h | --help )
            usage
            exit 0
            ;;
        * )
            echo "Unknown argument ${1}"
            exit 1
            ;;
    esac
    shift
done

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path /data1/chenhao/models/Qwen/Qwen-7B \
    --data_path ./data/qwen_train.json \
    --bf16 True \
    --output_dir output_qwen2 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed finetune/ds_config_zero3.json

推理脚本

import time

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/data1/chenhao/codes/Qwen/output_qwen"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)
model.eval()

input_text = """
你是一个使用先进的自然语言处理技术能快速准确地概括文档的人工智能助手，善于对文档合并重复内容、提炼重点内容，具备对文档内容深度理解的能力，并能有效识别文档的核心内容，排除掉与文章标题不相关的干扰。文档内容如下：

3 优化使用了AdamW优化器进行训练。β1和β2分别设置为0.9和0.95。我们使用了权重衰减为0.1，并将梯度范数剪切为0.5。模型在2,000个线性缩放步骤后进行预热，达到最大学习率，然后应用余弦衰减到最小学习率。参数细节和学习率如下表所示。

Baichuan2的模型细节整个模型使用BFloat16混合精度进行训练。与Float16相比，BFloat16具有更好的动态范围，使其对训练大语言模型中关键的大值更加稳健。然而，BFloat16的低精度在某些情况下会引发问题。例如，在某些公共RoPE和ALibi实现中，当整数超过256时，torch.arange操作会由于碰撞而失败，从而阻止了对附近位置的微分。因此，我们对某些值敏感的操作，如位置嵌入，使用全精度。

NormHead：为了稳定训练并提高模型性能，我们对输出嵌入（也称为“head”）进行了归一化处理。在我们的实验中，NormHead有两个优点。首先，在初步实验中，我们发现头部的范数容易不稳定。稀有标记的嵌入的范数在训练期间变小，扰乱了训练动态。NormHead可以显著稳定动态。其次，我们发现语义信息主要由嵌入的余弦相似性而不是L2距离编码。由于当前的线性分类器通过点积计算logits，这是L2距离和余弦相似性的混合。NormHead减轻了在计算logits时L2距离的干扰。

Max-z loss：在训练过程中，我们发现LLM的logits可能会变得非常大。虽然softmax函数对绝对logit值是不可知的，因为它仅依赖于它们的相对值。但大的logits在推理过程中会引发问题，因为常见的重复惩罚实现（例如model.generate中的Hugging Face实现）将标量（例如1.1或1.2）直接应用于logits。以这种方式收缩非常大的logits可以显着改变softmax之后的概率，使模型对重复惩罚超参数的选择变得敏感。受NormSoftmax和PaLM中的辅助z-loss启发，添加了max-z loss来规范logits：

其中z是最大的logit值。这有助于稳定训练并使推断更加稳健，不容易受到超参数的影响。

4 Scaling law 神经尺度律是指，误差随着训练集大小、模型大小或两者之间的幂函数关系而减小，这种尺度律在深度学习和大语言模型中的训练变得越来越昂贵时，已经取得了令人满意的性能。在训练数十亿参数的大型语言模型之前，我们首先训练了一些小型模型，并拟合了训练更大模型的尺度律。我们启动了一系列模型大小，从10M到3B不等，相对于最终模型的大小，范围从1/1000到1/10，每个模型最多训练1万亿个token，使用一致的超参数和相同的数据集，数据集来自Baichuan 2。根据不同模型的最终损失，我们可以得到从训练flops到目标损失的映射。

为了拟合模型的尺度律，我们采用了Henighan等人（2020）提供的公式：

其中L∞是不可减小的损失，第一项是可减小的损失，它被公式化为幂律缩放项。C是训练flops，是该flops下模型的最终损失。我们使用SciPy库的curve_fit函数来拟合参数。最终拟合的尺度曲线和预测的70亿和130亿参数模型的最终损失显示在下图中。我们可以看到，拟合的尺度律高度准确地预测了Baichuan 2的最终损失。

Baichuan2的尺度率。们训练了各种模型，从1000万到30亿个参数，1万亿个token。通过将幂律项拟合到给定训练失败的损失，我们预测了在2.6万亿token上训练Baichuan2-7B和Baichuan2-13B的损失。这个拟合过程精确地预测了最终模型的损失(用两颗星标记)。 5 算力开发了一种协同设计方法，包括一个弹性训练框架和智能集群调度策略。

由于GPU被多个用户和任务共享，每个任务的具体行为是不可预测的，这经常导致集群中存在空闲的GPU节点。考虑到单台配备八个A800 GPU的机器可以充分满足Baichuan 7B和Baichuan 13B模型的内存需求，因此我们训练框架的主要设计标准是机器级别的弹性，它支持根据集群状态动态修改任务的资源分配，从而为智能调度算法奠定了基础。

为满足机器级别的弹性要求，训练框架集成了张量并行和由ZeRO提供支持的数据并行。在每台机器内设置张量并行，并使用ZeRO共享数据并行ism来实现跨机器的弹性扩展。

采用了张量分裂技术，通过将某些计算进行分裂，以减少峰值内存消耗，比如具有大词汇表的交叉熵计算。

应用混合精度训练，在BFloat16中执行前向和反向计算，同时在Float32中执行优化器更新。此外，为了有效地将训练集群扩展到数千个GPU，整合了以下技术，以避免通信效率下降：

1）基于拓扑的分布式训练。在大规模集群中，网络连接通常跨越多层交换机。我们策略性地安排了分布式训练的排名，以最小化跨不同交换机的频繁访问，从而降低延迟，提高整体训练效率。

2） ZeRO的混合和分层分区。通过在GPU之间分区参数，ZeRO3减少了内存消耗，但增加了额外的全局聚集通信。当扩展到数千个GPU时，这种方法可能导致显著的通信瓶颈。为了解决这个问题，我们提出了一种混合和分层分区方案。具体而言，我们的框架首先将优化器状态分区到所有GPU上，然后自适应地决定哪些层需要激活ZeRO3，以及是否分层分区参数。

通过整合这些策略，我们的系统能够在1,024个NVIDIA A800 GPU上高效地训练Baichuan 2-7B和Baichuan 2-13B模型，实现的计算效率超过180 TFLOPS。

6 对齐 Baichuan2的两个对话模型：Baichuan 2-7B-Chat和Baichuan 2-13B-Chat。Baichuan 2的对齐过程包括两个主要组成部分：监督微调（SFT）和从人类反馈中进行的强化学习（RLHF）。

1）有监督微调

在监督微调阶段，我们使用人类标注员对从各种数据源收集的提示进行标注。根据类似于Claude（2023）的关键原则，每个提示都被标记为有用或无害。为了验证数据质量，我们使用交叉验证，一个权威标注员检查由特定众包工作者组标注的样本批次的质量，拒绝不符合我们的质量标准的批次。

我们收集了超过100,000个监督微调样本，并在它们上训练了我们的基础模型。接下来，我们通过RLHF方法明确了强化学习过程，以进一步改进结果。包括RM和RL训练在内的整个RLHF过程如下图所示。

"""

start = time.time()
inputs = tokenizer(input_text, return_tensors='pt', max_length=8192, truncation=True)
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, num_return_sequences=1, repetition_penalty=1.1)
output = tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)
output = output.replace(input_text, "")

print(output)

print("cost time: ", time.time() - start)

jklj077 commented 3 months ago

The finetune.py in this repo produces a chat model that needs a specific template to work, even if you're finetuing from the base model.

It seems that you're using the model.generate which does not take account of the chat template. Please refer to the README on how to use the model.chat method.

twwch commented 3 months ago

The finetune.py in this repo produces a chat model that needs a specific template to work, even if you're finetuing from the base model.

It seems that you're using the model.generate which does not take account of the chat template. Please refer to the README on how to use the model.chat method.

如何对base model进行全参微调呢？

jklj077 commented 3 months ago

You could modify the preprocess functions in finetune.py and adapt it according to your usecase. You may also need to update the inference code in accordance (typically involving the stopping criteria).

QwenLM / Qwen