请教，qlora微调没效果

使用docker镜像`qwenllm/qwen:cu121`对`Qwen-7B-Chat-Int4`进行微调，docker命令如下：

IMAGE_NAME=qwenllm/qwen:cu121
CHECKPOINT_PATH=/home/deploy/qwen/Qwen-7B-Chat-Int4     # 下载的模型和代码路径 (Q-LoRA)
DATA_PATH=/home/deploy/qwen-finetune                    # 准备微调数据放在 ${DATA_PATH}/example.json
OUTPUT_PATH=/home/deploy/qwen-finetune/output/checkpoint          # 微调输出路径

# 如果需要指定用于训练的GPU，按照以下方式设置device（注意：内层的引号不可省略）
DEVICE='"device=0"'

mkdir -p ${OUTPUT_PATH}

# 单卡LoRA微调
docker run --gpus ${DEVICE} --rm --name qwen \
    --mount type=bind,source=${CHECKPOINT_PATH},target=/data/shared/Qwen/Qwen-7B-Chat-Int4 \
    --mount type=bind,source=${DATA_PATH},target=/data/shared/Qwen/data \
    --mount type=bind,source=${OUTPUT_PATH},target=/data/shared/Qwen/output_qwen \
    --shm-size=2gb \
    -it ${IMAGE_NAME} \
    bash finetune/finetune_qlora_single_gpu.sh -m /data/shared/Qwen/Qwen-7B-Chat-Int4/ -d /data/shared/Qwen/data/example.json

微调example.json内容如下：

[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "张三有限责任公司电话多少？"
      },
      {
        "from": "assistant",
        "value": "电话是18888888888，座机是0001-8888888。"
      },
      {
        "from": "user",
        "value": "张三有限责任公司联系方式？"
      },
      {
        "from": "assistant",
        "value": "公司电话是18888888888，座机是0001-8888888。"
      },
      {
        "from": "user",
        "value": "张三有限责任公司在哪？"
      },
      {
        "from": "assistant",
        "value": "位于北京市大兴区科创五街38号"
      }
    ]
  }
]

显卡是`RTX 4090 24G`，运行日志如下：

==========
== CUDA ==
==========

CUDA Version 12.1.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

[2024-04-03 02:45:03,643] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/usr/local/lib/python3.8/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.
You passed `quantization_config` to `from_pretrained` but the model you're loading already has a `quantization_config` attribute and has already quantized weights. However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
Try importing flash-attention for faster inference...
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.08s/it]
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
trainable params: 143,130,624 || all params: 1,388,056,576 || trainable%: 10.311584302454254
Loading data...
Formatting inputs...Skip in lazy mode
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Using /root/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu121/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -std=c++17 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/usr/local/lib/python3.8/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 31.454748392105103 seconds
/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
  0%|                                                                                                                                                                        | 0/5 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
 20%|████████████████████████████████                                                                                                                                | 1/5 [00:02<00:10,  2.65s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.1473, 'learning_rate': 0, 'epoch': 1.0}                                                                                                                                                 
 20%|████████████████████████████████                                                                                                                                | 1/5 [00:02<00:10,  2.65s/it]/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
 40%|████████████████████████████████████████████████████████████████                                                                                                | 2/5 [00:03<00:04,  1.50s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.1473, 'learning_rate': 0, 'epoch': 2.0}                                                                                                                                                 
 60%|████████████████████████████████████████████████████████████████████████████████████████████████                                                                | 3/5 [00:03<00:02,  1.11s/it]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.1473, 'learning_rate': 0, 'epoch': 3.0}                                                                                                                                                 
 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                | 4/5 [00:04<00:00,  1.09it/s]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.1473, 'learning_rate': 0, 'epoch': 4.0}                                                                                                                                                 
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.24it/s]tried to get lr value before scheduler/optimizer started stepping, returning lr=0
{'loss': 0.1473, 'learning_rate': 0, 'epoch': 5.0}                                                                                                                                                 
{'train_runtime': 5.2299, 'train_samples_per_second': 0.956, 'train_steps_per_second': 0.956, 'train_loss': 0.1473388671875, 'epoch': 5.0}                                                         
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.05s/it]
[root@localhost qwen-finetune]#

我使用docker镜像qwenllm/qwen:cu117启动模型，我更改了openai_api.py文件，改为使用`AutoPeftModelForCausalLM`加载模型，如下：

if __name__ == "__main__":
    args = _get_args()

    tokenizer = AutoTokenizer.from_pretrained(
        args.checkpoint_path,
        trust_remote_code=True,
        resume_download=True,
    )

    if args.api_auth:
        app.add_middleware(
            BasicAuthMiddleware, username=args.api_auth.split(":")[0], password=args.api_auth.split(":")[1]
        )

    if args.cpu_only:
        device_map = "cpu"
    else:
        device_map = "auto"

#    model = AutoModelForCausalLM.from_pretrained(
#        args.checkpoint_path,
#        device_map=device_map,
#        trust_remote_code=True,
#        resume_download=True,
#    ).eval()

    model = AutoPeftModelForCausalLM.from_pretrained(
        args.checkpoint_path,
        device_map="auto",
        trust_remote_code=True
    ).eval()

#    model.generation_config = GenerationConfig.from_pretrained(
#        args.checkpoint_path,
#        trust_remote_code=True,
#        resume_download=True,
#    )

    uvicorn.run(app, host=args.server_name, port=args.server_port, workers=1)

测试模型对话和微调前一样，如下：

{
    "model": "qwen",
    "stream": false,
    "messages": [
        {
            "role": "user",
            "content": "张三有限责任公司电话多少"
        }
    ]
}

{
    "model": "qwen",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "很抱歉，作为人工智能模型，我无法获取和提供个人或公司的联系方式。建议您通过其他途径（如官方网站、社交媒体等）查找相关信息。同时，请注意保护个人隐私和信息安全。",
                "function_call": null
            },
            "finish_reason": "stop"
        }
    ],
    "created": 1712113126
}

QwenLM / Qwen

请教，qlora微调没效果 #1188

使用docker镜像`qwenllm/qwen:cu121`对`Qwen-7B-Chat-Int4`进行微调，docker命令如下：

微调example.json内容如下：

显卡是`RTX 4090 24G`，运行日志如下：

我使用docker镜像qwenllm/qwen:cu117启动模型，我更改了openai_api.py文件，改为使用`AutoPeftModelForCausalLM`加载模型，如下：

测试模型对话和微调前一样，如下：

请教是哪个环节有问题

QwenLM / Qwen

请教，qlora微调没效果 #1188

使用docker镜像qwenllm/qwen:cu121对Qwen-7B-Chat-Int4进行微调，docker命令如下：

微调example.json内容如下：

显卡是RTX 4090 24G，运行日志如下：

我使用docker镜像qwenllm/qwen:cu117启动模型，我更改了openai_api.py文件，改为使用AutoPeftModelForCausalLM加载模型，如下：

测试模型对话和微调前一样，如下：

请教是哪个环节有问题

使用docker镜像`qwenllm/qwen:cu121`对`Qwen-7B-Chat-Int4`进行微调，docker命令如下：

显卡是`RTX 4090 24G`，运行日志如下：

我使用docker镜像qwenllm/qwen:cu117启动模型，我更改了openai_api.py文件，改为使用`AutoPeftModelForCausalLM`加载模型，如下：