QwenLM / Qwen

The official repository of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.

[BUG] Testing the Qwen-72B-Int4 model with the official profile.py is very slow, far below the officially reported speed #1021

Closed · ArlanCooper closed this issue 6 months ago

ArlanCooper commented 8 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior


import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
from transformers.trainer_utils import set_seed
from tqdm import tqdm

seed = 1024
max_experiment_times = 1
context_length_per_experiment = 1
generate_length_per_experiment = 2048
# context_length_per_experiment = 1
# generate_length_per_experiment = 8192
# context_length_per_experiment = 2048
# generate_length_per_experiment = 1
use_flash_attn = False
# fp32 without quantization, bf16 without quantization or int4 quantization (updated on 8.28: we have updated the quantization strategy to AutoGPTQ in this script)
quant_type = "int4" # fp32, bf16 or int4

set_seed(seed)

tokenizer = AutoTokenizer.from_pretrained("/data/share/rwq/Qwen-72B-Chat-Int4", trust_remote_code=True)
# Calling the official code here raised an error, so this was changed to AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "/data/share/rwq/Qwen-72B-Chat-Int4", 
    device_map="auto", 
    trust_remote_code=True
).eval()

# Specify hyperparameters for generation
config = GenerationConfig.from_pretrained("/data/share/rwq/Qwen-72B-Chat-Int4", trust_remote_code=True)
config.min_length = generate_length_per_experiment + context_length_per_experiment
config.max_new_tokens = generate_length_per_experiment

time_costs = []
context_str = '我' * context_length_per_experiment
max_gpu_memory_cost = 0
for _ in tqdm(range(max_experiment_times)):
    inputs = tokenizer(context_str, return_tensors='pt')
    inputs = inputs.to(model.device)
    t1 = time.time()
    pred = model.generate(**inputs, generation_config=config)
    time_costs.append(time.time() - t1)
    print(f"pred:{len(pred)}")
    print(f"config.min_length:{config.min_length}")
    assert pred.shape[1] == config.min_length
    max_gpu_memory_cost = max(max_gpu_memory_cost, torch.cuda.max_memory_allocated())
    torch.cuda.empty_cache()

print("Average generate speed (tokens/s): {}".format((max_experiment_times * generate_length_per_experiment) / sum(time_costs)))
print(f"GPU Memory cost: {max_gpu_memory_cost / 1024 / 1024 / 1024}GB")
print("Experiment setting: ")
print(f"seed = {seed}")
print(f"max_experiment_times = {max_experiment_times}")
print(f"context_length_per_experiment = {context_length_per_experiment}")
print(f"generate_length_per_experiment = {generate_length_per_experiment}")
print(f"use_flash_attn = {use_flash_attn}")
print(f"quant_type = {quant_type}")

Test results:


pred:1
config.min_length:2049
Average generate speed (tokens/s): 0.5636905039032049
GPU Memory cost: 10.393836498260498GB
Experiment setting: 
seed = 1024
max_experiment_times = 1
context_length_per_experiment = 1
generate_length_per_experiment = 2048
use_flash_attn = False
quant_type = int4

The officially reported speed is 11.32 tokens/s, but in my test it is only 0.5636905039032049 tokens/s.

Could you explain why? I am using 3× A100 80G GPUs.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS: Ubuntu 22.04
- Python: 3.10.12
- Transformers: 4.36.2
- PyTorch: 2.1.2+cu118
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.8

Anything else?

No response

jklj077 commented 8 months ago
  1. Multi-GPU inference in transformers uses naive model parallelism: only one GPU is active at any time, so it is slower than running on a single GPU.
  2. The auto-gptq installation may be broken, so no optimized CUDA kernel was matched (see the sketch below).
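
For reference, a minimal sketch of how both points could be checked. The extension module names (exllama_kernels, exllamav2_kernels, autogptq_cuda_64, autogptq_cuda_256) are assumptions about how the auto-gptq wheels package their compiled kernels and may differ between versions; device_map={"": 0} is the standard transformers/accelerate way to keep the whole model on one GPU.

import importlib

# Check whether auto-gptq's compiled CUDA kernels can be imported (point 2).
# The module names are assumptions about the auto-gptq build; adjust for your version.
for ext in ("exllama_kernels", "exllamav2_kernels", "autogptq_cuda_64", "autogptq_cuda_256"):
    try:
        importlib.import_module(ext)
        print(f"{ext}: available")
    except ImportError as err:
        # Missing kernels mean generation falls back to a much slower dequantization path.
        print(f"{ext}: missing ({err})")

# Keep the whole model on a single GPU instead of splitting it across cards (point 1).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/data/share/rwq/Qwen-72B-Chat-Int4",
    device_map={"": 0},  # place every layer on cuda:0, avoiding naive model parallelism
    trust_remote_code=True,
).eval()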
ArlanCooper commented 8 months ago
> 1. Multi-GPU inference in transformers uses naive model parallelism: only one GPU is active at any time, so it is slower than running on a single GPU.
> 2. The auto-gptq installation may be broken, so no optimized CUDA kernel was matched.

For point 1, I can try a single GPU. For point 2, what does "the auto-gptq installation may be broken" mean exactly? In other words, should I load the model like this instead:
    
    from auto_gptq import AutoGPTQForCausalLM

    model = AutoGPTQForCausalLM.from_quantized(
        "/data/share/rwq/Qwen-7B-Chat-Int4",
        device="cuda:1",
        trust_remote_code=True,
        use_safetensors=True,
        use_flash_attn=use_flash_attn
    ).eval()

Would that make it faster?
ArlanCooper commented 8 months ago
> 1. Multi-GPU inference in transformers uses naive model parallelism: only one GPU is active at any time, so it is slower than running on a single GPU.
> 2. The auto-gptq installation may be broken, so no optimized CUDA kernel was matched.

I tried with a single GPU, and it is still very slow:

pred:1
config.min_length:2049
Average generate speed (tokens/s): 0.6753895362379109
GPU Memory cost: 49.416122913360596GB
Experiment setting: 
seed = 1024
max_experiment_times = 1
context_length_per_experiment = 1
generate_length_per_experiment = 2048
use_flash_attn = False
quant_type = int4

Not much faster.
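
For reference, one way to confirm which kernel the loaded Int4 model is actually using is to inspect its quantized linear modules. This is a minimal introspection sketch that assumes, as in auto-gptq, that the quantized layers carry "QuantLinear" in their class name; it reuses the `model` object loaded in the script above.

from collections import Counter

# Count which QuantLinear implementation backs the quantized layers of the loaded model.
# A "cuda_old"-style module typically indicates the slow fallback path, while an
# exllama-based module indicates the fast kernels are in use.
impls = Counter(
    f"{type(m).__module__}.{type(m).__name__}"
    for _, m in model.named_modules()
    if "QuantLinear" in type(m).__name__
)
for impl, count in impls.most_common():
    print(f"{count:5d} x {impl}")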

qingqinghu1026 commented 7 months ago

I have run into the same problem. Has it been resolved? My GPUs are 2× Quadro M5000 8G, and the inference speed is 0.5 tokens/s.

jklj077 commented 6 months ago

https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/INSTALLATION.md
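
For reference, a small sketch that prints the local versions needed to pick the matching prebuilt wheel from that page; the exact wheel or command to install should be taken from the linked INSTALLATION.md.

import torch
from importlib.metadata import PackageNotFoundError, version

# Print the versions that determine which prebuilt auto-gptq wheel matches this
# environment (PyTorch build, its CUDA version, and the currently installed auto-gptq).
print("torch          :", torch.__version__)
print("torch CUDA     :", torch.version.cuda)
print("CUDA available :", torch.cuda.is_available())
try:
    print("auto-gptq      :", version("auto-gptq"))
except PackageNotFoundError:
    print("auto-gptq      : not installed")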

linzy5 commented 4 months ago

I hit the same problem with Qwen-1.5. Installing the auto-gptq build that matches my CUDA and PyTorch versions, as listed at https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/INSTALLATION.md, fixed it.

ArlanCooper commented 4 months ago

> I hit the same problem with Qwen-1.5. Installing the auto-gptq build that matches my CUDA and PyTorch versions, as listed at https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/INSTALLATION.md, fixed it.

OK, thanks, I will give it a try.

Gary828 commented 1 month ago

> I hit the same problem with Qwen-1.5. Installing the auto-gptq build that matches my CUDA and PyTorch versions, as listed at https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/INSTALLATION.md, fixed it.

> OK, thanks, I will give it a try.

Hello, did this solve the problem for you?