QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.

[Bug]: Qwen2.5 72B GPTQ-Int8 inference on Nvidia L20 does not behave as expected #1006

Open renne444 opened 1 month ago

renne444 commented 1 month ago

Model Series

Qwen2

What are the models used?

Qwen2.5-72B-Instruct-GPTQ-Int8 and Qwen2-72B-Instruct-GPTQ-Int8

What is the scenario where the problem happened?

transformers

Is this a known issue?

Information about environment

Package                  Version          Editable project location
------------------------ ---------------- -------------------------
accelerate               0.34.2
aiohappyeyeballs         2.4.2
aiohttp                  3.10.8
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    24.2.0
auto_gptq                0.8.0.dev0+cu121 /mnt/download/AutoGPTQ
certifi                  2024.8.30
charset-normalizer       3.3.2
coloredlogs              15.0.1
datasets                 3.0.1
dill                     0.3.8
filelock                 3.16.1
frozenlist               1.4.1
fsspec                   2024.6.1
gekko                    1.2.1
huggingface-hub          0.25.1
humanfriendly            10.0
idna                     3.10
Jinja2                   3.1.4
MarkupSafe               2.1.5
mpmath                   1.3.0
multidict                6.1.0
multiprocess             0.70.16
networkx                 3.3
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        9.1.0.70
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.6.68
nvidia-nvtx-cu12         12.1.105
optimum                  1.22.0
packaging                24.1
pandas                   2.2.3
peft                     0.13.0
pip                      24.2
protobuf                 5.28.2
psutil                   6.0.0
pyarrow                  17.0.0
python-dateutil          2.9.0.post0
pytz                     2024.2
PyYAML                   6.0.2
regex                    2024.9.11
requests                 2.32.3
rouge                    1.0.1
safetensors              0.4.5
sentencepiece            0.2.0
setuptools               75.1.0
six                      1.16.0
sympy                    1.13.3
threadpoolctl            3.5.0
tokenizers               0.19.1
torch                    2.4.1
tqdm                     4.66.5
transformers             4.44.2
triton                   3.0.0
typing_extensions        4.12.2
tzdata                   2024.2
urllib3                  2.2.3
wheel                    0.44.0
xxhash                   3.5.0
yarl                     1.13.1

Log output

Loading checkpoint shards: 100%|██████████| 20/20 [05:54<00:00, 17.71s/it]
Traceback (most recent call last):
  File "/xxx/qwen_inference_loss_hugging_face.py", line 35, in <module>
    generated_ids = model.generate(
  File "/root/.conda/envs/gptq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/.conda/envs/gptq/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/root/.conda/envs/gptq/lib/python3.10/site-packages/transformers/generation/utils.py", line 3020, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Description

Steps to reproduce

I followed the sample here and tested Qwen2 and Qwen2.5 with transformers, and got an unexpected error. The detailed source code is below.

On a machine with eight L20 GPUs I reproduced the example, and the output is not as expected for the GPTQ-Int8 versions of both Qwen2 and Qwen2.5. I have also verified that the bf16 model behaves as expected.

The code is as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

# model_path = get_default_model_path("bf16")
model_path = "/mnt/Qwen/Qwen2.5-72B-Instruct-GPTQ-Int8"

device = "cuda"

max_memory = {0: "24GiB", 1: "34GiB", 2: "34GiB", 3: "34GiB", 4: "34GiB", 5: "34GiB", 6: "34GiB", 7: "34GiB"}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    max_memory=max_memory,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
attention_mask = model_inputs['attention_mask'].to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    attention_mask=attention_mask
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
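
As a quick diagnostic, here is a minimal sketch (assuming the model and model_inputs objects from the script above are already loaded) that runs a single forward pass and checks whether the next-token logits already contain inf/nan before sampling:

import torch

# Minimal sketch: one forward pass with the model and inputs from the script
# above, checking whether the next-token logits already contain inf/nan.
with torch.no_grad():
    next_token_logits = model(
        input_ids=model_inputs.input_ids,
        attention_mask=model_inputs.attention_mask,
    ).logits[:, -1, :]

print("dtype:", next_token_logits.dtype)
print("contains nan:", torch.isnan(next_token_logits).any().item())
print("contains inf:", torch.isinf(next_token_logits).any().item())
print("min/max:", next_token_logits.min().item(), next_token_logits.max().item())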
jklj077 commented 4 weeks ago

Please first try upgrading the driver. We had similar reports from users using multiple RTX 4090s (also Ada Lovelace cards).

In addition, I'm not sure auto_gptq works with torch 2.4.1.
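
For reference, a small sketch for collecting the version information relevant here (the choice of fields is an assumption about what is useful, not an official checklist):

import torch
import transformers
from importlib.metadata import version

# Print the versions that matter for GPTQ kernel compatibility.
print("torch:", torch.__version__, "| CUDA runtime:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0),
      "| compute capability:", torch.cuda.get_device_capability(0))
print("transformers:", transformers.__version__)
print("auto_gptq:", version("auto_gptq"))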

renne444 commented 4 weeks ago

@jklj077

Thank you very much for the support!

Following the version information in the Dockerfile, I installed the matching versions of the dependencies, including torch 2.2.2. On both A100 and L20 machines I ran the 0.5B, 7B, and 72B versions of Qwen2.5 with the example mentioned earlier; in every case the models run fine under vllm but fail with the provided transformers sample. The behavior is the same on a single GPU and on multiple GPUs.

Our A100 machines run driver 525.105.17 and our L20 machines run driver 535.161.07. In our cloud environment, upgrading the driver would be very costly.

Is it possible to run the provided example on the Ampere or Ada Lovelace architecture with the current driver versions?

Environment:

Package                  Version
------------------------ -----------
accelerate               1.0.0
aiohappyeyeballs         2.4.3
aiohttp                  3.10.9
aiosignal                1.3.1
async-timeout            4.0.3
attrs                    24.2.0
auto_gptq                0.7.1
autoawq                  0.2.5
autoawq_kernels          0.0.6
certifi                  2024.8.30
charset-normalizer       3.4.0
coloredlogs              15.0.1
datasets                 3.0.1
dill                     0.3.8
einops                   0.8.0
filelock                 3.16.1
frozenlist               1.4.1
fsspec                   2024.6.1
gekko                    1.2.1
huggingface-hub          0.25.2
humanfriendly            10.0
idna                     3.10
Jinja2                   3.1.4
MarkupSafe               3.0.1
mkl_fft                  1.3.10
mkl_random               1.2.7
mkl-service              2.4.0
mpmath                   1.3.0
multidict                6.1.0
multiprocess             0.70.16
networkx                 3.3
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.19.3
nvidia-nvjitlink-cu12    12.6.77
nvidia-nvtx-cu12         12.1.105
optimum                  1.20.0
packaging                24.1
pandas                   2.2.3
peft                     0.13.1
pillow                   10.4.0
pip                      24.2
propcache                0.2.0
protobuf                 5.28.2
psutil                   6.0.0
pyarrow                  17.0.0
python-dateutil          2.9.0.post0
pytz                     2024.2
PyYAML                   6.0.2
regex                    2024.9.11
requests                 2.32.3
rouge                    1.0.1
safetensors              0.4.5
scipy                    1.14.1
sentencepiece            0.2.0
setuptools               72.1.0
six                      1.16.0
sympy                    1.13.3
tiktoken                 0.8.0
tokenizers               0.19.1
torch                    2.2.2
torchaudio               2.2.2
torchvision              0.17.2
tqdm                     4.66.5
transformers             4.40.2
triton                   2.2.0
typing_extensions        4.12.2
tzdata                   2024.2
urllib3                  2.2.3
wheel                    0.44.0
xxhash                   3.5.0
yarl                     1.14.0
zstandard                0.23.0
jklj077 commented 4 weeks ago

Could you try the solution to the second issue at https://qwen.readthedocs.io/en/v2.0/quantization/gptq.html#troubleshooting? (We have not encountered this issue with Qwen2.5 yet.)

In general, if vllm works but auto_gptq doesn't, we recommend using vllm.
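
For completeness, a minimal vLLM sketch for the same checkpoint (the model path and tensor_parallel_size are placeholders for the 8x L20 setup above, and the sampling settings are arbitrary):

from vllm import LLM, SamplingParams

# Minimal sketch: load the GPTQ-Int8 checkpoint with vLLM instead of transformers.
# The path and tensor_parallel_size are placeholders for the setup described above.
llm = LLM(
    model="/mnt/Qwen/Qwen2.5-72B-Instruct-GPTQ-Int8",
    quantization="gptq",
    tensor_parallel_size=8,
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)
outputs = llm.generate(
    ["Give me a short introduction to large language model."],
    sampling_params,
)
print(outputs[0].outputs[0].text)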

gg22mm commented 2 weeks ago

Here is what worked for me:

1. I previously used torch_dtype=torch.float16, which no longer works. Removing it avoids the error, but inference becomes extremely slow:
model = AutoModelForCausalLM.from_pretrained("Qwen2.5-7B-Instruct", torch_dtype=torch.float16, device_map="auto")  # errors
model = AutoModelForCausalLM.from_pretrained("Qwen2.5-7B-Instruct", device_map="auto")  # no error, but extremely slow and buggy, so I gave up

2. Switching the model to Qwen2.5-7B-Instruct-GPTQ-Int4 gives the same error. Switching to Qwen2.5-7B-Instruct-GPTQ-Int8 with torch_dtype="auto" avoids the error, but it is also extremely slow and buggy, so I gave up.

3. Final solution: Qwen2.5-7B-Instruct-AWQ works and runs normally (problem solved):
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct-AWQ")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct-AWQ", torch_dtype="auto", device_map="auto")
I recommend Qwen2.5-7B-Instruct-AWQ: it is fast and stable, but the answers are heavily canned and cannot compare with the 1.5 models.

The Qwen2.5-7B-Instruct-AWQ output looks like this (all canned responses): [screenshot]

It feels worse than the earlier 1.5 series; what is going on?

jklj077 commented 2 weeks ago

@gg22mm your problem appears entirely different.

renne444 commented 2 hours ago

@jklj077 Thanks for your comment. I tried the approach you mentioned and tested the Qwen2.5 0.5B GPTQ-Int8 model with the official qwenllm/qwen:2-cu121 image. After loading the model with AutoGPTQForCausalLM and enabling use_triton, the output is abnormal, consisting of nothing but "!!!!!!!!!!!!!!!!!!!!!!!!!!!". With the Hugging Face transformers path both models produce normal output, except that the GPTQ-Int8 model prints "CUDA extension not installed." during inference and runs slowly, so the problem is still unresolved.

Below is my detailed code along with the output of each run. All of the code was executed inside the qwenllm/qwen:2-cu121 image, on the same OS and hardware environment as described above:

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
import os

# Absolute paths to the models
int8_model_path = "/Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8"
bf16_model_path = "/Qwen/Qwen2.5-0.5B-Instruct"

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
device = "cuda"

def get_model_input(tokenizer):
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to large language model."}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    return model_inputs

def gptq_inference(model_path):
    model = AutoGPTQForCausalLM.from_quantized(
        model_path,
        use_triton=True,
    )
    model.to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    print(tokenizer.decode(model.generate(
        **tokenizer("auto_gptq is", return_tensors="pt").to(model.device),
        max_new_tokens=512
    )[0]))
    print("################ FINISH inference for gptq model from:", model_path)

def hf_inference(model_path):
    model = AutoModelForCausalLM.from_pretrained(
        model_path
    )
    model.to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model_input = get_model_input(tokenizer)
    print(tokenizer.decode(model.generate(
        **model_input,
        max_new_tokens=512
    )[0]))
    print("################ FINISH inference for hf model from:", model_path)

# Run each of the following three calls separately
# hf_inference(bf16_model_path)
# hf_inference(int8_model_path) 
gptq_inference(int8_model_path)
root@3cd663442f39:/code/llm_deploy# python qwen_inference_gptq.py
CUDA extension not installed.
CUDA extension not installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant
Large Language Models (LLMs) are artificial intelligence systems that can generate human-like text based on input data. These models have the ability to understand and produce coherent and fluent responses to questions or prompts, as well as complete sentences with multiple clauses. LLMs are particularly useful in areas such as natural language processing, machine translation, chatbots, and virtual assistants. They rely on deep learning algorithms to learn patterns from vast amounts of data and use this knowledge to generate high-quality output.<|im_end|>
################ FINISH inference for hf model from: /Qwen/Qwen2.5-0.5B-Instruct

The 0.5B model does produce output this way, but with the 72B model inference is extremely slow, so this still does not solve the problem.

root@3cd663442f39:/code/llm_deploy# python qwen_inference_gptq.py
CUDA extension not installed.
CUDA extension not installed.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py:4371: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant
Sure! A large language model is an artificial intelligence (AI) system that can generate human-like text based on the input it receives. These models are designed to be highly accurate and versatile, capable of understanding and generating complex sentences with various levels of creativity and depth. Large language models are used in a wide range of applications, including natural language processing, machine translation, chatbots, and more. They are often compared to humans in terms of their ability to understand and respond to human language.<|im_end|>
################ FINISH inference for hf model from: /Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int
root@3cd663442f39:/code/llm_deploy# CUDA_VISIBLE_DEVICES=0 python qwen_inference_gptq.py
CUDA extension not installed.
CUDA extension not installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING - Exllamav2 kernel is not installed, reset disable_exllamav2 to True. This may because you installed auto_gptq using a pre-build wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speedup inference, you can re-install auto_gptq from source.
WARNING - CUDA kernels for auto_gptq are not installed, this will result in very slow inference speed. This may because:
1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
WARNING - ignoring unknown parameter in quantize_config.json: batch_size.
WARNING - ignoring unknown parameter in quantize_config.json: block_name_to_quantize.
WARNING - ignoring unknown parameter in quantize_config.json: cache_block_outputs.
WARNING - ignoring unknown parameter in quantize_config.json: dataset.
WARNING - ignoring unknown parameter in quantize_config.json: exllama_config.
WARNING - ignoring unknown parameter in quantize_config.json: max_input_length.
WARNING - ignoring unknown parameter in quantize_config.json: model_seqlen.
WARNING - ignoring unknown parameter in quantize_config.json: module_name_preceding_first_block.
WARNING - ignoring unknown parameter in quantize_config.json: modules_in_block_to_quantize.
WARNING - ignoring unknown parameter in quantize_config.json: pad_token_id.
WARNING - ignoring unknown parameter in quantize_config.json: quant_method.
WARNING - ignoring unknown parameter in quantize_config.json: tokenizer.
WARNING - ignoring unknown parameter in quantize_config.json: use_cuda_fp16.
WARNING - ignoring unknown parameter in quantize_config.json: use_exllama.
INFO - The layer lm_head is not quantized.
Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
################ FINISH inference for gptq model from: /Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8

Regarding the above, I have three questions:

1. I run the official image directly with bash and then run my own script under /code. Is this the correct way to run it?
docker run --gpus="all" -v /path/Qwen/Qwen2-0.5B-Instruct-GPTQ-Int8:/Qwen2-72B -v /path/to/code:/code --ipc=host -it "qwenllm/qwen:2-cu121" "/bin/bash"

2. In the official image, loading any GPTQ-Int8 model prints the "CUDA extension not installed." warning. This is the same error I hit before; after downloading and building the AutoGPTQ library myself, the other errors mentioned earlier in this issue appear instead. How should this kind of environment problem be resolved?

3. I have also run the program above on an A100 machine and got almost the same errors. Can we conclude that the problem is unrelated to the Ada Lovelace architecture?

jklj077 commented 1 hour ago

> With the Hugging Face transformers path both models produce normal output, except that the GPTQ-Int8 model prints "CUDA extension not installed." during inference and runs slowly, so the problem is still unresolved.

transformers relies on the auto_gptq package for GPTQ support. auto_gptq provides several implementations for GPTQ models, including a plain torch implementation and efficient kernel implementations such as cuda-old, cuda, exllama, exllamav2, triton, tritonv2, and marlin. However, not all of these implementations can be used by transformers.

"CUDA extension not installed." indicates that transformers could not use the efficient kernel implementations and fell back to the plain implementation, which is slow but should be correct. If your output is normal, we can at least rule out problems with the model weights.

"CUDA extension not installed." should not happen with the docker image (at least for qwenllm/qwen:2-cu121), which ships compatible versions of the CUDA compiler, torch, and auto_gptq. The warning is really unexpected if you used the docker image. I don't have any clues on this and am tagging it as help wanted.

P.S.: Can you try using vllm<0.6.3?