QwenLM / Qwen

The official repo of Qwen (通义千问), the chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] Inference fails on four 3090 GPUs #898

Closed: chopin1998 closed this issue 9 months ago

chopin1998 commented 9 months ago

Current Behavior

Apart from the 72B checkpoints, which cannot fit on a single GPU, every other model runs end to end on a single card without problems.

With multiple GPUs (whether two or four), the model loads fine and nvidia-smi shows the VRAM split almost evenly across the cards, but as soon as chat or any other actual inference runs, it errors out. I have tried this in both a Docker environment and a native environment:

  File "/home/marco/Build/nlp/Qwen/cli_demo.py", line 210, in <module>
    main()
  File "/home/marco/Build/nlp/Qwen/cli_demo.py", line 198, in main
    for response in model.chat_stream(tokenizer, query, history=history, generation_config=config):
  File "/home/marco/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat/modeling_qwen.py", line 1214, in stream_generator
    for token in self.generate_stream(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "/usr/local/lib/python3.10/dist-packages/transformers_stream_generator/main.py", line 969, in sample_stream
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
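
For context, the failing call in cli_demo.py corresponds to a setup along these lines (a minimal sketch: the checkpoint path, prompt, and generation settings are illustrative, and chat_stream additionally requires the transformers_stream_generator package):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

model_path = "Qwen/Qwen-14B-Chat"  # illustrative; matches the model in the traceback

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# device_map="auto" shards the weights across all visible GPUs, which is
# consistent with nvidia-smi showing VRAM split almost evenly between cards.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    trust_remote_code=True,
).eval()
config = GenerationConfig.from_pretrained(model_path, trust_remote_code=True)

# chat_stream is the streaming entry point defined in Qwen's modeling_qwen.py;
# in the multi-GPU setup described above, the RuntimeError is raised here.
for response in model.chat_stream(tokenizer, "hello", history=None,
                                  generation_config=config):
    print(response)
```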

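The RuntimeError itself comes from input validation in torch.multinomial: any inf, nan, or negative entry in the probability tensor is rejected. A nan at this point typically means the logits produced by the forward pass were already nan, so the sampling step is only where the corruption becomes visible. A minimal demonstration (the message quoted in the traceback is the CUDA wording; the CPU message is phrased slightly differently):

```python
import torch

# A probability row containing nan, as would result from nan logits
# passed through softmax.
probs = torch.tensor([[0.4, float("nan"), 0.6]])
if torch.cuda.is_available():
    probs = probs.cuda()

# Raises RuntimeError; on CUDA the message is exactly
# "probability tensor contains either `inf`, `nan` or element < 0".
torch.multinomial(probs, num_samples=1)
```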

Expected Behavior

I expect the 72B model to work in a multi-GPU environment.

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
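
As an aside, the template fields above can be filled in with a quick snippet like this (a convenience sketch, not part of the original report):

```python
import platform
import sys

import torch
import transformers

print("OS:", platform.platform())
print("Python:", sys.version.split()[0])
print("Transformers:", transformers.__version__)
print("PyTorch:", torch.__version__)
print("CUDA:", torch.version.cuda)
```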

Anything else?

No response

jklj077 commented 9 months ago

Closed in favor of https://github.com/QwenLM/Qwen/issues/848. Please move the discussion there.