File "/home/marco/Build/nlp/Qwen/cli_demo.py", line 210, in <module>
main()
File "/home/marco/Build/nlp/Qwen/cli_demo.py", line 198, in main
for response in model.chat_stream(tokenizer, query, history=history, generation_config=config):
File "/home/marco/.cache/huggingface/modules/transformers_modules/Qwen-14B-Chat/modeling_qwen.py", line 1214, in stream_generator
for token in self.generate_stream(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 56, in generator_context
response = gen.send(request)
File "/usr/local/lib/python3.10/dist-packages/transformers_stream_generator/main.py", line 969, in sample_stream
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
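For context, a minimal standalone illustration (not taken from the Qwen code or this report): `torch.multinomial` raises exactly this `RuntimeError` whenever the probability tensor contains NaN or Inf, which usually means the logits produced by the model were already NaN before softmax.

```python
import torch

# Illustration only (assumption): torch.multinomial rejects a probability
# tensor containing NaN/Inf with the same RuntimeError, regardless of where
# the bad values were introduced upstream.
probs = torch.tensor([[0.5, float("nan"), 0.5]])
try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(e)  # probability tensor contains either `inf`, `nan` or element < 0
```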
当前行为 | Current Behavior
Four RTX 3090s (24 GB each).
I downloaded Qwen-14B-Chat, Qwen-14B-Chat-Int8, Qwen-72B-Chat-Int4, Qwen-72B-Chat-Int8, and Qwen-7B-Chat.
Apart from the 72B variants, which do not fit on a single card, every model runs fine end to end on a single GPU.
As soon as I move to multiple GPUs (whether two or four), the model loads without problems and nvidia-smi shows the VRAM split almost evenly across the cards, but the moment I run chat or any other actual inference, the error above is raised. I have tried this both in a Docker environment and natively.
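For reference, a hedged sketch of the kind of multi-GPU load being described (the exact loading code is not included in this report; `device_map="auto"` and the model path are assumptions about how the weights were split across the cards):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Assumed setup; the report does not show the actual loading code.
model_path = "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # shards the layers across all visible GPUs
    trust_remote_code=True,
).eval()
model.generation_config = GenerationConfig.from_pretrained(model_path, trust_remote_code=True)

# Loading succeeds and nvidia-smi shows the memory split across the cards,
# but streaming inference then fails with the RuntimeError above:
for response in model.chat_stream(tokenizer, "Hello", history=None):
    print(response)
```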
期望行为 | Expected Behavior
I expect the 72B models to work in a multi-GPU environment.
复现方法 | Steps To Reproduce
No response
运行环境 | Environment
备注 | Anything else?
No response