root@f24a8b4b662d:/home/Telechat5/inference_telechat# python telechat_infer_demo.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:12<00:00, 2.66it/s]
**多轮输入演示**  (multi-turn input demo)
提问: 你是谁?  (Question: Who are you?)
Traceback (most recent call last):
  File "/home/Telechat5/inference_telechat/telechat_infer_demo.py", line 65, in <module>
    main()
  File "/home/Telechat5/inference_telechat/telechat_infer_demo.py", line 27, in main
    answer, history = model.chat(tokenizer=tokenizer, question=question, history=[], generation_config=generate_config,
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 878, in chat
    outputs = self.generate(inputs.to(self.device), generation_config=generation_config, **model_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1522, in generate
    return self.greedy_search(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2339, in greedy_search
    outputs = self(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 799, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 716, in forward
    outputs = block(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 540, in forward
    attn_outputs = self.self_attention(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 460, in forward
    context_layer = self.core_attention_flash(q, k, v)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/telechat-7B/modeling_telechat.py", line 210, in forward
    output = flash_attn_unpadded_func(
  File "/opt/conda/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 529, in flash_attn_varlen_func
    return FlashAttnVarlenFunc.apply(
  File "/opt/conda/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 288, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask = _flash_attn_varlen_forward(
  File "/opt/conda/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 52, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask = flash_attn_cuda.varlen_fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
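The trace bottoms out in flash_attn, which only ships CUDA kernels for compute capability 8.0 (Ampere) and newer, so the kernel launch is rejected on this machine's older GPU. A minimal sketch (assuming PyTorch with a visible CUDA device) to confirm what the GPU supports before loading the model:

    import torch

    # flash-attn requires an Ampere-or-newer GPU, i.e. compute capability >= 8.0
    # (A100, RTX 30xx/40xx, ...). Pre-Ampere cards (V100 is 7.0, T4 is 7.5) hit the
    # "FlashAttention only supports Ampere GPUs or newer" RuntimeError seen above.
    major, minor = torch.cuda.get_device_capability()
    print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
    if major < 8:
        print("Pre-Ampere GPU: disable the FlashAttention path in the model code "
              "or run on an Ampere-or-newer card.")

If the check reports a capability below 8.0, the options are to switch to hardware that flash-attn supports, or to route the model through its standard (non-flash) attention path.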