IEIT-Yuan / Yuan-2.0

Yuan 2.0 Large Language Model

[BUG] Streaming inference with the Hugging Face version raises an error #94

Open cauwulixuan opened 7 months ago

cauwulixuan commented 7 months ago

I was running streaming inference with the following code, based on this stream_generate section of fastchat-inference.py:

for i in range(max_new_tokens):
    if i == 0:  # prefill
        out = model(input_ids=start_ids, use_cache=True)
        logits = out.logits
        past_key_values = out.past_key_values
        ...
    else:  # decoding
        out = model(
            input_ids=torch.as_tensor(
                [[token] if not sent_interrupt else output_ids],
                device=device,
            ),
            use_cache=True,
            past_key_values=past_key_values if not sent_interrupt else None,
        )
        sent_interrupt = False
        logits = out.logits
        past_key_values = out.past_key_values
    ...
    probs = torch.softmax(last_token_logits, dim=-1)
    indices = torch.multinomial(probs, num_samples=2)
    tokens = [int(token) for token in indices.tolist()]
    token = tokens[0]
    output_ids.append(token)
    ...
...
  1. With use_flash_attention=True, inference works normally;
  2. With use_flash_attention=False, it fails with the following error:
    You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
    {'torch_dtype': torch.float16, 'revision': 'main'}
    YuanForCausalLM(
    (model): YuanModel(
    (embed_tokens): Embedding(135040, 2048, padding_idx=77185)
    (layers): ModuleList(
      (0-23): 24 x YuanDecoderLayer(
        (self_attn): YuanAttention(
          (v_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
          (lf_gate): LocalizedFiltering(
            (conv1): Conv2d(2048, 1024, kernel_size=(2, 1), stride=(1, 1), padding=(1, 0))
            (conv2): Conv2d(1024, 2048, kernel_size=(2, 1), stride=(1, 1), padding=(1, 0))
            (output_layernorm): LlamaRMSNorm()
          )
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): YuanMLP(
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
    )
    (lm_head): Linear(in_features=2048, out_features=135040, bias=False)
    )
    user: yuan2.0是谁开发的?
    assistant: Traceback (most recent call last):
    File "<frozen runpy>", line 198, in _run_module_as_main
    File "<frozen runpy>", line 88, in _run_code
    File "/github/FastChat/fastchat/serve/cli.py", line 304, in <module>
    main(args)
    File "/github/FastChat/fastchat/serve/cli.py", line 227, in main
    chat_loop(
    File "/github/FastChat/fastchat/serve/inference.py", line 532, in chat_loop
    outputs = chatio.stream_output(output_stream)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/github/FastChat/fastchat/serve/cli.py", line 63, in stream_output
    for outputs in output_stream:
    File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
               ^^^^^^^^^^^^^^^^^
    File "/github/FastChat/fastchat/serve/inference.py", line 160, in generate_stream
    out = model(
          ^^^^^^
    File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 938, in forward
    outputs = self.model(
              ^^^^^^^^^^^
    File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 768, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
    File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 426, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
    File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 358, in forward
    raise ValueError(
    ValueError: Attention mask should be of size (1, 1, 1, 10), but is torch.Size([1, 1, 1, 1])

Could this be related to how the relevant modules in yuan_hf_model.py handle this case?

The inference pattern I'm using above is fairly common, so if possible, could this issue be fixed?
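For reference, a minimal sketch that reproduces the same call pattern (prefill once, then pass only the newest token together with past_key_values and no attention_mask). The loading code here is illustrative: it assumes the IEITYuan/Yuan2-2B-hf checkpoint with trust_remote_code=True and the LlamaTokenizer shown in the log above.

import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("IEITYuan/Yuan2-2B-hf")
model = AutoModelForCausalLM.from_pretrained(
    "IEITYuan/Yuan2-2B-hf", torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

input_ids = tokenizer("yuan2.0是谁开发的?", return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    # prefill: run the full prompt once and keep the KV cache
    out = model(input_ids=input_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    # decoding step: only the newest token plus the cache, no attention_mask;
    # with use_flash_attention=False this is the call that raises the
    # "Attention mask should be of size (1, 1, 1, N) ..." ValueError
    out = model(input_ids=next_token, use_cache=True, past_key_values=past_key_values)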

ljg-ieisystem commented 7 months ago

This inference scenario was not considered when the Hugging Face version was developed. It can be fixed by changing the following code segment in yuan_hf_model.py; we will update the corresponding code to cover this case later.

if self.training or self.reset_position_ids and attention_mask is not None:
    attention_mask, _ = self._prepare_decoder_attention_mask_training(input_ids1, inputs_embeds, self.eod_token, reset_mask_flag, self.reset_attention_mask, self.reset_position_ids)
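Note that, by Python's operator precedence, this condition reads as self.training or (self.reset_position_ids and attention_mask is not None); so during ordinary inference with no attention_mask passed (as in the FastChat loop above), the training-style mask preparation is skipped, which presumably lets the regular inference mask path build a mask that matches the cached key/value length.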
cauwulixuan commented 7 months ago

This inference scenario seems fairly common. I modified the file locally and it does work now, thank you.

That only works because I have already downloaded the Hugging Face model locally. If I load it directly with from_pretrained("IEITYuan/Yuan2-2B-hf"), there is no way to patch it by hand, right? Will the official files be updated later?
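One possible workaround until the official files are updated (a sketch only: snapshot_download is a standard huggingface_hub call, and the yuan_hf_model.py file name comes from the traceback above) is to download the checkpoint once, patch the local copy, and load from that path:

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM

# download (or reuse the cached copy of) the model repo and get its local path
local_dir = snapshot_download("IEITYuan/Yuan2-2B-hf")
# ... edit yuan_hf_model.py inside local_dir as suggested in the previous comment ...
model = AutoModelForCausalLM.from_pretrained(local_dir, trust_remote_code=True)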

Shawn-IEITSystems commented 7 months ago

@ljg-ieisystem