IEIT-Yuan / Yuan-2.0

Yuan 2.0 Large Language Model

[BUG] Streaming inference with the Hugging Face version raises an error #94

Open cauwulixuan opened 7 months ago

cauwulixuan commented 7 months ago

I was running streaming inference with the following code, based on this stream_generate section of fastchat-inference.py:

for i in range(max_new_tokens):
    if i == 0:  # prefill
        out = model(input_ids=start_ids, use_cache=True)
        logits = out.logits
        past_key_values = out.past_key_values
        ...
    else:  # decoding
        out = model(
            input_ids=torch.as_tensor(
                [[token] if not sent_interrupt else output_ids],
                device=device,
            ),
            use_cache=True,
            past_key_values=past_key_values if not sent_interrupt else None,
        )
        sent_interrupt = False
        logits = out.logits
        past_key_values = out.past_key_values
    ...
    probs = torch.softmax(last_token_logits, dim=-1)
    indices = torch.multinomial(probs, num_samples=2)
    tokens = [int(token) for token in indices.tolist()]
    token = tokens[0]
    output_ids.append(token)
    ...
...
  1. With use_flash_attention=True, inference works normally;
  2. With use_flash_attention=False, it fails with the following error:
    You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
    {'torch_dtype': torch.float16, 'revision': 'main'}
    YuanForCausalLM(
    (model): YuanModel(
    (embed_tokens): Embedding(135040, 2048, padding_idx=77185)
    (layers): ModuleList(
      (0-23): 24 x YuanDecoderLayer(
        (self_attn): YuanAttention(
          (v_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
          (lf_gate): LocalizedFiltering(
            (conv1): Conv2d(2048, 1024, kernel_size=(2, 1), stride=(1, 1), padding=(1, 0))
            (conv2): Conv2d(1024, 2048, kernel_size=(2, 1), stride=(1, 1), padding=(1, 0))
            (output_layernorm): LlamaRMSNorm()
          )
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): YuanMLP(
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
    )
    (lm_head): Linear(in_features=2048, out_features=135040, bias=False)
    )
    user: yuan2.0是谁开发的?
    assistant: Traceback (most recent call last):
    File "<frozen runpy>", line 198, in _run_module_as_main
    File "<frozen runpy>", line 88, in _run_code
    File "/github/FastChat/fastchat/serve/cli.py", line 304, in <module>
    main(args)
    File "/github/FastChat/fastchat/serve/cli.py", line 227, in main
    chat_loop(
    File "/github/FastChat/fastchat/serve/inference.py", line 532, in chat_loop
    outputs = chatio.stream_output(output_stream)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/github/FastChat/fastchat/serve/cli.py", line 63, in stream_output
    for outputs in output_stream:
    File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
               ^^^^^^^^^^^^^^^^^
    File "/github/FastChat/fastchat/serve/inference.py", line 160, in generate_stream
    out = model(
          ^^^^^^
    File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 938, in forward
    outputs = self.model(
              ^^^^^^^^^^^
    File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 768, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
    File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 426, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
    File "/opt/conda/envs/fc/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/.cache/huggingface/modules/transformers_modules/yuan_hf_model.py", line 358, in forward
    raise ValueError(
    ValueError: Attention mask should be of size (1, 1, 1, 10), but is torch.Size([1, 1, 1, 1])

Could this be related to how the relevant modules in yuan_hf_model.py handle this case?

The inference pattern I'm using above is fairly common, so if possible, could this issue be fixed?
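For reference, a minimal sketch that reproduces the same call pattern (prefill once, then pass only the newest token together with past_key_values and no attention_mask). The loading code here is illustrative: it assumes the IEITYuan/Yuan2-2B-hf checkpoint with trust_remote_code=True and the LlamaTokenizer shown in the log above.

import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("IEITYuan/Yuan2-2B-hf")
model = AutoModelForCausalLM.from_pretrained(
    "IEITYuan/Yuan2-2B-hf", torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

input_ids = tokenizer("yuan2.0是谁开发的?", return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    # prefill: run the full prompt once and keep the KV cache
    out = model(input_ids=input_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    # decoding step: only the newest token plus the cache, no attention_mask;
    # with use_flash_attention=False this is the call that raises the
    # "Attention mask should be of size (1, 1, 1, N) ..." ValueError
    out = model(input_ids=next_token, use_cache=True, past_key_values=past_key_values)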

ljg-ieisystem commented 7 months ago

This inference scenario was not considered when the Hugging Face version was developed. It can be fixed by changing the following code segment in yuan_hf_model.py; we will update the corresponding code to cover this case later.

if self.training or self.reset_position_ids and attention_mask is not None:
    attention_mask, _ = self._prepare_decoder_attention_mask_training(input_ids1, inputs_embeds, self.eod_token, reset_mask_flag, self.reset_attention_mask, self.reset_position_ids)
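Note that, by Python's operator precedence, this condition reads as self.training or (self.reset_position_ids and attention_mask is not None); so during ordinary inference with no attention_mask passed (as in the FastChat loop above), the training-style mask preparation is skipped, which presumably lets the regular inference mask path build a mask that matches the cached key/value length.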
cauwulixuan commented 7 months ago

This inference scenario seems fairly common. I modified the file locally and it does work now, thank you.

That only works because I have already downloaded the Hugging Face model locally. If I load it directly with from_pretrained("IEITYuan/Yuan2-2B-hf"), there is no way to patch it by hand, right? Will the official files be updated later?
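One possible workaround until the official files are updated (a sketch only: snapshot_download is a standard huggingface_hub call, and the yuan_hf_model.py file name comes from the traceback above) is to download the checkpoint once, patch the local copy, and load from that path:

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM

# download (or reuse the cached copy of) the model repo and get its local path
local_dir = snapshot_download("IEITYuan/Yuan2-2B-hf")
# ... edit yuan_hf_model.py inside local_dir as suggested in the previous comment ...
model = AutoModelForCausalLM.from_pretrained(local_dir, trust_remote_code=True)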

Shawn-IEITSystems commented 7 months ago

@ljg-ieisystem