QwenLM / Qwen

The official repo of Qwen (通义千问), the chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] The latest Qwen code has a causal_mask bug: with a kv_cache present, a multi-token input produces different results #979

Closed hzjane closed 7 months ago

hzjane commented 8 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

These are the 7 tokens [14880, 107485, 103929, 113272, 100178, 271, 18493] that the latest Qwen-14B-Chat generates one at a time for a given input when .generate() is called directly. If, at the third step, where the input is [102939], I manually set the input to test (5 tokens) and call the model, the result differs from the single-token run: where 271 and 18493 are expected, I get 3837 and 100345 instead.

  test = torch.tensor([[[103929, 113272, 100178, 271, 18493]]])
  model_inputs['input_ids'] = test
  outputs = self(
      **model_inputs,
      return_dict=True,
      output_attentions=output_attentions,
      output_hidden_states=output_hidden_states,
  )
#new-input_ids tensor([[[103929, 113272, 100178,    271,  18493]]])
#next_tokens:tensor([[[113272, 100178,   3837, 100345,  99699]]])
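
For reference, here is a minimal, self-contained sketch of the comparison above. It is not the exact script used in this report (which edits transformers internals), and the checkpoint name and prompt are placeholders. It generates seven tokens one at a time with the kv_cache, then replays the same prefix and feeds the last five of those tokens in a single forward call, and compares the predictions.

```python
# Hedged reproduction sketch: checkpoint name and prompt are assumptions, and the
# token count mirrors the example above rather than the exact original run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen-14B-Chat"  # any Qwen checkpoint using the affected modeling_qwen.py
tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True).eval()

prompt_ids = tok("你好", return_tensors="pt").input_ids

with torch.no_grad():
    # Pass 1: greedy decoding one token at a time, carrying the kv_cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    tokens = [out.logits[:, -1:].argmax(-1)]                    # t0
    for _ in range(6):                                          # t1 .. t6
        out = model(tokens[-1], past_key_values=past, use_cache=True)
        past = out.past_key_values
        tokens.append(out.logits[:, -1:].argmax(-1))
    stepwise = torch.cat(tokens, dim=-1)                        # shape (1, 7)

    # Pass 2: same prefix, but at the third step feed five tokens in ONE call
    # while a kv_cache is present, the situation that triggers the bug.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    for i in range(2):                                          # replay t0 and t1 singly
        out = model(stepwise[:, i:i + 1], past_key_values=past, use_cache=True)
        past = out.past_key_values
    out = model(stepwise[:, 2:7], past_key_values=past, use_cache=True)
    batched = out.logits.argmax(-1)                             # predictions for the 5 positions

# Without the bug, the first four batched predictions equal tokens t3 .. t6.
print("step-by-step:", stepwise[0, 3:7].tolist())
print("multi-token :", batched[0, :-1].tolist())
```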

Expected Behavior

The expected behavior is that, apart from the newly generated token, all outputs should be identical. Looking at the commit history of Qwen's modeling_qwen.py, the change to the causal_mask is the culprit: it makes causal_mask always None whenever the query length differs from the key length, so feeding several tokens at once while a kv_cache is present gives a result that does not match the expected one. After changing the current condition and adding a slice of the causal_mask, I get the correct output.

new-input_ids tensor([[[103929, 113272, 100178,    271,  18493]]])
next_tokens:tensor([[[113272, 100178,    271,  18493,  99699]]])

Here is the diff of my change:

501,507c501,503
<             key_size = key[0].size(2) if self.use_cache_quantization else key.size(1)
<             if query.size(1) == key_size:
<                 causal_mask = torch.tril(
<                     torch.ones((key_size, key_size), dtype=torch.bool, device=query.device)
<                 ).view(1, 1, key_size, key_size)
<             else:
<                 causal_mask = None
---
>             causal_mask = torch.tril(
>                 torch.ones((key_size, key_size), dtype=torch.bool, device=query.device)
>             ).view(1, 1, key_size, key_size)
519a516,520
>
>             causal_mask = causal_mask[
>                 :, :, key.size(-2) - query.size(-2): key.size(-2), :key.size(-2)
>             ]
>
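
To make the slicing in the second hunk concrete, here is a small standalone illustration (this is not Qwen code; the key_len and q_len values are made-up examples) of what that slice does: with a kv_cache, the query covers only the last q_len positions, so the full (key_len, key_len) lower-triangular mask has to be cut down to those rows over all key columns.

```python
# Standalone illustration of the causal_mask slice added in the diff above; sizes are examples.
import torch

key_len, q_len = 8, 5          # e.g. 3 cached positions plus 5 new tokens fed at once
full = torch.tril(
    torch.ones((key_len, key_len), dtype=torch.bool)
).view(1, 1, key_len, key_len)

# Keep only the rows for the current query tokens, over all key positions,
# mirroring causal_mask[:, :, key_len - q_len:key_len, :key_len] in the diff.
causal_mask = full[:, :, key_len - q_len:key_len, :key_len]    # shape (1, 1, 5, 8)

print(causal_mask[0, 0].long())
# tensor([[1, 1, 1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 1, 0, 0, 0],
#         [1, 1, 1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1]])
```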

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

jklj077 commented 8 months ago

I don't quite follow this example. Isn't [14880, 107485, 103929, 113272, 100178, 271, 18493] already output tokens? Inside generate, each generation step is encapsulated and not exposed externally, so how did you change the input of a single step? By default, the input of a single step has length 1, but your single-step input appears to have length 5.

hzjane (Contributor) commented 8 months ago

I made the modification here: https://github.com/huggingface/transformers/blob/v4.36.2/src/transformers/generation/utils.py#L2576

  # Inserted into the per-step generation loop of transformers v4.36.2 generation/utils.py:
  model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  test = torch.tensor([[[103929, 113272, 100178, 271, 18493]]])
  model_inputs['input_ids'] = test
  outputs = self(
      **model_inputs,
      return_dict=True,
      output_attentions=output_attentions,
      output_hidden_states=output_hidden_states,
  )

In other words, I take the tokens that Qwen normally generates one by one (split across two runs), and in the second run feed several of the tokens produced by the first run in a single call, and I get a different result. Before this commit there was no problem; the cause is that your change to the causal_mask means an input of n tokens is no longer supported. However, Hugging Face's assisted generation does feed n tokens at once, so this breaks it. I have tested llama, baichuan, and chatglm, and none of them have this problem.
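
For context, a hedged sketch of the assisted-generation call that exercises this path; the checkpoint names are placeholders (any draft model sharing Qwen's tokenizer would do), not something from the original report. The draft model proposes several tokens and the main model verifies them in one forward pass against its populated kv_cache, which is exactly the multi-token-plus-cache case described above.

```python
# Sketch only: both checkpoint names are assumptions, not taken from this issue.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen-14B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B-Chat", trust_remote_code=True)
draft = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)

inputs = tok("你好", return_tensors="pt")
# assistant_model turns on assisted generation: the draft proposes candidate tokens and
# the main model scores all of them in a single forward call while its kv_cache is populated.
out = model.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```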

hzjane commented 8 months ago

Also, looking at how llama uses the causal_mask (https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L735), it should be applied whenever q_len > 1. The change below is perhaps the simplest fix: broaden the condition for using the causal_mask so that it is not applied only at the first (prefill) step.

502c502
<             if query.size(1) == key_size:
---
>             if query.size(1) > 1:
505a506,508
>                 causal_mask = causal_mask[
>                     :, :, key.size(-2) - query.size(-2): key.size(-2), :key.size(-2)
>                 ]
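
A quick standalone check (not Qwen code; the sizes are made-up) of why the single-token decode path gets away with causal_mask = None while q_len > 1 does not: with one new query token, the sliced mask row is all ones and masking changes nothing, whereas with several new tokens the staircase pattern is genuinely required.

```python
# Standalone check supporting the q_len > 1 condition above; sizes are examples.
import torch

def sliced_causal_mask(key_len: int, q_len: int) -> torch.Tensor:
    """Full lower-triangular mask over key_len positions, cut to the last q_len query rows."""
    full = torch.tril(torch.ones(key_len, key_len, dtype=torch.bool))
    return full[key_len - q_len:key_len, :key_len]

print(sliced_causal_mask(8, 1).long())  # tensor([[1, 1, 1, 1, 1, 1, 1, 1]]): masking is a no-op
print(sliced_causal_mask(8, 5).long())  # staircase pattern: the mask is genuinely needed
```
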
wells-Qiang-Chen commented 8 months ago

Could you send me a copy of the complete modified code? I have noticed that the model output is unstable and would like to check whether this fix helps.

jklj077 commented 7 months ago

With the recent Qwen1.5 release, the model code has been integrated into the transformers package, following the library's established practices. As a result, transformers now supports Qwen1.5 models natively in most scenarios, including assisted generation (assistant_generate), without any additional configuration.

For more information about this integration, discussions on how to leverage Qwen1.5 within the transformers environment, and for updates on community feedback and enhancements, please visit the official Qwen1.5 GitHub repository at https://github.com/QwenLM/Qwen1.5.