QwenLM / Qwen

The official repo of Qwen (通义千问), the chat & pretrained large language model proposed by Alibaba Cloud.

[BUG] When the model's forward function receives an attention_mask with attention_mask[i, 0] == 0, all logits output for sequence i are NaN #1268

Closed · leileqiTHU closed this issue 1 week ago

leileqiTHU commented 1 month ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

When calling qwen-7b-chat's forward function, if the first element of attention_mask is 0, the output logits are all NaN.

When attention_mask is all ones, the behavior is normal:

[screenshot: attention_mask and model output]

When the first position of attention_mask is 0, the output is all NaN:

[screenshot: attention_mask and model output]

I also tested that if the first position is not 0 but zeros appear at other positions, the model's forward function does not output NaN.

Expected Behavior

The expected behavior is that the forward function never outputs NaN, regardless of the attention_mask.

Steps To Reproduce

As shown in the screenshots in the "Current Behavior" section, simply set the first position of attention_mask to 0 (mimicking left-padding behavior) and call the model's forward function; the returned logits will be all NaN.
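
For reference, a minimal repro sketch of the steps above (the checkpoint name, prompt, and loading options are assumptions, not taken from the original screenshots):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen-7B-Chat"  # assumed checkpoint; adjust to a local path if needed

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, trust_remote_code=True, device_map="auto"
).eval()

inputs = tokenizer("hello", return_tensors="pt").to(model.device)

# All-ones attention_mask: logits are finite.
with torch.no_grad():
    ok = model(**inputs)
print(torch.isnan(ok.logits).any())   # tensor(False)

# Zero out the first position, as left padding would.
inputs["attention_mask"][:, 0] = 0
with torch.no_grad():
    bad = model(**inputs)
print(torch.isnan(bad.logits).any())  # tensor(True) on Qwen-7B-Chat per this report
```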

Environment

- OS: Ubuntu 20.04.6 LTS (Focal Fossa)
- Python: 3.11.4
- Transformers: 4.39.3
- PyTorch: 2.1.2
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1

Anything else?

No response

jklj077 commented 1 month ago

Hi, I'm not sure I understand your use case. May I know what results you were expecting? You have literally prevented the initial position from attending even to itself, so it should be expected that the model does not know what the next token would be.

leileqiTHU commented 1 month ago

Yeah, sorry that I may not have made it clear.

I was trying to call model.forward directly rather than model.generate, in order to observe the model's behavior in the forward pass. My inputs are of different lengths, so I have to pad them to the same length. I used left padding, prepending `<endoftext>` pad tokens. In my opinion, those pad tokens should not be attended to, and attention_mask exists exactly for this scenario: setting those positions to 0 so the model won't attend to the pad tokens during the forward pass. However, I got all-NaN logits, which confused me. When I did not pass the attention_mask parameter at all, there were no NaN values in the logits, which is what I expected, so I inferred that the problem might lie with the attention_mask. To narrow it down further, I tried different attention_masks and finally found that if the first position is set to 0 (so the model won't attend to the first token, which is a pad token), the return value of model.forward, i.e. the logits, is all NaN.

Also, I tried the Qwen1.5-7B-Chat model, and it does not have this problem: even if I set the attention_mask of the first position to 0, the output is still free of NaN values. So I suspect this may be a problem specific to Qwen-7B-Chat.
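
A sketch of that comparison, assuming the public Hugging Face checkpoint names; the results in the comments are what this thread reports, not something guaranteed for every environment or attention backend:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def nan_when_first_position_masked(model_id: str, **kwargs) -> bool:
    """Return True if masking the first position produces NaN logits."""
    tok = AutoTokenizer.from_pretrained(model_id, **kwargs)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", **kwargs).eval()
    enc = tok("hello", return_tensors="pt").to(model.device)
    enc["attention_mask"][:, 0] = 0  # simulate a left-padded first token
    with torch.no_grad():
        logits = model(**enc).logits
    return bool(torch.isnan(logits).any())

print(nan_when_first_position_masked("Qwen/Qwen-7B-Chat", trust_remote_code=True))  # True (this report)
print(nan_when_first_position_masked("Qwen/Qwen1.5-7B-Chat"))                       # False (this report)
```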

That said, I may have made a mistake somewhere; please let me know if I did.

leileqiTHU commented 1 month ago

Also, if the masked tokens at the left positions cannot know the next token because they are prevented from attending to themselves, why are the logits of the other, un-masked positions (the right positions) also NaN? Did I get something wrong?
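
Not an authoritative answer, but one plausible mechanism under a standard additive-mask attention implementation (an assumption about how the remote modeling code behaves): if position 0 may not attend to anything, its softmax row is taken over all -inf scores and comes out NaN; in the next layer that NaN hidden state poisons every other position's scores, since NaN plus -inf is still NaN. A toy sketch:

```python
import torch

def attn(x, mask):
    """Single-head self-attention with an additive mask (toy, no projections)."""
    scores = x @ x.T / x.shape[-1] ** 0.5 + mask
    return scores.softmax(dim=-1) @ x

torch.manual_seed(0)
seq, dim = 4, 8
x = torch.randn(seq, dim)

# Causal mask plus "key position 0 is padding": column 0 is fully -inf,
# so position 0 cannot attend even to itself.
mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
mask[:, 0] = float("-inf")

h1 = attn(x, mask)
print(torch.isnan(h1[0]).all(), torch.isnan(h1[1:]).any())
# tensor(True) tensor(False): after one layer only position 0 is NaN
# (its softmax row was all -inf).

h2 = attn(h1, mask)
print(torch.isnan(h2).all())
# tensor(True): in the next layer the NaN state at position 0 makes every
# score row contain NaN, so softmax and every position's output become NaN.
```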

jklj077 commented 1 month ago

Hi, after reading through your comments, and if I understood correctly, Qwen1.5 was working as you would expect. I would suggest just using Qwen1.5.

P.S.: Investigating the original issue is more complicated than it appeared. Was flash attention enabled? Were you following the instructions in README to do batch inference?

cageyoko commented 1 week ago

I ran into the same problem:

  1. Without flash-attn, batch_size=1 gives normal results, but with batch_size > 1 the padded samples produce NaN outputs after the forward pass.
  2. Installing flash-attn fixed it (see the loading sketch below).
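
For completeness, a sketch of the workaround in item 2. As far as I can tell the switch lives in Qwen-1.0's remote modeling code (config field `use_flash_attn`), so treat the exact kwarg as an assumption rather than a documented transformers argument:

```python
# Prerequisite (shell): pip install flash-attn --no-build-isolation
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    device_map="auto",
    use_flash_attn=True,  # assumed kwarg; Qwen-1.0's config defaults this to "auto"
).eval()
```
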
jklj077 commented 1 week ago

Hi, Qwen1.0 models and code will not be updated anymore. Please try Qwen2.0 instead.