Closed · leileqiTHU closed this issue 1 week ago
Hi, I'm not sure I understand your use case. May I know what results you were expecting? You have prevented the initial position from attending to itself, so it is expected that the model cannot know what the next token should be.
Yeah, sorry, I may not have made it clear.
I was calling model.forward directly rather than model.generate, in order to observe the model's behavior in the forward pass. My inputs have different lengths, so I have to pad them to the same length. I used left padding, prepending pad `<endoftext>` tokens. In my opinion, those pad tokens should not be attended to, and attention_mask is used for exactly this: setting those positions to 0 so the model won't attend to the pad tokens in the forward pass. However, I got all-NaN logits, which confused me.

I then tried not passing the attention_mask parameter at all, and there were no NaN values in the logits, which is what I expected. So I inferred that the problem may be the attention_mask. To locate it further, I tried different attention_masks and finally found that if I set the first position to 0 (so that the model won't attend to the first token, which is a pad token), the return values of model.forward, i.e. the logits, are all NaN.
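For what it's worth, the first-position NaN can be reproduced without the model at all: under a causal mask, the first token can only attend to itself, so masking position 0 leaves its attention row with no valid entries, and a softmax over a row of all `-inf` is NaN. A minimal numpy sketch (the additive-mask convention here is my assumption about how transformer attention is typically implemented, not Qwen's actual code):

```python
import numpy as np

def masked_softmax(scores, mask):
    # mask: 1 = may attend, 0 = may not attend; disallowed positions get -inf
    scores = np.where(mask.astype(bool), scores, -np.inf)
    # For a fully masked row, max is -inf and (-inf) - (-inf) = NaN
    # (numpy emits a RuntimeWarning here, which is harmless for the demo).
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

T = 4
scores = np.zeros((T, T))                 # dummy attention scores
causal = np.tril(np.ones((T, T)))         # lower-triangular causal mask
padding = np.array([0, 1, 1, 1])          # first token is a left pad
mask = causal * padding                   # row 0 ends up all zeros
attn = masked_softmax(scores, mask)
print(attn[0])   # all NaN: position 0 has nothing left to attend to
print(attn[1:])  # finite: every other row still has at least one valid key
```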
Also, I tried the Qwen1.5-7B-Chat model, and it does not have this problem: even if I set the attention_mask of the first position to 0, the output is still free of NaN values. So I suspect this may be a problem specific to Qwen-7B-Chat.
That said, I may have made a mistake somewhere; please let me know if I did.
Also, if the masked tokens in the left positions should not know what the next token is, because they are prevented from attending to themselves, why are the logits of the other, un-masked positions (the right positions) also NaN? Did I get it wrong?
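My own guess at an answer (not verified against Qwen's code): once position 0's attention row becomes NaN, its hidden state is NaN, and NaN then leaks into every later position, because under IEEE-754 even a zero attention weight times NaN is still NaN. A toy illustration of that propagation:

```python
import numpy as np

# Suppose position 0's value vector is already NaN after the broken softmax.
values = np.array([[np.nan, np.nan],   # pad token: NaN hidden state
                   [1.0, 2.0],
                   [3.0, 4.0]])
# A later position puts zero weight on the pad token...
weights = np.array([0.0, 0.5, 0.5])
out = weights @ values
print(out)  # still all NaN: 0.0 * nan is nan, and nan + anything is nan
```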
Hi, after reading through your comments, and if I understood correctly, Qwen1.5 was working as you would expect. I would suggest just using Qwen1.5.
P.S.: Investigating the original issue is more complicated than it appeared. Was flash attention enabled? Were you following the instructions in README to do batch inference?
I ran into the same problem.
Hi, Qwen1.0 models and code will not be updated anymore. Please try Qwen2.0 instead.
Is there an existing issue / discussion for this?
Is there an existing answer for this in the FAQ?
Current Behavior
When calling Qwen-7B-Chat's forward function, if the first element of attention_mask is 0, the output is all NaN.
When attention_mask is all 1s, the output is normal.
attention_mask / output: (screenshot)
When the first position of attention_mask is 0, the output is all NaN values.
attention_mask / output: (screenshot)
I also tested that if the first position is not 0 but 0s appear at other positions, the model's forward function does not output NaN values.
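That last observation is consistent with a row-wise softmax view of the bug: when only non-initial positions are masked, every query position still has at least one un-masked key (position 0 at minimum), so no attention row is fully masked and no NaN appears. A numpy check (the mask convention is my assumption, not Qwen's actual implementation):

```python
import numpy as np

T = 4
scores = np.zeros((T, T))
causal = np.tril(np.ones((T, T)))
padding = np.array([1, 0, 1, 1])          # a 0 somewhere other than position 0
mask = (causal * padding).astype(bool)

s = np.where(mask, scores, -np.inf)
e = np.exp(s - s.max(axis=-1, keepdims=True))
attn = e / e.sum(axis=-1, keepdims=True)
print(np.isnan(attn).any())  # False: every row keeps at least one valid key
```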
Expected Behavior
The expected behavior is that the forward function never outputs NaN values, regardless of the attention_mask.
Steps To Reproduce
As shown in the screenshots in the "Current Behavior" section: set the first position of attention_mask to 0 (mimicking the padding behavior of left-padding), then call the model's forward function; the output will be all NaN.
Environment
Anything else?
No response