Why is no mask needed for attention during pretraining?

DLLXW / baby-llama2-chinese

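A repository for pretraining from scratch and then SFT-ing a small-parameter Chinese LLaMa2; it runs on a single 24G GPU and yields a chat-llama2 with basic Chinese question-answering ability.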

MIT License

2.34k stars 288 forks

Closed LLH1818 closed 9 months ago

LLH1818 commented 9 months ago

A beginner's question: why does the flash version of attention have no mask, and why does the manual version compute score + mask? Shouldn't it be score * mask?
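To make the question concrete, here is a minimal pure-Python sketch of the two masking styles being compared. It is not the repository's code. It assumes the manual path adds a causal mask to the scores before the softmax, with `0` at allowed positions and `-inf` at future positions, and every name in it (`softmax`, `causal_additive_mask`, `causal_binary_mask`, the toy `scores`) is hypothetical. The flash path is not reproduced; fused kernels such as PyTorch's `F.scaled_dot_product_attention` can take an `is_causal=True` flag and apply the same causal masking internally, which may be why no explicit mask tensor shows up there.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_additive_mask(n):
    # 0 where key j <= query i (visible), -inf where j > i (future token).
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

def causal_binary_mask(n):
    # 1 where key j <= query i (visible), 0 where j > i (future token).
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]

# Toy pre-softmax attention scores for a sequence of 3 tokens (made-up values).
scores = [
    [0.5, -1.0, 2.0],
    [1.5, 0.3, -0.7],
    [-0.2, 0.8, 0.1],
]
n = len(scores)
add_mask = causal_additive_mask(n)
mul_mask = causal_binary_mask(n)

# score + mask: future positions become -inf, and exp(-inf) = 0 after softmax.
weights_add = [
    softmax([s + m for s, m in zip(s_row, m_row)])
    for s_row, m_row in zip(scores, add_mask)
]

# score * mask: future positions become 0, and exp(0) = 1, so they still get weight.
weights_mul = [
    softmax([s * m for s, m in zip(s_row, m_row)])
    for s_row, m_row in zip(scores, mul_mask)
]

print("additive mask, query 0:      ", weights_add[0])
print("multiplicative mask, query 0:", weights_mul[0])
```

In this sketch the additive mask gives future tokens exactly zero attention weight, while multiplying the pre-softmax scores by a 0/1 mask leaves them with nonzero weight. A multiplicative mask would only behave as intended if applied after the exponentiation and followed by renormalization.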