OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Flash Attention regurgitates repeated tokens - seq2seq #1752


ArtanisTheOne commented 3 months ago

I'm having generation issues with NMT models trained with OpenNMT-py, including models trained with OpenNMT-py versions that predate flash attention and one I'm currently training with the most recent version, which includes flash attention. The models were converted using onmt_release_model with storage quantization set to int8.

The issue happens when setting flash_attention=True while creating the ctranslate2.Translator object. The GPU is an RTX 3090.
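
For reference, a minimal sketch of the setup described above. The model directory, input tokens, and the exact conversion flags are placeholders and approximations, not taken verbatim from my runs:

```python
import ctranslate2

# Conversion step (approximate flags), run beforehand with OpenNMT-py:
#   onmt_release_model --model model.pt --output ende_ct2 \
#       --format ctranslate2 --quant_type int8
translator = ctranslate2.Translator(
    "ende_ct2",            # placeholder: directory produced by the conversion
    device="cuda",         # RTX 3090 in this case
    compute_type="int8",   # matches the int8 storage quantization
    flash_attention=True,  # enabling this triggers the repeated tokens
)

# translate_batch expects pre-tokenized input (e.g. SentencePiece pieces)
results = translator.translate_batch([["▁Hello", "▁world", "."]])
print(results[0].hypotheses[0])
```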

I don't know if this is an architecture issue or something to do with the conversion process from OpenNMT-py.

Examples of some outputs on the Flores200 benchmark:

sss of of of of of of of of of of of
sss                                                     in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in in          in in in in in in   patients patients patients in in in in in in in  in in in in in   in in in in in in in in in in in in in in in in in in in in in in patients in in in in in      in in in in in in in in patients patients patients patients patients patients patients patients patients patients patients patients patients in in patients patients patients patients patients in countries countries in in in in in in in in in             in in in in in in in in in in     in in in in in
ssss
ssmmmmmmmm
ss
__opt_src_en__opt_src_en__opt_src_en
sss
sss                       of of of of of of of of of                         of of
sss                                                tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax                  tax tax tax tax tax tax tax tax tax tax tax tax tax     tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax tax

sssmmmmmmmmmmmmmmmmmmm
minhthuc2502 commented 3 months ago

It should work with both old and new versions of OpenNMT-py. I don't have enough information to help you, sorry.

FYI, I will disable the flash attention feature in a future CTranslate2 version because it does not improve inference performance much and makes the package quite a lot heavier.
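
In the meantime, flash_attention defaults to False in ctranslate2.Translator, so simply omitting the flag avoids the repetition reported above. A minimal sketch, with a placeholder model path:

```python
import ctranslate2

# "ende_ct2" is a placeholder for the converted model directory.
# Omitting flash_attention (default False) uses the regular attention
# kernels, which do not show the repeated-token behavior in this report.
translator = ctranslate2.Translator("ende_ct2", device="cuda", compute_type="int8")
```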