ArtanisTheOne opened 3 months ago
Having some generation issues with NMT models trained with OpenNMT-py. This includes models trained with OpenNMT-py versions from before flash attention existed, as well as one I'm currently training with the most recent version, which does include flash attention. The models were converted using onmt_release_model with storage quantization set to int8.
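For reference, the conversion step amounts to roughly the following. This is a minimal sketch: I ran the onmt_release_model CLI, and the CTranslate2 converter API shown here should be its equivalent; the checkpoint and output names are placeholders.

```python
from ctranslate2.converters import OpenNMTPyConverter

# Sketch of the conversion (placeholder paths). onmt_release_model was the
# actual entry point; OpenNMTPyConverter should produce the same int8 model.
converter = OpenNMTPyConverter("model_step_100000.pt")
converter.convert("ct2_model", quantization="int8")
```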
The issue happens when I turn flash_attention=True on while creating the ctranslate2.Translator object. The GPU is an RTX 3090.

I don't know if this is an architecture issue or something to do with the conversion process from OpenNMT-py.
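Roughly how the translator is created (a minimal sketch; the model directory, compute_type, and the tokenized input are placeholder assumptions):

```python
import ctranslate2

# Placeholder model directory and tokens. The degraded generations appear
# only when flash_attention=True is passed; leaving it at the default
# (False) produces normal translations.
translator = ctranslate2.Translator(
    "ct2_model",
    device="cuda",           # RTX 3090
    compute_type="int8",     # matches the int8 storage quantization
    flash_attention=True,
)
results = translator.translate_batch([["▁Hello", "▁world", "."]])
print(results[0].hypotheses[0])
```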
Examples of some outputs given the Flores200 benchmark
A maintainer replied:

It should work with either an old or a new version of OpenNMT-py. I don't have enough information to help you, sorry.

FYI, I will disable the flash attention feature in a future CTranslate2 version, because it does not improve inference performance much and makes the package quite a lot heavier.