McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

Eager/SDPA attention gives lower results than Flash Attention in the SimCSE stage #144

Open ThonyPan opened 1 month ago

ThonyPan commented 1 month ago

Hi @vaibhavad,

I tried to reproduce the SimCSE stage of the framework. With flash attention, the results are as good as reported. However, when training with eager or SDPA attention, the results drop significantly. What might be the reason?

In your code, the warning reads: "LLM2Vec models were trained with flash attention enabled. For optimal performance, please install the flash_attn package with pip install flash-attn --no-build-isolation." Does that mean that if I also train the MNTP stage with eager/SDPA attention, the performance would be on par with flash attention?
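For context, here is a minimal sketch of how I switch the attention implementation at load time. It assumes that `LLM2Vec.from_pretrained` forwards extra keyword arguments such as `attn_implementation` to the underlying `transformers` `from_pretrained` call; the checkpoint names are only examples, not the exact ones from my run:

```python
# Minimal sketch (not the repo's training script): selecting the attention
# implementation when loading a model. Checkpoint names are examples; the
# attn_implementation kwarg is assumed to be forwarded to transformers.
import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",  # example MNTP checkpoint
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-unsup-simcse",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # swap to "sdpa" or "eager" to compare
)

# Quick sanity check that the loaded model produces embeddings.
embeddings = l2v.encode(["LLM2Vec turns decoder-only LLMs into text encoders."])
print(embeddings.shape)
```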

Thank you!

vaibhavad commented 1 day ago

Hi @ThonyPan,

Unfortunately we have not run experiments comparing different attention implementations, so I cannot say anything about performance differences. We chose flash attention as it is the fastest, and latency is crucial for both training and inference.
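If you want to get a feel for the speed difference yourself, here is a rough local comparison sketch (not an experiment we ran): load the same checkpoint under each attention implementation and time the same encode call. The checkpoint name is an example, and `attn_implementation` is assumed to be forwarded to `transformers`; `"flash_attention_2"` requires `pip install flash-attn --no-build-isolation`.

```python
# Rough latency comparison sketch (not from the repo): time the same encode
# call under each attention implementation on identical inputs.
import time
import torch
from llm2vec import LLM2Vec

sentences = ["LLM2Vec turns decoder-only LLMs into text encoders."] * 32

for impl in ["eager", "sdpa", "flash_attention_2"]:
    l2v = LLM2Vec.from_pretrained(
        "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",  # example checkpoint
        device_map="cuda",
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,  # assumed to be forwarded to transformers
    )
    start = time.time()
    l2v.encode(sentences)
    print(f"{impl}: {time.time() - start:.2f}s")

    # Free GPU memory before loading the next variant.
    del l2v
    torch.cuda.empty_cache()
```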