ThonyPan opened this issue 2 months ago
Hi @ThonyPan,
Unfortunately we have not run experiments comparing different attention implementations, so I cannot say anything about performance differences. We chose flash attention as it is the fastest, and latency is crucial for both training and inference.
Hi @ThonyPan,
Did you manage to reproduce the MNTP+SimCSE results? I have successfully reproduced Sheared-LLaMA-1.3B SimCSE, but my MNTP+SimCSE results are consistently lower than those reported in the paper. Could you share your training details?
Hi @vaibhavad,
I tried to reproduce the SimCSE stage of the framework. With flash attention, the results are as good as reported. However, when training with eager or SDPA attention, the results drop significantly. What might be the reason?
In your code, the warning reads: "LLM2Vec models were trained with flash attention enabled. For optimal performance, please install the `flash_attn` package with `pip install flash-attn --no-build-isolation`." Does that mean that if I also train the MNTP stage with eager/SDPA attention, the performance would be on par with flash attention? Thank you!