ThonyPan opened this issue 2 months ago
Hi @ThonyPan,
Unfortunately we have not run experiments comparing different attention implementations, so I cannot say anything about performance differences. We chose flash attention as it is the fastest, and latency is crucial for both training and inference.
Hi @ThonyPan,
Did you manage to reproduce the MNTP+SimCSE results? I have successfully reproduced Sheared-LLaMA-1.3B SimCSE, but my MNTP+SimCSE results are consistently lower than those reported in the paper. Could you share your training details?
Hi @vaibhavad,
I tried to reproduce the SimCSE stage of the framework. With flash attention, the results are as good as reported. However, when training with eager or SDPA attention, the results drop significantly. What might be the reason?
In your code, the warning reads: "LLM2Vec models were trained with flash attention enabled. For optimal performance, please install the `flash_attn` package with `pip install flash-attn --no-build-isolation`." Does that mean that if I also train the MNTP stage with eager/SDPA attention, the performance would be on par with flash attention? Thank you!