McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

Unable to Reproduce LLM2Vec Training Results Using GradCache on Echo Dataset #135

Open viet-data opened 1 month ago

viet-data commented 1 month ago

I have been attempting to reproduce the training results on the same Echo dataset. Due to hardware limitations, I had to reimplement the training process using GradCache.

Although my model code can load the LLM2Vec public checkpoint and perform inference correctly, I am unable to achieve comparable performance to LLM2Vec when training a bidirectional Mistral model (without MNTP and unsupervised SimCSE) using GradCache. My training used a batch size of 512 on the echo dataset and stopped after 750 iterations.

Specifically, on the STS tasks, I have not been able to exceed 75 on SICK-R and 65 on STS-12 (other tasks also show low performance, except for BIOSSES).

Has anyone else tried to train LLM2Vec with GradCache, or has anyone successfully reproduced the LLM2Vec results using the original code? Any insights or suggestions would be greatly appreciated.
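For concreteness, the gist of a GradCache-based contrastive step of this kind is sketched below. This is a simplified illustration, not the exact reimplementation: the `grad_cache` usage follows the luyug/GradCache README, and the in-batch-negative loss, the `encoder`/`optimizer` objects, and the chunk size are assumptions.

```python
# Illustrative sketch only (not the exact reimplementation): contrastive
# training with the grad_cache package (luyug/GradCache). Assumes `encoder`
# is a bidirectional Mistral wrapped so that forward(**tokenized_batch)
# returns pooled sentence embeddings, and `optimizer` is already built.
import torch
import torch.nn.functional as F
from grad_cache import GradCache

def info_nce_loss(q_reps, d_reps, temperature=0.05):
    # In-batch negatives: the i-th query is paired with the i-th positive.
    q_reps = F.normalize(q_reps, dim=-1)
    d_reps = F.normalize(d_reps, dim=-1)
    scores = q_reps @ d_reps.T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

gc = GradCache(
    models=[encoder, encoder],  # shared encoder for both sides of the pair
    chunk_sizes=32,             # sub-batch size that fits in GPU memory
    loss_fn=info_nce_loss,
)

# query_batch / doc_batch: tokenized inputs (dicts of tensors) for the
# full effective batch, e.g. 512 pairs.
optimizer.zero_grad()
loss = gc(query_batch, doc_batch)  # chunked forward, cached grads, backward
optimizer.step()
```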

vaibhavad commented 1 month ago

Hello @viet-data,

Were you able to reproduce the results with GradCache? If you are interested, we'd like to integrate GradCache into the LLM2Vec library.

viet-data commented 1 month ago

Hi @vaibhavad ,

I have successfully trained with GradCache, using a batch size of 128, and achieved results close to those reported in LLM2Vec. However, I'm curious about LLM2Vec's performance when scaling up the data. I haven't been able to improve performance with more training data, which might be due to the smaller batch size.

Could you share the LLM2Vec results when training on the full dataset? It would also be very useful if you could integrate GradCache into LLM2Vec so that we can train with fewer GPUs. Thank you.

stefanhgm commented 1 month ago

Hi @viet-data,

I reproduced the Llama 3 supervised version, trained for 1000 steps on the MNTP task and 1000 steps on the E5 dataset (according to the original LLM2Vec training configs). I am currently running the full MTEB evaluation, but the first results look very similar to the ones reported on HuggingFace for the model.

I am currently training a Llama 3.1 version with the same training recipe.
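For anyone who wants to sanity-check a reproduced model against the released one, the public supervised Llama 3 checkpoint can be loaded roughly as in the library README; a locally reproduced adapter can presumably be swapped in via `peft_model_name_or_path`. The cosine-similarity probe at the end is just an illustrative check.

```python
# Roughly follows the LLM2Vec README usage; the similarity check at the end
# is only an illustrative sanity probe.
import torch
import torch.nn.functional as F
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

# Queries are encoded as [instruction, text] pairs; documents as plain strings.
instruction = "Given a web search query, retrieve relevant passages that answer the query:"
q_reps = l2v.encode([[instruction, "how much protein should a female eat"]])
d_reps = l2v.encode(["As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day."])
print(F.cosine_similarity(q_reps, d_reps))
```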

viet-data commented 1 month ago

@stefanhgm Thanks so much for sharing! I agree, LLM2Vec seems quite reproducible. Excited to see your results with Llama 3.1!

stefanhgm commented 1 month ago

Currently, the evaluation of the Llama 3.1 version on MTEB hangs on a task that repeatedly processes 391 batches. It has been repeating this for over a day now. I think it is the DBPedia task, as CQADupstackWordpressRetrieval and ClimateFEVER were the last tasks to complete and DBPedia should come next.

@vaibhavad any chance you observed a similar behavior when evaluating on MTEB?

[Screenshot of the evaluation log, 2024-08-13 14:51]
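One way to narrow this down could be to run the suspected task on its own with the `mteb` package. Below is a rough sketch; the base model name, adapter path, thin wrapper class, and instruction handling are assumptions, and the repo's own evaluation script may be the better route.

```python
# Minimal sketch, not the repo's evaluation script: run only DBPedia to see
# whether that task is the one hanging. The base model name, adapter path,
# and the thin wrapper below are illustrative assumptions.
import torch
from mteb import MTEB
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    peft_model_name_or_path="path/to/reproduced-llama3.1-supervised-adapter",  # hypothetical
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

class L2VForMTEB:
    """Thin adapter exposing the encode() signature MTEB expects."""
    def __init__(self, model, instruction=""):
        self.model = model
        self.instruction = instruction

    def encode(self, sentences, batch_size=32, **kwargs):
        pairs = [[self.instruction, s] for s in sentences]
        return self.model.encode(pairs, batch_size=batch_size).float().cpu().numpy()

MTEB(tasks=["DBPedia"]).run(L2VForMTEB(l2v), output_folder="results/dbpedia-only")
```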