ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License

E5 dataset #17

Open wangskyGit opened 3 months ago

wangskyGit commented 3 months ago

Hello! This is awesome work, and the idea of using an LLM as the embedding model is amazing. More importantly, you really did it, and the performance is surprisingly good! I am wondering whether you plan to release the E5 synthetic dataset generated by GPT-4. Or, what would the performance be like if we only used the open datasets?

Muennighoff commented 3 months ago

Unfortunately, we are unable to release the E5 dataset. We have released the MEDI2 dataset. The table in the screenshot from the paper gives you an idea of their performance difference.

[Image: Screenshot 2024-03-20 at 10 31 36 AM]