Hello! This is awesome work, and the idea of using an LLM as the embedding model is amazing. More importantly, you actually did it, and the performance is surprisingly good!
I am wondering whether you plan to release the E5 synthetic dataset generated by GPT-4? And what would the performance be like if we only leverage the open datasets?
Unfortunately, we are unable to release the E5 dataset, but we have released the MEDI2 dataset. The table in the screenshot from the paper gives you an idea of the performance difference between the two.