Open MohammadAsadolahi opened 8 months ago
More research is needed to establish the exact benefit from document instructions. Here's the part from the paper that discusses it a bit:
Embedding dataset: We benchmark MEDI [143], a new version of MEDI with better negatives which we build and call MEDI2, and the E5 dataset [160]. While MEDI and MEDI2 always preface instructions with “Represent” (see e.g. Figure 10), the E5 dataset places no constraint on the instruction prefix (see e.g. Figure 11). Thus, when using the E5 dataset the “<|embed|>” formatting is critical to tell the model that it will be subject to the representation loss, not the generative loss (Figure 3). Further, MEDI and MEDI2 always contain instructions for both queries and documents, which we refer to as two-sided instructions. Meanwhile, the E5 dataset uses one-sided instructions for asymmetric datasets [104], whereby the documents receive no instructions, only the queries. The advantage of not using document instructions is that the document corpus can be encoded once and then cached and reused across a variety of tasks. During training on E5, symmetric tasks are also in a one-sided setting, but we still evaluate them in the two-sided format. This should not be a problem as the cosine similarity function we use during training is transitive: if sentence A with instruction is similar to sentence B without instruction, and sentence B without instruction is similar to sentence C with instruction, then we can confidently say that sentence A with instruction is also similar to sentence C with instruction. As depicted in Table 6, using the E5 dataset performs best by a wide margin. An inspection of samples suggests that this is likely due to its superior hard negatives and diversity of tasks generated by GPT-4 (Appendix N). For our final runs with the E5 dataset, we additionally add scientific data (§3.1).
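To make the one-sided setup concrete, here is a minimal sketch using the gritlm package (the prompt helper follows the pattern shown in the repo README; the instruction and example texts are made-up placeholders):

```python
from gritlm import GritLM

# mode="embedding" skips the generative head, since only representations are needed
model = GritLM("GritLM/GritLM-7B", torch_dtype="auto", mode="embedding")

def gritlm_instruction(instruction):
    # "<|embed|>" marks the input as subject to the representation loss
    return "<|user|>\n" + instruction + "\n<|embed|>\n" if instruction else "<|embed|>\n"

# One-sided instructions: only the query gets an instruction, so document
# embeddings can be computed once, cached, and reused across tasks.
q_rep = model.encode(
    ["How does one-sided instruction tuning work?"],
    instruction=gritlm_instruction("Given a question, retrieve a relevant passage"),
)
d_rep = model.encode(
    ["GRIT unifies embedding and generation in a single model ..."],
    instruction=gritlm_instruction(""),
)
```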
Okay, so use instructions for document retrieval, but only on the query embedding side, not the document embedding side. Thanks for the excerpt; I understand the one-sided instructions now.
Do you have any other recommendations for fine-tuning the existing GritLM model for embedding only?
Exactly. You can also use them for the document embedding side if you want, but the benefit is unclear to me. Would be interesting to know! If you are only interested in embedding performance, I would probably fine-tune from the embedding-only variant instead: https://huggingface.co/GritLM/emb_m7_nodes16_fast
Other than that, I'd follow the recommendations in the paper (bidirectional attention, large batch size, etc.).
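For completeness, loading that embedding-only variant should look roughly the same as loading the unified model (a sketch; mode="embedding" is the gritlm package's switch for dropping the generative head, and the example text is a placeholder):

```python
from gritlm import GritLM

# Embedding-only checkpoint linked above; no generative use intended
model = GritLM("GritLM/emb_m7_nodes16_fast", torch_dtype="auto", mode="embedding")

# "<|embed|>\n" is the bare prefix used when no instruction is given
embeddings = model.encode(["some text to embed"], instruction="<|embed|>\n")
```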
Hi, and thank you for sharing this amazing work.
I want to use GritLM to produce embeddings to be stored in a vector database for document retrieval, but there are many models on Hugging Face.
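As a rough illustration of the cache-once workflow described in the excerpt above (numpy standing in for a real vector database; the file names are hypothetical):

```python
import numpy as np

# Assume d_rep (num_docs x dim) was produced once, without instructions, via
# model.encode(documents, instruction=gritlm_instruction("")) as sketched
# earlier, then cached to disk; q_rep is the instructed query embedding.
d_rep = np.load("corpus_embeddings.npy")   # cached document embeddings
q_rep = np.load("query_embedding.npy")     # shape: (1, dim)

# Cosine-similarity search, i.e. what a vector database does on your behalf
d_norm = d_rep / np.linalg.norm(d_rep, axis=1, keepdims=True)
q_norm = q_rep / np.linalg.norm(q_rep, axis=1, keepdims=True)
scores = (q_norm @ d_norm.T)[0]
top_k = np.argsort(-scores)[:5]            # indices of the 5 most similar documents
```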