Open massi-ang opened 11 months ago
@massi-ang Cohere's Rerank 3 is multilingual and supports over 100 languages. It also has a 4k-token context length, whereas cross-encoder/ms-marco-MiniLM-L-12-v2 has a maximum context length of 512 tokens.
This project already supports a wide range of external APIs as options for the text-text models, so why not also let folks provide their Cohere API key and use Rerank 3?
I already did a quick comparison between Rerank 2 and ms-marco MiniLM, and Rerank 2 ranked the results much better: for my query about image generation regulations, Rerank 2 put the responses related to image generation at the top of the list, whereas the ms-marco MiniLM results did not include anything about image generation.
I could not find an open-source model with as much language support as Cohere Rerank 3.
If I add support for Cohere Rerank 3, would that resolve this issue @massi-ang?
While cross-encoders have shown better performance than cosine similarity scores on sentence embeddings, there are no multilingual cross-encoders, which makes this solution viable only for English. Experiments show that a cross-encoder trained on English does not capture semantic meaning in other languages, only word similarity.
This feature request suggests adding the option to use embedding models and a normalized vector similarity (such as cosine similarity) between pairs of passages to score the semantic relevance of passages to queries.
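
To make the proposal concrete, here is a minimal sketch of ranking passages by cosine similarity against a query embedding. It assumes the embeddings have already been produced by whichever (ideally multilingual) embedding model is configured. The function names `cosine_similarity` and `rank_passages` and the toy vectors are hypothetical and only for illustration; they are not part of this project's code.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs, most similar first."""
    scored = [
        (i, cosine_similarity(query_vec, vec))
        for i, vec in enumerate(passage_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors standing in for real embeddings (assumption).
query = [1.0, 0.0]
passages = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = rank_passages(query, passages)
print([index for index, _ in ranking])
```

Because the score depends only on the embedding vectors, the language coverage of this approach is whatever the chosen embedding model provides.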