KennethEnevoldsen / scandinavian-embedding-benchmark

A Scandinavian Benchmark for sentence embeddings
https://kennethenevoldsen.github.io/scandinavian-embedding-benchmark/
MIT License

add model voyageai/voyage-multilingual-2 #177

Closed (noterat closed 3 months ago)

noterat commented 3 months ago

Add benchmark for model voyageai/voyage-multilingual-2.

Huggingface modelcard: https://huggingface.co/voyageai/voyage-multilingual-2

Model info: https://blog.voyageai.com/2024/06/10/voyage-multilingual-2-multilingual-embedding-model/

KennethEnevoldsen commented 3 months ago

Thanks - will make sure to add it to the benchmark!

noterat commented 3 months ago

Let me know if you want me to add it as a contribution to the code. I looked through the repo and I guess it would be very similar to .../openai_models.py.

KennethEnevoldsen commented 3 months ago

@noterat I would be more than happy to accept a PR.

KennethEnevoldsen commented 3 months ago

@noterat I have added the model in #178. Once the models are done running it can be pushed to main and the leaderboard updated.

A bit of extra information:

The implementation follows that of MTEB (see the model implementations in MTEB). In the future, it will likely transition to using the MTEB implementation directly once the updates that we are currently working on for MTEB are finalized.
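For readers who have not seen this pattern, here is a minimal sketch of how an API-backed model is typically wrapped for an embedding benchmark. It assumes the benchmark only needs an object with an `encode` method that maps a list of sentences to one vector per sentence. The names `ApiEmbeddingModel`, `embed_fn` and `batch_size` are hypothetical, and this is not the code added in #178.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: takes a batch of texts and returns one vector per text.
# In practice this would call the provider's embedding endpoint.
EmbedFn = Callable[[List[str]], List[List[float]]]


class ApiEmbeddingModel:
    """Illustrative wrapper that exposes encode() on top of a batched API call."""

    def __init__(self, embed_fn: EmbedFn, batch_size: int = 128) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.embed_fn = embed_fn
        self.batch_size = batch_size

    def encode(self, sentences: Sequence[str], **kwargs) -> List[List[float]]:
        """Embed sentences in batches, preserving the input order."""
        embeddings: List[List[float]] = []
        for start in range(0, len(sentences), self.batch_size):
            batch = list(sentences[start:start + self.batch_size])
            embeddings.extend(self.embed_fn(batch))
        return embeddings
```

Passing the API call in as `embed_fn` keeps the sketch independent of any particular provider and makes it easy to test with a stub.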

noterat commented 3 months ago

Thank you for adding the model, it was helpful for me when choosing between voyageai and openai.

I will check out MTEB and try to add a future model as a PR.

KennethEnevoldsen commented 3 months ago

Out of curiosity, which one did you end up choosing and why? Just wondering whether there are other trade-offs that people consider when selecting a model that are not in the benchmark yet.

noterat commented 3 months ago

I ended up using VoyageAI because it looked like the language identification benchmark was what lowered its average score. My use case is creating embeddings for Swedish text, so I guess that benchmark isn't relevant for me. But I haven't read up on that benchmark, so it's a guess on my part.

Besides the model's capabilities, I also had to base my choice on the rate limits. My dataset is about 250 million tokens, so I wanted something that is cheap and fast. And honestly, this was the main driver. VoyageAI had better rate limits (300 requests/min & 1 M tokens/min) and the first 50 M tokens are free. Compare that to OpenAI, which had a limit of 3 M tokens per day - and that is in batch mode.

So I come from an operational perspective where price and speed are a large part of the puzzle.
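To put the rate limits above in perspective, here is a rough back-of-the-envelope calculation using only the figures quoted in this comment. It assumes the token limit is the only bottleneck (the 300 requests/min limit, retries, and network latency are ignored), so the results are lower bounds rather than measurements.

```python
MINUTES_PER_DAY = 24 * 60


def days_to_process(total_tokens: int, tokens_per_day: int) -> float:
    """Lower bound on processing time if the token limit is the only constraint."""
    return total_tokens / tokens_per_day


dataset_tokens = 250_000_000  # dataset size quoted above

# Limits quoted above, both expressed as tokens per day.
voyage_tokens_per_day = 1_000_000 * MINUTES_PER_DAY  # 1 M tokens/min
openai_tokens_per_day = 3_000_000                    # 3 M tokens/day (batch mode)

voyage_days = days_to_process(dataset_tokens, voyage_tokens_per_day)
openai_days = days_to_process(dataset_tokens, openai_tokens_per_day)

print(f"VoyageAI: about {voyage_days * 24:.1f} hours")
print(f"OpenAI:   about {openai_days:.1f} days")
```

Under these assumptions the difference is a few hours versus several weeks, which is why the rate limit dominated the decision.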

KennethEnevoldsen commented 3 months ago

Thanks, this was exactly what I was hoping for. Good to get some of the other important decision factors incorporated.

> My use case is creating embeddings on Swedish so I guess that benchmark isn't relevant for me.

The benchmark has a Swedish subsection. I see that language identification might not be as relevant if you know the target language (I might remove it in the future).
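As an illustration of how a single task can pull down an average, here is a small sketch that recomputes a mean score with one task excluded. The task names and scores are made up for the example and are not taken from the leaderboard, and `average_score` is a hypothetical helper, not part of the benchmark's code.

```python
from statistics import mean
from typing import Iterable, Mapping


def average_score(scores: Mapping[str, float], exclude: Iterable[str] = ()) -> float:
    """Mean score over all tasks except those listed in `exclude`."""
    excluded = set(exclude)
    kept = [score for task, score in scores.items() if task not in excluded]
    if not kept:
        raise ValueError("no tasks left after exclusion")
    return mean(kept)


# Hypothetical scores, for illustration only.
example_scores = {
    "retrieval": 0.60,
    "classification": 0.70,
    "language_identification": 0.20,
}

overall = average_score(example_scores)
without_lang_id = average_score(example_scores, exclude=["language_identification"])

print(f"All tasks:                 {overall:.2f}")
print(f"Without language ident.:   {without_lang_id:.2f}")
```

A comparison like this makes it easier to judge whether a model's ranking depends on a task that does not matter for a given use case.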

> Besides the model's capabilities, I also had to base my choice on the rate limits. My dataset is about 250 million tokens, so I wanted something that is cheap and fast. And honestly, this was the main driver. VoyageAI had better rate limits (300 requests/min & 1 M tokens/min) and the first 50 M tokens are free. Compare that to OpenAI, which had a limit of 3 M tokens per day - and that is in batch mode.

Yeah, my intuition is that ease of use matters notably more once the embedding quality is adequate. I will try to find the time to incorporate a table comparing the APIs into the interface as well.