StephanAkkerman / FluentAI

Automating language learning with the power of Artificial Intelligence. This repository presents FluentAI, a tool that combines Fluent Forever techniques with AI-driven automation. It streamlines the process of creating Anki flashcards, making language acquisition faster and more efficient.
https://akkerman.ai/FluentAI/
MIT License

Try out other small multi-lang embedding models #54

Closed StephanAkkerman closed 1 week ago

StephanAkkerman commented 2 weeks ago
  1. Description:

    • Problem: We are currently using FastText, which handles out-of-vocabulary (OOV) words well, but it is somewhat slow and dated (2015-2020). There might be better models available.

    • Solution: Check out the leaderboard: https://huggingface.co/spaces/mteb/leaderboard and try out models that aren't too large.

    • Prerequisites: [List any requirements or dependencies needed before starting.]

  2. Tasks:

    • Try out newer embedding models and evaluate them.
    • Add their speed to the eval overview
    • Run the eval and save results somewhere (model name, hyperparameters etc.)
    • Remove GloVe (and scipy) from the eval
    • Maybe add the models to the config
    • Do we need to know what scale the model uses for similarity?
    • Clean up the models after use to free up VRAM
  3. Additional context: [Add any other context or screenshots about the feature request here.]
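The tasks above (time each model, record results, free VRAM afterwards) could be wired together roughly as sketched below. This is a minimal, self-contained harness: `DummyModel` is a hypothetical stand-in for a real sentence-transformers checkpoint, so only the timing and cleanup pattern is illustrated, not any actual model.

```python
import gc
import time

import numpy as np


class DummyModel:
    """Hypothetical stand-in for a sentence-transformers model."""

    def encode(self, sentences):
        # Return one fixed-size vector per sentence, like model.encode() would.
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(sentences), 384))


def evaluate_model(model, sentences):
    """Encode the sentences and return (embeddings, elapsed seconds)."""
    start = time.perf_counter()
    embeddings = model.encode(sentences)
    elapsed = time.perf_counter() - start
    return embeddings, elapsed


sentences = ["hello world", "hallo wereld"]
model = DummyModel()
embeddings, seconds = evaluate_model(model, sentences)

# Record the results for the eval overview (model name, speed, etc.).
result = {"model": "dummy", "seconds": seconds, "dim": embeddings.shape[1]}
print(result)

# Clean up after use: drop the reference and collect, so memory can be
# reclaimed (on GPU you would also call torch.cuda.empty_cache()).
del model
gc.collect()
```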

StephanAkkerman commented 2 weeks ago

https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#multilingual-models

FastText supports many languages; preferably, we find a model that supports more than 5 languages. We can sacrifice some accuracy if the speed increases.

StephanAkkerman commented 2 weeks ago

https://huggingface.co/arkohut/jina-embeddings-v3: 2GB memory, rank 25
https://huggingface.co/HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1: 2GB, rank 32
https://huggingface.co/intfloat/multilingual-e5-large-instruct: 2GB, rank 48
https://huggingface.co/Alibaba-NLP/gte-multilingual-base: 1GB, rank 79
https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2: 1GB, rank 143
https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2: 0.5GB, rank 148
https://huggingface.co/intfloat/multilingual-e5-small: 0.5GB, rank 117 (best <250M model)

https://huggingface.co/BAAI/bge-m3: 2GB (low ranked)
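Adding these candidates to the config (one of the open tasks) could look like the sketch below. The dict layout and key names are assumptions, not the project's actual config format; the memory footprints and MTEB ranks are copied from the list above.

```python
# Hypothetical config entry for the candidate embedding models, keyed by
# Hugging Face model id, with approximate memory use (GB) and MTEB rank.
EMBEDDING_MODELS = {
    "arkohut/jina-embeddings-v3": {"memory_gb": 2, "mteb_rank": 25},
    "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1": {"memory_gb": 2, "mteb_rank": 32},
    "intfloat/multilingual-e5-large-instruct": {"memory_gb": 2, "mteb_rank": 48},
    "Alibaba-NLP/gte-multilingual-base": {"memory_gb": 1, "mteb_rank": 79},
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2": {"memory_gb": 1, "mteb_rank": 143},
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2": {"memory_gb": 0.5, "mteb_rank": 148},
    "intfloat/multilingual-e5-small": {"memory_gb": 0.5, "mteb_rank": 117},
}

# Iterate smallest-first when evaluating under a memory budget.
by_size = sorted(EMBEDDING_MODELS, key=lambda m: EMBEDDING_MODELS[m]["memory_gb"])
print(by_size[0])
```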

StephanAkkerman commented 2 weeks ago

FastText: Word vectors for 157 languages

StephanAkkerman commented 1 week ago

https://huggingface.co/intfloat/multilingual-e5-small: all scores are between 0.7 and 1.0, which means we need to find a method to rescale them to something more useful for us
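One simple option for the rescaling is min-max normalization from the observed band back to [0, 1]. The [0.7, 1.0] bounds below are taken from the comment above and would need verifying per model (this is a sketch, not the project's implementation).

```python
def rescale(score, low=0.7, high=1.0):
    """Map a similarity score from [low, high] to [0, 1], clamping outliers.

    The default bounds assume the observed multilingual-e5-small range.
    """
    scaled = (score - low) / (high - low)
    return min(1.0, max(0.0, scaled))


print(rescale(0.7), rescale(0.85), rescale(1.0))
```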