Closed StephanAkkerman closed 1 week ago
https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#multilingual-models
FastText supports many languages; ideally we find a model that supports more than 5 languages. We can trade some accuracy for speed.
- https://huggingface.co/arkohut/jina-embeddings-v3: 2GB memory, rank 25
- https://huggingface.co/HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1: 2GB, rank 32
- https://huggingface.co/intfloat/multilingual-e5-large-instruct: 2GB, rank 48
- https://huggingface.co/Alibaba-NLP/gte-multilingual-base: 1GB, rank 79
- https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2: 1GB, rank 143
- https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2: 0.5GB, rank 148
- https://huggingface.co/intfloat/multilingual-e5-small: 0.5GB, rank 117 (best <250M model)
https://huggingface.co/BAAI/bge-m3: 2GB (low ranked)
FastText: Word vectors for 157 languages
https://huggingface.co/intfloat/multilingual-e5-small: all scores fall between 0.7 and 1.0, which means we need a way to rescale them to something more useful for us
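One option for that rescaling is a simple min-max clamp from the observed [0.7, 1.0] band back onto [0, 1]; a sketch (the 0.7 lower bound is just the observation above, not a property of the model, so it would need to be tuned on real data):

```python
import numpy as np

def rescale(scores, lo=0.7, hi=1.0):
    """Map cosine similarities observed in [lo, hi] linearly onto [0, 1]."""
    s = np.asarray(scores, dtype=float)
    # Clip so anything outside the observed band saturates at 0 or 1.
    return np.clip((s - lo) / (hi - lo), 0.0, 1.0)

print(rescale([0.7, 0.85, 1.0]))  # [0.  0.5 1. ]
```

This keeps the relative ordering of scores intact and only stretches the useful band, so existing thresholds can be re-tuned on the rescaled values.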
Description:
Problem: We are currently using FastText, which is good for OOV words, but it's a bit slow and dated (2015-2020). There might be better models available.
Solution: Check out the leaderboard: https://huggingface.co/spaces/mteb/leaderboard and try out models that aren't too large.