StephanAkkerman / FluentAI

Automating language learning with the power of Artificial Intelligence. This repository presents FluentAI, a tool that combines Fluent Forever techniques with AI-driven automation. It streamlines the process of creating Anki flashcards, making language acquisition faster and more efficient.
https://akkerman.ai/FluentAI/
MIT License

Try out other small multi-lang embedding models #54

Closed StephanAkkerman closed 1 week ago

StephanAkkerman commented 2 weeks ago
  1. Description:

    • Problem: We are currently using FastText, which handles out-of-vocabulary (OOV) words well, but it is somewhat slow and dated (2015-2020). There might be better models available.

    • Solution: Check out the leaderboard: https://huggingface.co/spaces/mteb/leaderboard and try out models that aren't too large.

    • Prerequisites: [List any requirements or dependencies needed before starting.]

  2. Tasks:

    • Try out newer embedding models and evaluate them.
    • Add their speed to the eval overview
    • Run the eval and save results somewhere (model name, hyperparameters etc.)
    • Remove GloVe (and scipy) from the eval
    • Maybe add the models to the config
    • Do we need to know what scale the model uses for similarity?
    • Clean up the models after use to free up VRAM
  3. Additional context: [Add any other context or screenshots about the feature request here.]
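The tasks above (time each model, record results, free VRAM afterwards) could be wired together roughly as sketched below. This is a minimal, self-contained harness: `DummyModel` is a hypothetical stand-in for a real sentence-transformers checkpoint, so only the timing and cleanup pattern is illustrated, not any actual model.

```python
import gc
import time

import numpy as np


class DummyModel:
    """Hypothetical stand-in for a sentence-transformers model."""

    def encode(self, sentences):
        # Return one fixed-size vector per sentence, like model.encode() would.
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(sentences), 384))


def evaluate_model(model, sentences):
    """Encode the sentences and return (embeddings, elapsed seconds)."""
    start = time.perf_counter()
    embeddings = model.encode(sentences)
    elapsed = time.perf_counter() - start
    return embeddings, elapsed


sentences = ["hello world", "hallo wereld"]
model = DummyModel()
embeddings, seconds = evaluate_model(model, sentences)

# Record the results for the eval overview (model name, speed, etc.).
result = {"model": "dummy", "seconds": seconds, "dim": embeddings.shape[1]}
print(result)

# Clean up after use: drop the reference and collect, so memory can be
# reclaimed (on GPU you would also call torch.cuda.empty_cache()).
del model
gc.collect()
```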

StephanAkkerman commented 2 weeks ago

https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#multilingual-models

FastText supports many languages; preferably, we find a model that supports more than 5 languages. We can sacrifice some accuracy if the speed increases.

StephanAkkerman commented 2 weeks ago

https://huggingface.co/arkohut/jina-embeddings-v3: 2GB memory, rank 25
https://huggingface.co/HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1: 2GB, rank 32
https://huggingface.co/intfloat/multilingual-e5-large-instruct: 2GB, rank 48
https://huggingface.co/Alibaba-NLP/gte-multilingual-base: 1GB, rank 79
https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2: 1GB, rank 143
https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2: 0.5GB, rank 148
https://huggingface.co/intfloat/multilingual-e5-small: 0.5GB, rank 117 (best <250M model)

https://huggingface.co/BAAI/bge-m3: 2GB (low ranked)
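Adding these candidates to the config (one of the open tasks) could look like the sketch below. The dict layout and key names are assumptions, not the project's actual config format; the memory footprints and MTEB ranks are copied from the list above.

```python
# Hypothetical config entry for the candidate embedding models, keyed by
# Hugging Face model id, with approximate memory use (GB) and MTEB rank.
EMBEDDING_MODELS = {
    "arkohut/jina-embeddings-v3": {"memory_gb": 2, "mteb_rank": 25},
    "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1": {"memory_gb": 2, "mteb_rank": 32},
    "intfloat/multilingual-e5-large-instruct": {"memory_gb": 2, "mteb_rank": 48},
    "Alibaba-NLP/gte-multilingual-base": {"memory_gb": 1, "mteb_rank": 79},
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2": {"memory_gb": 1, "mteb_rank": 143},
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2": {"memory_gb": 0.5, "mteb_rank": 148},
    "intfloat/multilingual-e5-small": {"memory_gb": 0.5, "mteb_rank": 117},
}

# Iterate smallest-first when evaluating under a memory budget.
by_size = sorted(EMBEDDING_MODELS, key=lambda m: EMBEDDING_MODELS[m]["memory_gb"])
print(by_size[0])
```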

StephanAkkerman commented 2 weeks ago

FastText: Word vectors for 157 languages

StephanAkkerman commented 1 week ago

https://huggingface.co/intfloat/multilingual-e5-small: all scores are between 0.7 and 1.0, which means we need to find a method to rescale them to something more useful for us
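One simple option for the rescaling is min-max normalization from the observed band back to [0, 1]. The [0.7, 1.0] bounds below are taken from the comment above and would need verifying per model (this is a sketch, not the project's implementation).

```python
def rescale(score, low=0.7, high=1.0):
    """Map a similarity score from [low, high] to [0, 1], clamping outliers.

    The default bounds assume the observed multilingual-e5-small range.
    """
    scaled = (score - low) / (high - low)
    return min(1.0, max(0.0, scaled))


print(rescale(0.7), rescale(0.85), rescale(1.0))
```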