Muennighoff / sgpt

SGPT: GPT Sentence Embeddings for Semantic Search
https://arxiv.org/abs/2202.08904
MIT License

Does it support Korean and Japanese? #35

Open sz2three opened 1 year ago

sz2three commented 1 year ago

It supports Chinese, but does it also work for Korean and Japanese?

Muennighoff commented 1 year ago

You can check Section 4.4 of the MTEB paper (https://arxiv.org/pdf/2210.07316.pdf), where https://huggingface.co/bigscience/sgpt-bloom-7b1-msmarco is benchmarked against other models on many languages, incl. Korean & Japanese. As it hasn't seen much of those languages in pre-training, it performs rather poorly on them.

You may want to use a different model for those languages (check e.g. this leaderboard to see what's best for them: https://huggingface.co/spaces/mteb/leaderboard).
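
If you want to try a candidate model on your own Korean or Japanese data before committing to it, a small ranking check along these lines may help. This is only a sketch: `rank_documents`, `load_encoder` and `demo` are hypothetical helper names, it assumes the `sentence-transformers` package is installed, and the checkpoint name in `demo` is just an example of a multilingual model, not a recommendation from this thread. Swap in whatever the leaderboard suggests for your languages.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Encoder = Callable[[List[str]], List[Vector]]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    encode: Encoder, query: str, documents: List[str]
) -> List[Tuple[str, float]]:
    """Return (document, score) pairs sorted by descending similarity to the query."""
    vectors = encode([query] + documents)
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [
        (doc, cosine_similarity(query_vec, vec))
        for doc, vec in zip(documents, doc_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def load_encoder(model_name: str) -> Encoder:
    """Wrap a sentence-transformers model as an Encoder (downloads it on first use)."""
    # Imported lazily so the ranking helpers work without the optional dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts).tolist()


def demo() -> None:
    """Print similarity scores for a Japanese query over a tiny mixed-language corpus."""
    # Assumption: example checkpoint only; replace it with a model from the leaderboard.
    encode = load_encoder(
        "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    )
    docs = [
        "東京は日本の首都です。",
        "서울은 한국의 수도입니다.",
        "I like pizza.",
    ]
    for doc, score in rank_documents(encode, "日本の首都はどこですか?", docs):
        print(f"{score:.3f}  {doc}")
```

Calling `demo()` downloads the model and prints one score per document. If the relevant sentence does not rank first on a handful of your own queries, that model is probably a poor fit for the language.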