FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

supported language list of BGE-M3 #1228

Open yanfan0531 opened 2 weeks ago

yanfan0531 commented 2 weeks ago

Hi, which languages does BGE-M3 support? Could you please provide a list of supported languages? It seems that the paper does not mention it. Thanks.

hanhainebula commented 1 week ago

Hello, @yanfan0531! For the languages included in MIRACL, MKQA and MLDR, BGE-M3 shows excellent performance. The list of languages (merged from MIRACL, MKQA and MLDR): ['ar', 'bn', 'da', 'de', 'en', 'es', 'fa', 'fi', 'fr', 'he', 'hi', 'hu', 'id', 'it', 'ja', 'km', 'ko', 'ms', 'nl', 'no', 'pl', 'pt', 'ru', 'sv', 'sw', 'te', 'th', 'tr', 'vi', 'yo', 'zh', 'zh_cn', 'zh_hk', 'zh_tw'].

However, as noted in the Limitations section of our paper, the potential variations in BGE-M3's performance across different languages are not thoroughly discussed. If you want to know every language included in the training data of BGE-M3, you can refer to the individual datasets, each of which documents its own languages. For example, the list of languages for multilingual CC-News is available here.
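
For anyone who wants to check a language code against the merged MIRACL/MKQA/MLDR list programmatically, here is a minimal sketch. The names `EVALUATED_LANGUAGES`, `normalize_code` and `is_evaluated_language` are hypothetical helpers written for this thread, not part of FlagEmbedding, and the normalization of codes such as `zh-CN` to `zh_cn` is an assumption about how callers might spell them.

```python
# Language codes copied from the merged MIRACL / MKQA / MLDR list in this thread.
# This is the benchmark list only, not the full set of training languages.
EVALUATED_LANGUAGES = frozenset([
    "ar", "bn", "da", "de", "en", "es", "fa", "fi", "fr", "he", "hi", "hu",
    "id", "it", "ja", "km", "ko", "ms", "nl", "no", "pl", "pt", "ru", "sv",
    "sw", "te", "th", "tr", "vi", "yo", "zh", "zh_cn", "zh_hk", "zh_tw",
])


def normalize_code(code: str) -> str:
    """Lowercase the code and use underscores (assumed convention)."""
    return code.strip().lower().replace("-", "_")


def is_evaluated_language(code: str) -> bool:
    """Return True if the code appears in the merged benchmark list.

    False only means the language is absent from that list; it may
    still be present in the BGE-M3 training data.
    """
    return normalize_code(code) in EVALUATED_LANGUAGES


if __name__ == "__main__":
    for code in ["en", "zh-CN", "el"]:
        print(code, is_evaluated_language(code))
```

A `False` result here should not be read as "unsupported": it only says the language was not part of these three benchmarks.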