Chinese encoding exception: encode('婚礼')=encode('菜单')

UKPLab / sentence-transformers

State-of-the-Art Text Embeddings

https://www.sbert.net

Apache License 2.0

Chinese encoding exception: encode('婚礼')=encode('菜单') #1671

Open on1you opened 2 years ago

on1you commented 2 years ago

As long as two words have the same length, their embeddings are identical.

    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer('msmarco-distilbert-base-v4')
    corpus_embeddings = embedder.encode(['婚礼', '菜单', '招聘', '邀请'], convert_to_tensor=True)

Result: encode('婚礼') = encode('菜单') = [-1.16728060e-01 1.81547254e-01 -1.05594993e-02 -4.06406701e-01.....]

nreimers commented 2 years ago

That model only works for English
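One plausible explanation for the identical vectors, not confirmed in this thread, is that an English-only vocabulary has no entries for these characters, so each one collapses to the same unknown token and words of equal length become indistinguishable. The sketch below illustrates that mechanism with a hypothetical toy vocabulary and a made-up `toy_encode_ids` helper; it is not the real tokenizer behind `msmarco-distilbert-base-v4`.

```python
from __future__ import annotations

# Hypothetical toy vocabulary for illustration only; the real model's
# vocabulary is far larger and its tokenizer is more involved.
TOY_VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "wedding": 3, "menu": 4}


def toy_encode_ids(text: str) -> list[int]:
    """Map text to token IDs, sending anything out of vocabulary to [UNK]."""
    tokens: list[str] = []
    for chunk in text.split():
        if any("\u4e00" <= ch <= "\u9fff" for ch in chunk):
            # Chunks containing CJK characters are split into single characters.
            tokens.extend(chunk)
        else:
            tokens.append(chunk.lower())
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return [TOY_VOCAB["[CLS]"], *ids, TOY_VOCAB["[SEP]"]]


# Two different two-character words yield the same ID sequence,
# so any model fed these IDs would produce the same embedding.
print(toy_encode_ids("婚礼"))
print(toy_encode_ids("菜单"))
```

If the real tokenizer behaves like this for these inputs, the model never sees which characters were typed, only how many there were, which would match the symptom reported above.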

on1you commented 2 years ago

> That model only works for English

Thank you very much. Which models support Chinese? I only found five multilingual models.
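A minimal sketch of how one might check whether a multilingual checkpoint still returns identical vectors for different Chinese words. The model name `paraphrase-multilingual-MiniLM-L12-v2` is an assumption taken from the sentence-transformers pretrained model list, not something recommended in this thread, and `identical_pairs` and `check_model` are hypothetical helper names.

```python
from __future__ import annotations

from itertools import combinations

# Assumption: a multilingual checkpoint; it is not named anywhere in this thread.
MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"


def identical_pairs(words, vectors, tol=1e-6):
    """Return pairs of words whose embeddings match element-wise within tol."""
    pairs = []
    for (w1, v1), (w2, v2) in combinations(zip(words, vectors), 2):
        if len(v1) == len(v2) and all(abs(a - b) <= tol for a, b in zip(v1, v2)):
            pairs.append((w1, w2))
    return pairs


def check_model(model_name, words):
    """Encode words with the given model and list any identical embeddings."""
    # Imported lazily so identical_pairs works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    vectors = model.encode(words).tolist()  # encode returns a NumPy array by default
    return identical_pairs(words, vectors)
```

Calling `check_model(MODEL_NAME, ['婚礼', '菜单', '招聘', '邀请'])` downloads the model on first use. An empty list would suggest the model distinguishes the four words, while the same call with `'msmarco-distilbert-base-v4'` should reproduce the collisions reported above.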

on1you commented 2 years ago

> MS MARCO is a large scale information retrieval corpus that was created based on real user search queries using Bing search engine

Do those real user queries not include any Chinese?