UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

some confusions #1287

Open leileilin opened 2 years ago

leileilin commented 2 years ago

Hi, this is my first contact with sentence embeddings, and I have a question. Does "multilingual" here mean only semantic similarity within one language (Chinese sentences against Chinese sentences, English sentences against English sentences), and not similarity between English sentences and Chinese sentences? Thank you.

nreimers commented 2 years ago

You can use it for any language combination; the language doesn't matter. So you can compare Chinese vs. English.

leileilin commented 2 years ago

> You can use it for any language combination, the language doesn't matter. So you can compare chinese vs English

But when I run the quickstart example:


from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("这是一只戴着帽子的红猫。")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)



The result is:
Cosine-Similarity: tensor([[0.0098]])
The score is very low, even though the two sentences mean almost the same thing.

nreimers commented 2 years ago

Because this is an English-only model.
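
To check this yourself, one option is to run the same sentence pair through both an English-only and a multilingual checkpoint. The sketch below is only an illustration: the helper name `compare` is hypothetical, and `paraphrase-multilingual-MiniLM-L12-v2` is an assumption on my part (it is one of the multilingual models published for sentence-transformers, but it is not named in this thread). Loading either model downloads it on first use.

```python
def compare(model_name: str, sent_a: str, sent_b: str) -> float:
    """Return the cosine similarity of two sentences under the given model.

    `compare` is a hypothetical helper for illustration, not part of the library.
    """
    # Imported here so that defining the helper does not load the library.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)
    # Encoding a list returns one embedding per sentence.
    emb_a, emb_b = model.encode([sent_a, sent_b])
    # util.cos_sim returns a 1x1 tensor for a single pair; convert it to a float.
    return float(util.cos_sim(emb_a, emb_b))
```

Calling `compare("all-MiniLM-L6-v2", english, chinese)` and then `compare("paraphrase-multilingual-MiniLM-L12-v2", english, chinese)` with the two sentences from the quickstart snippet should make the difference between the two kinds of model visible; I am not quoting scores here because I have not run it.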

leileilin commented 2 years ago

> Because this is an English only model

But if I put two Chinese sentences into the model, it still gives good results when computing the similarity.

nreimers commented 2 years ago

Try more Chinese sentences, including dissimilar ones, and you will see that it doesn't work well. The model does not understand Chinese, and most characters are mapped to the unknown-token symbol.
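
A toy sketch of why that makes unrelated Chinese sentences look similar. This is not the tokenizer used by all-MiniLM-L6-v2; `VOCAB`, `UNK`, and `toy_tokenize` are hypothetical names, and the vocabulary is deliberately tiny. Real subword tokenizers are more elaborate, and this toy maps every out-of-vocabulary character to the unknown symbol rather than most of them, but the collapse it shows is the same idea.

```python
# Toy illustration only: a tiny English vocabulary and a naive tokenizer.
# All names here are hypothetical and do not come from sentence-transformers.
VOCAB = {"this", "is", "a", "red", "cat", "with", "hat", "."}
UNK = "[UNK]"


def toy_tokenize(text: str) -> list:
    """Split on whitespace; anything outside VOCAB becomes one UNK per character."""
    tokens = []
    # Separate the ASCII full stop so it becomes its own token.
    for word in text.lower().replace(".", " .").split():
        if word in VOCAB:
            tokens.append(word)
        else:
            # Every unknown character collapses to the same symbol,
            # so the original characters are no longer distinguishable.
            tokens.extend(UNK for _ in word)
    return tokens


english = toy_tokenize("This is a red cat with a hat.")
chinese_a = toy_tokenize("这是一只戴着帽子的红猫。")  # "This is a red cat wearing a hat."
chinese_b = toy_tokenize("今天的天气非常好。")  # "The weather is very nice today."
```

Here `english` keeps all of its words, while `chinese_a` and `chinese_b` both consist of nothing but `[UNK]` and differ only in length. A model that sees inputs like these has almost nothing to tell them apart by, so two Chinese sentences can score as similar whether or not they mean the same thing.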