leileilin opened this issue 3 years ago
You can use it for any language combination; the language doesn't matter. So you can compare Chinese vs. English. A sketch of a cross-lingual comparison follows below.
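For example, with a multilingual checkpoint such as paraphrase-multilingual-MiniLM-L12-v2 (a minimal sketch; the exact score will vary), a Chinese/English pair with the same meaning should get a high cosine similarity:

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual model trained to map 50+ languages into one shared vector space.
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("这是一只戴着帽子的红猫。")  # the same sentence in Chinese

# A cross-lingual pair with the same meaning should score high here.
print("Cosine-Similarity:", util.cos_sim(emb1, emb2))
```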
But in the quickstart:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("这是一只戴着帽子的红猫。")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)
```
The result looks like this:
Cosine-Similarity: tensor([[0.0098]])
It gets a very low score, even though the two sentences are almost the same.
Because this is an English-only model.
But if I put two Chinese sentences into the model, it still gets good results when computing the similarity.
Try some more Chinese sentences, including dissimilar ones, and you will see that it doesn't work well. The model does not understand Chinese, and most characters are mapped to the unknown token symbol.
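You can check this yourself by inspecting the model's tokenizer (a minimal sketch; which characters survive depends on the checkpoint's vocabulary):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Tokenize the Chinese sentence with the model's English tokenizer.
# Characters missing from the vocabulary show up as the unknown token,
# so the resulting embedding carries little of the sentence's meaning.
tokens = model.tokenizer.tokenize("这是一只戴着帽子的红猫。")
print(tokens)
print("Unknown token:", model.tokenizer.unk_token)
```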
Hey, this is my first contact with sentence embeddings, and I have some doubts. Does the multilingualism here refer to the semantic similarity between Chinese sentences and Chinese sentences, and between English sentences and English sentences, rather than the similarity between English sentences and Chinese sentences? Thank you.