Open bendarodes opened 4 months ago
```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

model_name = "boun-tabi-LMG/TURNA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["elma güzel meyvedir", "armut lezzetlidir", "kitap ", "defter", "çelik vida", "banyo dolabı"]

outputs = []

def mean_pooling(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids  # Batch size 1
    print(input_ids)
    decoder_input_ids = tokenizer(sentence, return_tensors="pt").input_ids  # Batch size 1
    with torch.no_grad():
        output = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
    # Average pooling over tokens (note: special tokens are still included)
    return output[0].mean(dim=1)[0]

# Embed each sentence and stack into an (n_sentences, hidden_size) array
for sentence in sentences:
    outputs.append(mean_pooling(sentence))
embeddings = torch.stack(outputs).numpy()

n = len(embeddings)
upper_triangular = np.triu(cosine_similarity(embeddings), k=1)
similarity_matrix = upper_triangular + upper_triangular.T

print("Cosine similarity matrix:")
print(similarity_matrix)
```
I wrote this code, but all the similarity values come out very close to each other, so the scores do not look accurate. I would be very grateful if you could help me understand where I made a mistake.
We haven't yet tested TURNA's performance for generating sentence embeddings. Your proposed approach seems logical. However, it's notable that the paper "Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models" explores various methods for encoding text/sentences using pre-trained T5 models. They found that utilizing the encoder and averaging its token representations performs better than using both encoder and decoder.
An alternative to the suggested method involves using only the encoder by loading it with the `T5EncoderModel` class instead of `AutoModel`. Here's an example of how to obtain embeddings using this method:
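```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

model_name = "boun-tabi-LMG/TURNA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5EncoderModel.from_pretrained(model_name)  # loads the encoder only

def mean_pooling(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids  # Batch size 1
    with torch.no_grad():
        output = model(input_ids=input_ids)
    # Average the encoder's token representations into one sentence vector
    return output.last_hidden_state.mean(dim=1)[0]
```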
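If you embed many sentences at once, padding tokens would otherwise be averaged into the sentence vector. Here is a minimal sketch of batched, mask-aware mean pooling. It assumes the tokenizer can pad a batch and returns an `attention_mask`; the helper names `masked_mean_pool` and `embed_batch` are illustrative and not part of TURNA or `transformers`. Note that this only ignores padding; the end-of-sequence token is still included in the average.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    # (batch, tokens) -> (batch, tokens, 1) so it broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # (batch, 1), avoids divide-by-zero
    return summed / counts

def embed_batch(sentences, tokenizer, model):
    """Hypothetical helper: encode a list of sentences in one forward pass.

    Assumes `model` is an encoder-only model (e.g. loaded with T5EncoderModel)
    and that `tokenizer` has a padding token defined.
    """
    enc = tokenizer(sentences, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask)
    return masked_mean_pool(out.last_hidden_state, enc.attention_mask)
```

With the `tokenizer` and `model` from the snippet above, `embed_batch(sentences, tokenizer, model)` should return one row per sentence.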
Additionally, consider exploring the fine-tuned NLI and STS models for extracting embeddings:
https://huggingface.co/boun-tabi-LMG/turna_nli_nli_tr
https://huggingface.co/boun-tabi-LMG/turna_semantic_similarity_stsb_tr
Please share your findings with us. I'm eager to learn about the results.
Hi @bendarodes, any news?
Hello, I have a problem. I have thousands of sentences, and I want to determine which of them a newly entered sentence is closest to. But I don't want to re-analyze all of the stored sentences every time. For this, I need the encoded values (embeddings) of the sentences, but I couldn't get them. I would be very pleased if you could help me.
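One possible way to avoid re-encoding the stored sentences on every query is to embed them once, save the resulting matrix, and compare each new sentence against it. The sketch below is untested with TURNA. It assumes an `embed` callable that maps a sentence to a 1-D vector (for instance the `mean_pooling` function above); `build_index`, `nearest`, and the file name `index.npy` are illustrative names.

```python
import numpy as np

def build_index(sentences, embed):
    """Embed every stored sentence once; rows are L2-normalized."""
    vectors = np.stack([np.asarray(embed(s), dtype=np.float32) for s in sentences])
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def nearest(query, index, embed):
    """Return (row number, cosine similarity) of the stored sentence closest to `query`."""
    q = np.asarray(embed(query), dtype=np.float32)
    q = q / max(float(np.linalg.norm(q)), 1e-12)
    scores = index @ q            # dot product of unit vectors = cosine similarity
    best = int(np.argmax(scores))
    return best, float(scores[best])

# One-time step (assumed workflow):
#   index = build_index(sentences, mean_pooling); np.save("index.npy", index)
# For each new sentence, only that sentence is encoded:
#   index = np.load("index.npy"); row, score = nearest("yeni cümle", index, mean_pooling)
```

Because the rows are normalized when the index is built, each query costs one model forward pass plus a single matrix-vector product.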