boun-tabi-LMG / turkish-lm-tuner

Turkish LM Tuner
https://boun-tabi-lmg.github.io/turkish-lm-tuner/
MIT License

Multiple Semantic Textual Similarity Problem #71

Open bendarodes opened 4 months ago

bendarodes commented 4 months ago

Hello, I have a problem. I have thousands of sentences, and I want to determine which of these sentences a newly entered sentence is closest to. However, I don't want to re-analyze all 1000 sentences every time, so I need the encoded values (embeddings) of the sentences, but I haven't been able to obtain them. I would be very pleased if you could help me.
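To make the goal concrete, the lookup I have in mind would look roughly like this, with random placeholder arrays standing in for the real embeddings (the 1024 dimension is only an assumption):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder data; in practice these would be embeddings produced by the model.
rng = np.random.default_rng(0)
corpus_embeddings = rng.normal(size=(1000, 1024))  # computed once and stored
query_embedding = rng.normal(size=(1, 1024))       # embedding of the newly entered sentence

# Compare the new sentence against the stored embeddings only.
scores = cosine_similarity(query_embedding, corpus_embeddings)[0]
closest_index = int(np.argmax(scores))
print(closest_index, scores[closest_index])
```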

bendarodes commented 3 months ago

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

# Load T5 model and tokenizer
model_name = "boun-tabi-LMG/TURNA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Sample sentences
sentences = ["elma güzel meyvedir", "armut lezzetlidir", "kitap", "defter", "çelik vida", "banyo dolabı"]

# Encode a sentence by mean pooling the model's token representations
def mean_pooling(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids  # Batch size 1
    decoder_input_ids = tokenizer(sentence, return_tensors="pt").input_ids  # Batch size 1
    with torch.no_grad():
        output = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
    # Average pooling over tokens
    return output[0].mean(dim=1)[0]

# Build the embedding matrix for all sentences
embeddings = np.stack([mean_pooling(s).numpy() for s in sentences])
n = len(embeddings)

# Compute cosine similarity for the upper triangular part
upper_triangular = np.triu(cosine_similarity(embeddings), k=1)

# Fill the lower triangular part
similarity_matrix = upper_triangular + upper_triangular.T

# Print the similarity matrix
print("Cosine similarity matrix:")
print(similarity_matrix)
```

I wrote this code. However, all the similarity values come out very close to each other, and the similarity scores do not seem accurate. I would be very grateful if you could help me understand where I made a mistake.

gokceuludogan commented 3 months ago

We haven't yet tested TURNA's performance for generating sentence embeddings. Your proposed approach seems logical. However, it's notable that the paper "Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models" explores various methods for encoding text/sentences using pre-trained T5 models. They found that utilizing the encoder and averaging its token representations performs better than using both encoder and decoder. An alternative to the suggested method involves using only the encoder by loading it with the T5EncoderModel class instead of AutoModel. Here's an example of how to obtain embeddings using this method:

```python
from transformers import T5EncoderModel
# Encoder-only model; reuse the tokenizer already loaded above
model = T5EncoderModel.from_pretrained("boun-tabi-LMG/TURNA")

def mean_pooling(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt").input_ids  # Batch size 1
    with torch.no_grad():
        output = model(input_ids=input_ids)
    # Average the encoder token representations into a sentence embedding
    return output.last_hidden_state.mean(dim=1)[0]
```
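Building on the mean_pooling function above, the corpus embeddings can be computed once and cached, so that only the newly entered sentence needs to be encoded at query time. Here is a rough, untested sketch; the file name and example sentences are arbitrary:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Offline step: encode the stored sentences once and cache the result.
corpus = ["elma güzel meyvedir", "armut lezzetlidir", "çelik vida", "banyo dolabı"]
corpus_embeddings = np.stack([mean_pooling(s).numpy() for s in corpus])
np.save("corpus_embeddings.npy", corpus_embeddings)

# Online step: encode only the new sentence and compare it to the cached embeddings.
corpus_embeddings = np.load("corpus_embeddings.npy")
query_embedding = mean_pooling("armut çok lezzetlidir").numpy().reshape(1, -1)
scores = cosine_similarity(query_embedding, corpus_embeddings)[0]
print("Closest sentence:", corpus[int(np.argmax(scores))])
```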

Additionally, consider exploring the finetuned NLI and STS models for extracting embeddings:

- https://huggingface.co/boun-tabi-LMG/turna_nli_nli_tr
- https://huggingface.co/boun-tabi-LMG/turna_semantic_similarity_stsb_tr
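For instance, embeddings could be extracted from the STS-finetuned checkpoint in the same encoder-only fashion. This is untested; it assumes the finetuned checkpoint can be loaded with T5EncoderModel just like the base model:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Assumption: the STS-finetuned checkpoint loads encoder-only like the base TURNA model.
sts_name = "boun-tabi-LMG/turna_semantic_similarity_stsb_tr"
sts_tokenizer = AutoTokenizer.from_pretrained(sts_name)
sts_encoder = T5EncoderModel.from_pretrained(sts_name)

def sts_embed(sentence):
    input_ids = sts_tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        output = sts_encoder(input_ids=input_ids)
    # Mean pooling over the encoder token representations, as before
    return output.last_hidden_state.mean(dim=1)[0]
```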

Please share your findings with us. I'm eager to learn about the results.

onurgu commented 2 months ago

Hi @bendarodes, any news?