UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Semantic search with fine-tuned RoBERTa #603

Closed Shafi2016 closed 3 years ago

Shafi2016 commented 3 years ago

Hello @nreimers, I am looking for a few clarifications. I have fine-tuned a RoBERTa language model on unlabeled newspaper articles using the Hugging Face library, and I am now combining the fine-tuned RoBERTa with Sentence-Transformers for semantic search, selecting articles based on cosine similarity to a given query. I am getting very good results. First, is this the correct way of doing it? Second, I do not understand what is meant by "Apply mean pooling to get one fixed-sized sentence vector".

Third, I want to understand the pooling part. Following your website notes: we use the RoBERTa model to map the tokens in a sentence to the output embeddings from RoBERTa. The next layer in our model is a mean pooling layer, that is, we simply average all the contextualized word embeddings RoBERTa gives us. Each sentence is then passed first through the word_embedding_model and then through the pooling_model to give a fixed-sized sentence vector.
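
To make the pooling step concrete, here is a minimal sketch of mean pooling over RoBERTa token embeddings (illustrative only; the PyTorch tensors `token_embeddings` and `attention_mask` are assumptions, not part of the original example):

```python
import torch

def mean_pooling(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, hidden) output of RoBERTa
    # attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per sentence
    return summed / counts                           # one fixed-sized vector per sentence
```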

```python
# https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py
from sentence_transformers import SentenceTransformer
from sentence_transformers import models, losses
import scipy.spatial
import pickle as pkl

word_embedding_model = models.RoBERTa("/content/drive/MyDrive/Ottawa_citit")

# Apply mean pooling to get one fixed-sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Corpus with example sentences (df_sentences_list is prepared earlier, outside this snippet)
corpus = df_sentences_list
corpus_embeddings = model.encode(corpus, show_progress_bar=True)
with open("/content/corpus_finetuned_embeddings.pkl", "wb") as f:
    pkl.dump(corpus_embeddings, f)
```
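
For the query side (not shown above), the unused scipy.spatial import suggests the cosine-distance search from the linked example; a minimal sketch under the same setup, with an illustrative query string:

```python
query = "city council approves new budget"   # illustrative query
query_embedding = model.encode([query])[0]

# Cosine distance between the query and every corpus embedding
distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

# Smallest distance = highest cosine similarity
top_k = 5
for idx in sorted(range(len(distances)), key=lambda i: distances[i])[:top_k]:
    print(corpus[idx], "(similarity: %.4f)" % (1 - distances[idx]))
```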

nreimers commented 3 years ago

Yes, this is the right way to do semantic search, and you explained the pooling correctly.

Shafi2016 commented 3 years ago

Great, Thank you so much!!

NicolaiSchmid commented 3 years ago

Hey @Shafi2016, could you point me in the direction of how you fine-tuned the RoBERTa model? I'm in the middle of a similar task and struggling to find a solution. Thank you very much!

Shafi2016 commented 3 years ago

Hello @NicolaiSchmid, sorry for the delay in replying. I used an old version, so things might have been updated by now. Let me know if this works for you.


```python
# Install pinned versions used at the time (Colab cell)
!pip install -U sentence-transformers
!pip install pyarrow==1.0.*
!pip install -U transformers==3.5.1

# Clone the transformers repo and check out the matching tag
# (checkout must run inside the cloned directory)
!git clone https://github.com/huggingface/transformers
import os
os.chdir('/content/transformers')
!git checkout v3.5.1

!pip install .
!pip install -r ./examples/requirements.txt

os.chdir('/content/transformers/examples')

# Fine-tune roberta-base with masked language modeling (--mlm) on the raw text corpus
!python "/content/transformers/examples/contrib/legacy/run_language_modeling.py" \
    --output_dir "/content/drive/MyDrive/Vancouver" \
    --model_name_or_path roberta-base \
    --do_train \
    --per_gpu_train_batch_size 8 \
    --seed 42 \
    --train_data_file "/content/input_textOC.txt" \
    --block_size 256 \
    --line_by_line \
    --learning_rate 6e-4 \
    --num_train_epochs 3 \
    --save_total_limit 2 \
    --save_steps 200 \
    --weight_decay 0.01 \
    --mlm
```
```python
# https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py
from sentence_transformers import SentenceTransformer
from sentence_transformers import models, losses
import scipy.spatial
import pickle as pkl

# Load the fine-tuned RoBERTa checkpoint as the word embedding model
word_embedding_model = models.RoBERTa("/content/drive/MyDrive/Ottawa_citit")

# Apply mean pooling to get one fixed-sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```

sanjay23singh commented 3 years ago

@Shafi2016 hey, can you share the code? The GitHub link doesn't work now.

nreimers commented 3 years ago

@sanjay23singh You can find the docs here: https://www.sbert.net/examples/applications/semantic-search/README.html
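
For reference, current sentence-transformers versions ship a helper for this; a minimal sketch using util.semantic_search (the query string and top_k are illustrative, and `corpus` is the list of sentences from the earlier snippet):

```python
from sentence_transformers import util

query_embedding = model.encode("city council approves new budget", convert_to_tensor=True)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Returns, per query, a list of {'corpus_id': ..., 'score': ...} sorted by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=5)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])
```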

Shafi2016 commented 3 years ago

@sanjay23singh, the code above works with an older version of sentence-transformers.
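
In newer releases, the architecture-specific wrappers such as models.RoBERTa were replaced by the generic models.Transformer; a rough equivalent of the block above for current versions (the checkpoint path is the same illustrative one as before):

```python
from sentence_transformers import SentenceTransformer, models

# models.Transformer loads any Hugging Face checkpoint, including a fine-tuned RoBERTa on disk
word_embedding_model = models.Transformer("/content/drive/MyDrive/Ottawa_citit", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```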