Yes, that is the right way to do semantic search. You explained the pooling correctly.
Great, Thank you so much!!
Hey @Shafi2016, could you point me in the direction of how you fine-tuned the RoBERTa model? I'm in the middle of a similar task and struggling to find a solution. Thank you very much!
Hello @NicolaiSchmid, sorry for the delay in replying. I used an older version; things might have changed since. Let me know if this works for you.
!pip install -U sentence-transformers
!pip install pyarrow==1.0.*
!pip install -U transformers==3.5.1
# Clone the repo and pin it to the same release as the pip install
!git clone https://github.com/huggingface/transformers
import os
os.chdir('/content/transformers')
!git checkout v3.5.1
!pip install .
!pip install -r ./examples/requirements.txt
os.chdir('/content/transformers/examples')
!python "/content/transformers/examples/contrib/legacy/run_language_modeling.py" \
--output_dir "/content/drive/MyDrive/Vancouver" \
--model_name_or_path roberta-base \
--do_train \
--per_gpu_train_batch_size 8 \
--seed 42 \
--train_data_file "/content/input_textOC.txt" \
--block_size 256 \
--line_by_line \
--learning_rate 6e-4 \
--num_train_epochs 3 \
--save_total_limit 2 \
--save_steps 200 \
--weight_decay 0.01 \
--mlm
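(For anyone on a recent transformers release: that legacy script was later replaced by run_mlm.py under examples/pytorch/language-modeling. A roughly equivalent call, offered only as a sketch to adapt against whichever version you actually install, would be:)
!python /content/transformers/examples/pytorch/language-modeling/run_mlm.py \
--model_name_or_path roberta-base \
--train_file /content/input_textOC.txt \
--line_by_line \
--do_train \
--output_dir /content/mlm_out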
#https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py
from sentence_transformers import SentenceTransformer
from sentence_transformers import models, losses
import scipy.spatial
import pickle as pkl
# Load the fine-tuned RoBERTa checkpoint as the word-embedding layer
word_embedding_model = models.RoBERTa("/content/drive/MyDrive/Ottawa_citit")
# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
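To close the loop, here is a sketch of the retrieval step with this old API (the corpus and query below are made-up placeholders; scipy.spatial was imported for exactly this):
# Hypothetical corpus and query, just to show the cosine-distance lookup
corpus = ["First newspaper article ...", "Second newspaper article ..."]
corpus_embeddings = model.encode(corpus, show_progress_bar=True)
query_embedding = model.encode(["city council budget vote"])[0]

# cdist returns cosine *distances*; similarity = 1 - distance
distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
best = distances.argmin()
print(corpus[best], 1 - distances[best])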
@Shafi2016 Hey, can you share the code? The GitHub link doesn't work anymore.
@sanjay23singh You can find the docs here: https://www.sbert.net/examples/applications/semantic-search/README.html
@sanjay23singh, the above works with an older version of sentence-transformers.
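With a current sentence-transformers release the same search can be sketched roughly as follows (the model name, corpus, and query are placeholders, not from this thread):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')  # any pretrained or fine-tuned model
corpus = ["First article ...", "Second article ..."]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("my search query", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits[0])  # list of {'corpus_id': ..., 'score': ...} sorted by similarity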
Hello @nreimers, I am looking for a few clarifications. I have fine-tuned a RoBERTa language model on unlabeled newspaper articles using the Hugging Face library, and I am now combining the fine-tuned RoBERTa with sentence-transformers for semantic search, selecting articles by cosine similarity to a given query. I am getting very good results. First, is this the correct way of doing it? Second, I do not understand what is meant by "apply mean pooling to get one fixed-sized sentence vector".
Third, I want to check my understanding of the pooling part, following your website notes. We use the RoBERTa model to map the tokens in a sentence to the output embeddings from RoBERTa. The next layer in our model is mean pooling, that is, we simply average all the contextualized word embeddings RoBERTa gives us. Each sentence is then passed first through word_embedding_model and then through pooling_model to give a fixed-sized sentence vector (a toy numeric example of this pooling follows the code below).
# https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py
from sentence_transformers import SentenceTransformer
from sentence_transformers import models, losses
import scipy.spatial
import pickle as pkl

word_embedding_model = models.RoBERTa("/content/drive/MyDrive/Ottawa_citit")

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Corpus with example sentences
corpus = df_sentences_list
corpus_embeddings = model.encode(corpus, show_progress_bar=True)
with open("/content/corpus_finetuned_embeddings.pkl", "wb") as f:
    pkl.dump(corpus_embeddings, f)
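To make the mean-pooling step concrete, here is a toy, self-contained illustration (shapes and numbers invented; this is what pooling_mode_mean_tokens does conceptually, with padding masked out):
import torch

# 1 sentence, 4 token positions, hidden size 3; the third position is padding
token_embeddings = torch.tensor([[[1.0, 2.0, 3.0],
                                  [3.0, 2.0, 1.0],
                                  [0.0, 0.0, 0.0],
                                  [2.0, 2.0, 2.0]]])
attention_mask = torch.tensor([[1, 1, 0, 1]])            # 0 marks padding

mask = attention_mask.unsqueeze(-1).float()              # (1, 4, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embedding)                                # tensor([[2., 2., 2.]])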