sentence-transformers vs. transformers for semantic similarity

UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT

Apache License 2.0

14.72k stars 2.43k forks

Given ~5000 paper abstracts (mechanical engineering domain), I want to find the 100 most similar ones. After some research I found a few options:

Use the transformers library from Hugging Face with any language model (e.g. distilbert-base-uncased), take the [CLS] embedding of each abstract, and find the most similar embeddings, possibly with FAISS to speed up the search
Feed the two texts of each pair into BERT together (slow)
Use the sentence-transformers library, which seems to be made for this task, with any available model (maybe AllenAI SPECTER)
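
To make the embedding-based options concrete, here is a minimal sketch of the retrieval step they share: embed every abstract once, then rank by cosine similarity. It assumes "most similar" means the nearest neighbours of one chosen abstract. The function names `embed_abstracts` and `top_k_similar` are hypothetical, and the default model name `allenai-specter` is only an assumption based on the SPECTER idea above.

```python
import numpy as np


def embed_abstracts(texts, model_name="allenai-specter"):
    """Encode texts into L2-normalized vectors with sentence-transformers.

    The import is local so that the search helper below can be used without
    the library installed. The model name is an assumption, not a recommendation.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)


def top_k_similar(embeddings, query_index, k=100):
    """Return (indices, scores) of the k rows most similar to row `query_index`."""
    emb = np.asarray(embeddings, dtype=np.float32)
    # Normalize rows so that the dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.clip(norms, 1e-12, None)
    scores = emb @ emb[query_index]
    scores[query_index] = -np.inf  # never return the query itself
    k = min(k, len(scores) - 1)
    # Select the k best candidates, then sort only those by descending score.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]
```

With the abstracts in a list called `abstracts` (a hypothetical name), `top_k_similar(embed_abstracts(abstracts), query_index=0)` would return the 100 nearest neighbours of the first abstract. For ~5000 abstracts the brute-force matrix product is small enough that FAISS is optional rather than necessary.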

Can someone clarify:

Why is the first approach not recommended? And if the sentence-transformers models generate more meaningful embeddings, why aren't they used for classification tasks as well?
If I have a few labelled examples of similar texts from my domain, can I fine-tune the sentence models, and if so, how?

sentence-transformers vs. transformers for semantic similarity #1477