UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Creating Paragraph/Document Embedding with SBERT? #146

Open tyker1 opened 4 years ago

tyker1 commented 4 years ago

Hi, first of all thanks for this awesome library. I'm new to the NLP field, so if the following question is naive, please forgive me.

Since BERT works on one sentence or two sentences (if my understanding is correct), and SBERT is based on BERT, is it possible to use SBERT to get embeddings for a paragraph or a document?

And as I see in your publication, it is a siamese network (I understand this to mean that the pretrained BERT used for sentence 1 and sentence 2 is the same, i.e. the weights are shared). Is it possible to implement a pseudo-siamese network here, so that it can handle more complicated problems, like accepting a document as sentence 1 and the document title as sentence 2 and judging whether the document content fits the title?

nreimers commented 4 years ago

Hi, BERT works on any text up to 510 word-piece tokens (which is about 300 - 400 words), so you can also input longer texts into BERT.

Further, BERT provides the option of inputting two texts: text1 [SEP] text2

Here, text1 and text2 can also be more than one sentence (or shorter than one sentence).
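To make this concrete, here is a minimal sketch (my own illustration, not from this thread) using the Hugging Face transformers tokenizer: it shows how two texts are packed as text1 [SEP] text2 and truncated to BERT's 512-token limit (510 word-piece tokens plus the special tokens).

```python
# Illustration only: BERT's tokenizer packs a text pair as
# "[CLS] text1 [SEP] text2 [SEP]" and truncates to the 512-token limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

document = "A longer paragraph, possibly several hundred words ..."
title = "Document title"

encoded = tokenizer(document, title, truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```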

For your pseudo-siamese network I wrote some explanation here: https://github.com/UKPLab/sentence-transformers/issues/107#issuecomment-576587421

You add a layer (e.g. a dense layer) that applies its function only if the input has certain properties. This way you can handle asymmetric inputs like doc<->title or doc<->query.
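A minimal sketch of what such a conditional layer could look like, assuming a custom module in the style of the sentence-transformers modules (which pass a features dictionary through the pipeline); the features['input_type'] flag mentioned in the linked issue is something you would have to set yourself and is only used here for illustration:

```python
from torch import nn

class ConditionalDense(nn.Module):
    """Dense layer that is only applied to one side of an asymmetric pair."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # out_features should match the embedding size if both sides are later
        # compared with cosine similarity.
        self.linear = nn.Linear(in_features, out_features)
        self.activation = nn.Tanh()

    def forward(self, features: dict) -> dict:
        # Transform the embedding only for e.g. documents; titles/queries pass
        # through unchanged, giving the pseudo-siamese behaviour.
        if features.get("input_type") == "document":
            features["sentence_embedding"] = self.activation(
                self.linear(features["sentence_embedding"])
            )
        return features
```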

Best Nils Reimers

tyker1 commented 4 years ago

Hello @nreimers ,

I checked the explanation in the comment you mentioned, but I still have some questions (here comes the detailed description):

You said that such an asymmetrical network could be achieved by adding a dense layer which is active only when the input data carries a specific mark, like features['input_type'] == 'document'.

What I'm facing is a relatively small dataset for doing doc -> title matching: for each title there is a given anchor doc that describes it, e.g. a title "Car" and a reference document that describes each function of a "Car". The goal is, given a doc that describes something, to judge whether it belongs to (in this case) the "Car" document.

Here is an example: the anchor document for "Car" says "cars have ABS", and the test document says "cars have a system that prevents the wheels from locking up during emergency braking, so that the car won't lose control while braking".

Since the dataset itself is relatively small (about 3000 paragraphs in total, including the anchor docs, for more than 10 doc classes), I think it would be beneficial to continue training from a pre-trained model (e.g. one trained on STS or NLI), upgrade it into an asymmetrical network, etc.

The question is: if I understood the code correctly, I can only add layers when I use the constructor SentenceTransformer([Model1, Model2, ..., ModelN]). When I load a pre-trained model, how can I add the asymmetrical dense layer to it?

Or maybe, since the task is in fact to match docs to the functions that describe the title in the given anchor, it would be better to use the normal cosine-similarity model you provide (or softmax)?

I'd really appreciate it if you could also give some suggestions on the problem I'm facing (meaning the question in the last paragraph).

Thank You!

nreimers commented 4 years ago

> The question is: if I understood the code correctly, I can only add layers when I use the constructor SentenceTransformer([Model1, Model2, ..., ModelN]). When I load a pre-trained model, how can I add the asymmetrical dense layer to it?

Hi, you would create your SentenceTransformer model from scratch: SentenceTransformer([BERT, Pooling, Dense])

However, for BERT, you can load a BERT that was pre-trained on e.g. NLI data. For this, download the model you need: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/

Unzip it and then set your BERT model like this: bert = models.BERT('path/to/unzipped/sbert-model/0_BERT')

Then you have a BERT model that was previously fine-tuned on a specific task.
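Putting the steps above together, here is a minimal sketch using the older models.BERT-style API described in this thread (the path is the placeholder from above, and the dense layer is just an example of where an extra layer could be added):

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

# BERT weights taken from an unzipped pre-trained SBERT model (e.g. trained on NLI)
bert = models.BERT('path/to/unzipped/sbert-model/0_BERT')

# Mean pooling over the word-piece embeddings to get a fixed-size sentence embedding
pooling = models.Pooling(bert.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True)

# Additional dense layer on top (this is where an asymmetric/conditional layer could go)
dense = models.Dense(in_features=pooling.get_sentence_embedding_dimension(),
                     out_features=pooling.get_sentence_embedding_dimension(),
                     activation_function=nn.Tanh())

model = SentenceTransformer(modules=[bert, pooling, dense])
```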

Best Nils Reimers

tmishinev commented 1 year ago

Hello, I'm trying to use msmarco-distilbert-base-v4 / msmarco-distilbert-dot-v5 for semantic search. Which is the more appropriate way to get the embeddings: concatenating title + text, or using title and text as separate sequences?

corpus = corpus['title'] + ' ' + corpus['text']
model.encode(corpus)

or

SEQ1 = corpus['title']
SEQ2 = corpus['text']
model.encode([[SEQ1, SEQ2], [SEQ1, SEQ2], [SEQ1, SEQ2], ...])

Thanks in advance. I have seen both approaches, but the second one uses nq-distilbert-base-v1, and I can't tell whether it is specific to that model or whether it can also be used with the MS MARCO bi-encoders.

https://colab.research.google.com/drive/11GunvCqJuebfeTlgbJWkIMT0xJH6PWF1?usp=sharing