UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

How to use it for medium to large answers of multiple sentences? #340

Open ankitkr3 opened 4 years ago

ankitkr3 commented 4 years ago

Hi, I want to calculate the similarity between two answers that could each be more than two sentences long. How can I achieve this using your models? Please help.

Thanks in advance, Ankit.

nreimers commented 4 years ago

Hi @ankitkr3 You can pass multiple sentences as one string to model.encode. That is no issue.
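
For example, a minimal sketch (the model name here is just an illustrative choice; any pretrained model works):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')  # illustrative model choice

answer_a = "First sentence of the answer. A second sentence adds detail."
answer_b = "A paraphrased answer. It also spans two sentences."

# encode() accepts each multi-sentence answer as a single string
emb_a, emb_b = model.encode([answer_a, answer_b])

# cosine similarity between the two paragraph embeddings
cos = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(cos)
```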

Note that, by default, the models we provide have a word piece limit of 128. You could increase that. In general, BERT & co. have a word piece limit of 512 tokens.
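
In recent versions of the library you can raise the limit directly on a loaded model (256 below is an arbitrary choice up to BERT's 512-token maximum):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
print(model.max_seq_length)  # 128 by default for these models
model.max_seq_length = 256   # anything up to the 512-token BERT limit
```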

Further, the models provided here were only trained on single sentences. For longer paragraphs, the representational power might not be as good.

Best, Nils Reimers

ankitkr3 commented 4 years ago

@nreimers Is there any approach to improve the representational power for longer paragraphs?

nreimers commented 4 years ago

Hi @ankitkr3 In general, there are no universally usable text representations. What type of text representation you need always depends heavily on your task: do you want to perform Information Retrieval, Clustering, Question Answering, Semantic Textual Similarity, or Paraphrase Mining?

For each of these tasks, embeddings with different properties are useful.

Currently we train models for Information Retrieval & Question Answering, where you are given a query / question and want to find a matching paragraph in a large corpus like Wikipedia. For this task, several datasets exist, such as Natural Questions and MS MARCO.

However, in general I am not aware of many datasets that work on paragraphs.

ankitkr3 commented 4 years ago

@nreimers I basically want to evaluate answers semantically (for example, comparing a given reference answer against a written answer). Currently I have a dataset of answer pairs (main and duplicate) labeled with the classes neutral and negative. How can I train on it so that your vector representations become powerful for large paragraphs?

pashok3d commented 4 years ago

I guess the simplest way is to take the average of the sentence embeddings from your two answers (see the sketch below). Also, since you are comparing two answers, you can look at the Word Mover's Distance algorithm. Basically, BERT's training procedure with Next Sentence Prediction can be related to your case. As for more sophisticated approaches, I think you can search for papers where BERT uses the context of neighbouring sentences or somehow deals with multiple sentences as input.
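
A rough sketch of the averaging idea (the sentence splitting, model choice, and example answers are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')  # placeholder model

def paragraph_embedding(paragraph):
    # crude sentence split; nltk.sent_tokenize would be more robust
    sentences = [s.strip() for s in paragraph.split('.') if s.strip()]
    embeddings = model.encode(sentences)  # one vector per sentence
    return np.mean(embeddings, axis=0)    # average into one paragraph vector

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

main = "The mitochondria produce energy. They are found in most cells."
duplicate = "Energy is generated by mitochondria. Most cells contain them."
print(cosine(paragraph_embedding(main), paragraph_embedding(duplicate)))
```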

ankitkr3 commented 4 years ago

@nreimers I don't understand how Next Sentence Prediction would really help. I just want to compare the semantic similarity between two paragraphs. Please bear with me if I sound naive here.

pashok3d commented 4 years ago

Try averaging your sentence embeddings. You can also try WMD. Both are easy to experiment with. If they don't work, then I suggest searching for more sophisticated approaches.
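
For WMD, a minimal sketch with gensim (the pretrained vectors and the naive tokenization are assumptions; older gensim versions also require the pyemd package):

```python
import gensim.downloader as api

# Assumption: pretrained word2vec vectors fetched via gensim's downloader
wv = api.load('word2vec-google-news-300')

def tokenize(text):
    # naive lowercase tokenization; real pipelines usually drop stopwords too
    return text.lower().split()

a = tokenize("The mitochondria produce energy for the cell.")
b = tokenize("Energy for the cell is generated by mitochondria.")

# Lower distance means more similar documents
print(wv.wmdistance(a, b))
```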

ankitkr3 commented 4 years ago

@nreimers @pashok3d Is there any script available in this repository for calculating average embeddings?