UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Which approach to go for: cross-encoder vs. bi-encoder #1707

Open ud2195 opened 2 years ago

ud2195 commented 2 years ago

Hi, thank you for creating such an awesome library.

My question: I have a task where each data point consists of an abbreviation, a sense, and a text.

For example: AB, abortion, "The patient does have a known history of having had a missed AB."

These abbreviations can have multiple senses; for example, 'AB' can mean 'abortion' but can also mean 'blood group in the ABO system'.

I have many such abbreviations, each with multiple senses. In this scenario, if I have to predict which sense an abbreviation carries given the full text, what should I use?

If I use a cross-encoder, that means treating it as a sentence-pair classification task: passing each whole pair through BERT at once, multiple times, and comparing the full text against each and every sense of the abbreviation. For example:

For the abbreviation AB and the sentence "patient does have a known history of having had a missed AB":

sentence1                                                      sentence2                   label
patient does have a known history of having had a missed AB   abortion                    1
patient does have a known history of having had a missed AB   blood group in ABO system   0

I am not able to justify this approach at scale, since every candidate sense has to be passed through BERT together with the text at query time.
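To make the concern concrete, here is a minimal sketch of that cross-encoder setup (the base model name, labels, and hyperparameters are placeholder assumptions, not anything prescribed by the library):

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Placeholder base model; num_labels=1 gives a single relevance score per pair.
model = CrossEncoder("bert-base-uncased", num_labels=1)

train_examples = [
    InputExample(
        texts=["patient does have a known history of having had a missed AB", "abortion"],
        label=1.0,
    ),
    InputExample(
        texts=["patient does have a known history of having had a missed AB", "blood group in ABO system"],
        label=0.0,
    ),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=100)

# Inference: every candidate sense must be paired with the text and re-scored
# through the full model, which is the part that is costly at scale.
scores = model.predict([
    ["patient does have a known history of having had a missed AB", "abortion"],
    ["patient does have a known history of having had a missed AB", "blood group in ABO system"],
])
```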

OR

Alternatively, I can go with the bi-encoder approach, where I train a SentenceTransformer model with https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss (since I only have positive pairs), following https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py.
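For reference, a minimal training sketch of that setup (the base model name and hyperparameters are placeholder assumptions):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder base model; any suitable pretrained checkpoint would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Positive (sense, text) pairs only; other pairs in the batch act as negatives.
train_examples = [
    InputExample(texts=["abortion", "patient does have a known history of having had a missed AB"]),
    # ... one InputExample per positive (sense, text) pair
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```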

Then, at inference time, given the sentence and the abbreviation, I can compute the similarity between the sentence and each sense of that abbreviation (storing the sense vectors beforehand) and return the sense with the highest cosine similarity.
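A sketch of that inference step (the model path and sense list are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("output/abbrev-bi-encoder")  # placeholder path

# Pre-compute and cache the sense embeddings once.
senses = ["abortion", "blood group in ABO system"]
sense_embeddings = model.encode(senses, convert_to_tensor=True)

# At query time, only the input text needs to be embedded.
text = "patient does have a known history of having had a missed AB"
text_embedding = model.encode(text, convert_to_tensor=True)

scores = util.cos_sim(text_embedding, sense_embeddings)  # shape: (1, num_senses)
predicted_sense = senses[scores.argmax().item()]
```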

Which approach would be better in the above scenario? Thank you in advance!

ud2195 commented 2 years ago

Also, if I go ahead with MultipleNegativesRankingLoss, is it right to input the data in this format:

sent1      sent2
abortion   patient does have a known history of having had a missed AB
abortion   patient just had an AB
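In code, those rows would become InputExample pairs like this (a sketch; note that both anchors are the same sense, which is what makes the quoted documentation below relevant):

```python
from sentence_transformers import InputExample

# sent1 = sense (anchor), sent2 = full text (positive)
train_examples = [
    InputExample(texts=["abortion", "patient does have a known history of having had a missed AB"]),
    InputExample(texts=["abortion", "patient just had an AB"]),
]
```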

As per the documentation: "This loss expects as input a batch consisting of sentence pairs (a_1, p_1), (a_2, p_2), …, (a_n, p_n), where we assume that (a_i, p_i) is a positive pair and (a_i, p_j) for i != j is a negative pair."

So with my data above, (a_1, p_2) would be treated as a negative pair even though 'abortion' is also the correct sense for p_2, which would be wrong. Does that mean I should only add one pair per sense, and NOT multiple pairs with the same sense?
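For what it's worth, the training_nli_v2.py example linked above appears to face the same situation (one anchor can have several positives) and uses a NoDuplicatesDataLoader, which only builds batches in which no text appears twice; a minimal sketch with placeholder model name and hyperparameters:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, datasets

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder

train_examples = [
    InputExample(texts=["abortion", "patient does have a known history of having had a missed AB"]),
    InputExample(texts=["abortion", "patient just had an AB"]),
    # ... more (sense, text) pairs; repeated senses are allowed
]

# NoDuplicatesDataLoader skips examples whose texts already occur in the
# current batch, so two pairs anchored on "abortion" never land in the same
# batch and cannot create a false in-batch negative.
train_dataloader = datasets.NoDuplicatesDataLoader(train_examples, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```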