UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

How to pass more than sentence pairs to InputExamples for fine-tuning? #2262

Open cyriltw opened 1 year ago

cyriltw commented 1 year ago

I have more information about each data point, such as language and other contextual data, that could potentially help (maybe) with our task. The task is to generate sentence-similarity embeddings and labels.

For the time being, I was able to expand the InputExample construction to feed these extra features into the input:

from sentence_transformers import InputExample

# Each row: (sentence1, sentence2, textcategory1, label)
train_data = [("sentence1", "sentence2", "textcategory1", 0.8)]
train_examples = [InputExample(texts=[x[0], x[1], x[2]], label=x[3]) for x in train_data]

With this, textcategory1 gets encoded as well, appended at the end of the input example in the form sentence1; sentence2; textcategory1, separated by ;.
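
For reference, printing an InputExample seems to join the stored texts with ;, which is where that form shows up:

from sentence_transformers import InputExample
ex = InputExample(texts=["sentence1", "sentence2", "textcategory1"], label=0.8)
print(ex)  # prints something like: <InputExample> label: 0.8, texts: sentence1; sentence2; textcategory1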

  1. How does this impact the overall input to the model, since it no longer sees just a sentence pair?
  2. Does the fine-tuning layer see the two sentences as a pair, or as a single input with a label?
  3. Even though this works, if it is not the correct approach, how do I include this kind of information in fine-tuning? I.e., how can I use textcategory1 as a feature without messing with the embeddings?
carlesoctav commented 1 year ago

What does textcategory1 describe? Do both sentence1 and sentence2 have the category textcategory1, or just one of them?

  1. It's a giveaway about the topic. The vector space will probably be more separated by topic, but I guess it won't help much in terms of metric evaluation.

  2. I rarely use the sentence-transformers interface, so I can't say for sure. But in general, you need two separate sentences to train a bi-encoder, since it encodes each sentence independently and tries to minimize the distance between similar texts and maximize it between dissimilar ones (see the first sketch below).

  3. You can try prepending the textcategory to each sentence, e.g. 'textcategory: sentence1' and 'textcategory: sentence2', and then fine-tune the usual way (see the second sketch below). This is inspired by the instructor-base way of encoding text.
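
For item 2, here is a minimal sketch of the usual pair-based fine-tuning loop, assuming the classic InputExample / model.fit interface; the checkpoint name and the data are illustrative:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative rows: (sentence1, sentence2, similarity score in [0, 1])
train_data = [("A man is eating food.", "A man is eating a meal.", 0.9),
              ("A man is eating food.", "The sky is blue.", 0.1)]
train_examples = [InputExample(texts=[s1, s2], label=score) for s1, s2, score in train_data]

model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# CosineSimilarityLoss encodes the two texts independently (bi-encoder)
# and pushes the cosine similarity of the pair toward the label.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)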
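
And for item 3, a sketch of the prefix idea; the 'category: sentence' format is just a convention I made up here, not something the library prescribes:

from sentence_transformers import InputExample

# Hypothetical rows: (sentence1, sentence2, textcategory1, label)
train_data = [("sentence one", "sentence two", "sports", 0.8)]
# Prepend the shared category to both sentences, then fine-tune on plain pairs as above
train_examples = [InputExample(texts=[f"{cat}: {s1}", f"{cat}: {s2}"], label=score)
                  for s1, s2, cat, score in train_data]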