Open cyriltw opened 1 year ago
What does textcategory1 tell you? Does it mean that both sentence1 and sentence2 belong to the category 'textcategory1', or just one of them?
It's a giveaway about the topic. The vector space will probably be more clearly separated by topic, but I guess it won't do much for the evaluation metrics.
I rarely use the sentence-transformers interface, so I can't tell. But in general, you need two separate sentences to train a bi-encoder, since it encodes each sentence independently and tries to minimize the distance between similar texts and maximize it between dissimilar ones.
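To make the "encode each sentence independently" point concrete, here is a minimal sketch. The encoder is a toy bag-of-words stand-in (an assumption for illustration, not a real model), and the function names are hypothetical. It only shows the shape of the computation: two separate encodings, then one similarity score that training would push up for similar pairs and down for dissimilar ones.

```python
import math
from collections import Counter


def encode(text):
    # Toy stand-in for a real sentence encoder: bag-of-words counts.
    # A real bi-encoder would return a dense vector from a neural model.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Each sentence is encoded on its own; only the resulting vectors are compared.
emb1 = encode("how do I reset my password")
emb2 = encode("I want to reset my password")
score = cosine_similarity(emb1, emb2)
```

Because the two inputs never see each other during encoding, any extra signal such as a category has to be part of one or both input texts.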
You can try prepending the textcategory to each sentence, such as 'textcategory: sentence1' and 'textcategory: sentence2', and then fine-tune as usual. This is inspired by the way instructor-base encodes a text.
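A minimal sketch of that prefixing step, in plain Python. The helper name add_category_prefix, the row layout, and the example rows are all assumptions made up for illustration; adapt them to however your data is actually stored.

```python
def add_category_prefix(sentence1, sentence2, category):
    # Put the same category prefix in front of both sentences so that
    # each one carries the topic when it is encoded on its own.
    prefix = f"{category}: "
    return prefix + sentence1, prefix + sentence2


# Hypothetical rows: (sentence1, sentence2, textcategory1, similarity label)
rows = [
    ("How do I reset my password?", "I forgot my login password.", "account", 1.0),
    ("Where is my parcel?", "How do I change my email?", "shipping", 0.0),
]

prefixed = [
    (*add_category_prefix(s1, s2, category), label)
    for s1, s2, category, label in rows
]

# Each (text1, text2, label) tuple can then go into the usual pair format,
# for example sentence-transformers' InputExample(texts=[text1, text2], label=label).
```

The training input stays an ordinary sentence pair, so no third field has to be handled by the training code.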
I have more information about each data point, such as language and contextual data, that could potentially help with our task. The task is to generate sentence similarity embeddings and labels.
For the time being, I was able to extend the input examples code to include these features in the input.
The textcategory1 value gets encoded as well, at the end of the input example, in the form sentence1[0];sentence2[0];textcategory1[0], with the fields separated by ;.