UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

How do I create my own domain-specific "stsb" dataset? #859

Open PhilipMay opened 3 years ago

PhilipMay commented 3 years ago

Hi, I would like to create my own domain-specific "stsb" dataset to further improve performance. I have a 500 GB domain-specific text corpus and want to select and label some of its sentence pairs.

Do you have any experience with / suggestions on how to apply an active learning process to select the pairs and maximize the benefit of the labeled data? Should I distribute the labels uniformly between 0.0 and 5.0, or pick more "equal" sentences?

Thanks Philip

nreimers commented 3 years ago

Hi @PhilipMay I have used active learning only for traditional classification tasks, and there the results were not so great. Getting active learning right is quite difficult, as you do not really know which example will benefit the model the most.

You also have an issue if the model that selects the samples for active learning is different from the model you later want to use: e.g. you use model A for active learning, and it would benefit the most from example (x, y), but later you use model B, for which example (x, y) is irrelevant.

Usually, when you select two random sentences (x, y), they have quite a low similarity, or none at all. So what you want is to find the few pairs that have some, or a high, degree of similarity.

For this you need some initial model; it can be TF-IDF or a pre-trained dense model. You retrieve similar pairs with this model and then label how similar you judge them to be. You can then use these annotated pairs to fine-tune your model.

It can be helpful to combine several models for retrieving your candidates, e.g. TF-IDF together with a pre-trained SBERT model.

This pre-selection of pairs introduces a bias: if you use only TF-IDF, you only get pairs with high lexical overlap => only pairs with high lexical overlap will be annotated as highly similar => your model will fail to learn that non-overlapping pairs can also be similar.
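The TF-IDF-based candidate mining described above can be sketched in plain Python. This is a toy implementation for illustration only; in practice one would use e.g. scikit-learn's `TfidfVectorizer` or a pre-trained SBERT model for retrieval, and all function names here are made up:

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Compute simple TF-IDF vectors for a list of tokenized sentences."""
    n = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for tokens in corpus:
        tf = Counter(tokens)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mine_pairs(sentences, top_k=10):
    """Return the top_k most similar sentence pairs as annotation candidates,
    as (score, i, j) tuples sorted by descending similarity."""
    corpus = [s.lower().split() for s in sentences]
    vecs = tfidf_vectors(corpus)
    pairs = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            pairs.append((cosine(vecs[i], vecs[j]), i, j))
    pairs.sort(reverse=True)
    return pairs[:top_k]
```

Random pairs score near zero here, so only the highest-scoring pairs would be sent to annotators; mixing in candidates from a dense retriever would counter the lexical-overlap bias discussed above.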

I would use the following setup:

Note: sometimes it can be hard to give a score 0-5 on how similar a pair is.

What many annotators find easier is a comparison: you select a triplet (x, y, z) with the above system and ask, given x, is sentence y or z more similar?
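This triplet-comparison annotation can be sketched as follows (the function names and data layout are hypothetical, not part of sentence-transformers):

```python
import random

def build_triplet_questions(anchors, neighbors, seed=0):
    """For each anchor sentence x, sample two of its retrieved candidates
    (y, z) and form the annotation question 'given x, is y or z more similar?'.
    `neighbors` maps each anchor to its list of candidate sentences."""
    rng = random.Random(seed)
    questions = []
    for x in anchors:
        cands = neighbors.get(x, [])
        if len(cands) < 2:
            continue
        y, z = rng.sample(cands, 2)
        questions.append((x, y, z))
    return questions

def to_training_triplets(questions, choices):
    """Turn annotator answers into (anchor, positive, negative) triplets,
    e.g. for a triplet-style loss. choices[i] is 'y' or 'z', naming which
    sentence the annotator judged closer to the anchor."""
    triplets = []
    for (x, y, z), choice in zip(questions, choices):
        pos, neg = (y, z) if choice == "y" else (z, y)
        triplets.append((x, pos, neg))
    return triplets
```

The resulting (anchor, positive, negative) triplets could then be fed to a triplet-style loss instead of a 0-5 regression target.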

In this paper: https://arxiv.org/pdf/2010.08240.pdf

we used a similar system to create the BWS dataset. Details on the annotation process are in the paper and its appendix.

PhilipMay commented 3 years ago

Awesome. Thanks again @nreimers for this explanation.

I was planning to use the T-Systems-onsite/cross-en-de-roberta-sentence-transformer model for preselection.

I want to "mine" more "stsb-like" data from my text corpus to mix into the training process.

PhilipMay commented 3 years ago

In my experience, the stsb labels (0 to 5) are much more useful for the model to learn from than plain "same / not same" labels or the "contradiction, neutral, entailment" labels.

conraddonau commented 2 years ago

Hey there @nreimers, I'm in a similar situation to @PhilipMay, except I want to adapt SBERT to generate embeddings of descriptions of technical skills. I would like to fine-tune SBERT using a combination of descriptions from the ESCO database (https://esco.ec.europa.eu/en/classification/skills?uri=http://data.europa.eu/esco/skill) and an ontology with the same format. Later, I plan to use the fine-tuned SBERT to cluster the skill descriptions.

  1. Would it be sensible to use the approach you described above (pick a random skill description A; use TF-IDF / word2vec / pre-trained SBERT to find the top X most similar descriptions; pick one of those at random as B; label the similarity between A and B on a 0-5 scale; repeat until enough training samples are labeled) and then fine-tune an existing sentence-transformer as in training_stsbenchmark_continue_training.py? (https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark_continue_training.py)

  2. How many training samples do I need before this even begins to make sense, and is there a loss function or training objective that would let me get by with fewer training samples? Thank you very much in advance :)
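For reference, the fine-tuning step in question 1 can be sketched roughly as follows. This mirrors the linked training_stsbenchmark_continue_training.py example (gold 0-5 scores normalized to [0, 1] for `CosineSimilarityLoss`); the helper names and hyperparameters here are illustrative, not prescribed:

```python
def normalize_score(score):
    """Map a gold STS score in [0, 5] to the [0, 1] cosine-similarity
    target that CosineSimilarityLoss expects."""
    return score / 5.0

def finetune_sts(model_name, labeled_pairs, epochs=1, batch_size=16):
    """Fine-tune a sentence-transformer on (sent_a, sent_b, score) pairs.
    Imports are local so the sketch can be read without the library
    installed."""
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer(model_name)
    examples = [
        InputExample(texts=[a, b], label=normalize_score(s))
        for a, b, s in labeled_pairs
    ]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    train_loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, train_loss)],
              epochs=epochs, warmup_steps=100)
    return model
```

With only a small labeled set, starting from an already strong pre-trained model (as in the continue-training example) matters more than the choice of loss.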

SwapnilDreams100 commented 2 years ago

Hey @conraddonau, did you pursue this idea further? I am working on a similar problem and would love to hear how this approach went!

conraddonau commented 2 years ago

Hi there @SwapnilDreams100 - no, I never got to it. The project I was working on went in a different direction, and I ended up using just a pretrained model for the task. So right back at you :) If you go down this road, I'd love to hear from you. Good luck!