UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Necessary steps to improve sentence embeddings of "distiluse-base-multilingual-cased"? #212

Open khalo-sa opened 4 years ago

khalo-sa commented 4 years ago

Hi Nils,

first of all, I would like to thank you for open-sourcing your code and the pretrained models related to your exciting research; I really appreciate your work 👍.

Specifically, I am fairly impressed by the quality of the embeddings created by the distiluse-base-multilingual-cased model, and the convenient interface you provided with your code.

For my use case, I primarily evaluated the quality of German sentence embeddings, and I think the performance is unprecedented. However, when it comes to more domain-specific (German) sentences, the performance predictably drops.

Now I wonder whether I could obtain either of the following models, and which steps would be necessary to do so.

(Prioritized) A model with the multilingual capabilities of "distiluse-base-multilingual-cased" but with improved embeddings of German sentences in general, and of custom-domain German sentences in particular.

(Optionally) A solely German model with better embeddings of German sentences in general (compared to distiluse-base-multilingual-cased), and of custom-domain German sentences in particular.

Regarding datasets, I learned from the documentation you wrote that I would need a German STS or NLI dataset to improve on German, right?

I found this resource that includes German STS data. The examples look like this:

```
de-T16    Ich habe die Körper gesehen .      Ich sah die Leichen .                 4.0
de-T28    Weißt du , warum du hier bist ?    Wissen Sie , wieso Sie hier sind ?    3.5
de-T35    Euch nahe Geister rufen wir .      Nahe Geister , kommt herbei .         4.0
```

I think this is what I need, right? Regarding the more domain-specific data, I'm afraid I have too few examples (~100), even for fine-tuning purposes, but at least I could use these as test data to measure the impact of fine-tuning on the general German STS data.
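If I understand correctly, I would parse such a file into training examples roughly like this (just a sketch, assuming a tab-separated layout and 0–5 scores as in the sample above; the filename and helper are hypothetical):

```python
from sentence_transformers import InputExample

def load_german_sts(path):
    """Read lines of the form: id <TAB> sentence1 <TAB> sentence2 <TAB> score (0-5)."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 4:
                continue  # skip malformed lines
            _id, sent1, sent2, score = parts
            # normalize the 0-5 score to [0, 1], as expected by e.g. CosineSimilarityLoss
            examples.append(InputExample(texts=[sent1, sent2], label=float(score) / 5.0))
    return examples

train_examples = load_german_sts("german_sts.tsv")  # hypothetical filename
```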

Best wishes!

nreimers commented 4 years ago

Hi @khalo-sa, happy to hear that you find this repo useful :)

There is also a third option: Extend a well working English model to German using the approach & code described here: https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/multilingual-models.md
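In a nutshell, that approach distills a well-working English teacher into a multilingual student using parallel (English–German) sentences. A minimal sketch, roughly following the linked doc (the model names and the parallel-data file are placeholders; see the doc for the full script):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# English teacher that already produces good sentence embeddings
teacher_model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

# Multilingual student: a pretrained multilingual transformer + mean pooling
word_embedding_model = models.Transformer("xlm-roberta-base")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Parallel data: tab-separated English<TAB>German sentence pairs
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("parallel-en-de.tsv")  # placeholder path
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)

# The student is trained to mimic the teacher's embeddings in both languages
train_loss = losses.MSELoss(model=student_model)
student_model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=5, warmup_steps=1000)
```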

The challenge in your case is, I think, the domain dependence. Sentence embeddings are usually general-purpose, i.e., they know how man & woman are related. But they lack good domain knowledge, e.g., they do not know that TensorFlow and PyTorch have a high similarity while the similarity between TensorFlow and React would be low.

Even with a perfect German NLI & STS dataset, you would not cover your domain-dependent information. The sentences in NLI & STS are rather simple, like 'A man plays a guitar'.

In order to improve the performance for your use case, I would try to improve the quality on your domain. For that, you would need some large-scale dataset from which you can infer whether two sentences are similar or not.

Often, this can be derived from some structure in your data. For computer science text, you could use the tag information from StackOverflow: two sentences / questions are related if they share tags. Then you could use triplet loss. Another example is SPECTER, which uses citation information from publications to train sentence embeddings: https://arxiv.org/abs/2004.07180
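As a rough sketch of the StackOverflow idea (the questions and tags below are made up): build triplets where the anchor and positive share a tag while the negative does not, then train with triplet loss:

```python
import random
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical corpus: each question comes with a set of tags
questions = [
    {"text": "How do I install PyTorch with CUDA?", "tags": {"pytorch", "cuda"}},
    {"text": "PyTorch model does not use my GPU",   "tags": {"pytorch", "gpu"}},
    {"text": "How to center a div in CSS?",         "tags": {"css", "html"}},
    # ... many more
]

train_examples = []
for anchor in questions:
    positives = [q for q in questions if q is not anchor and anchor["tags"] & q["tags"]]
    negatives = [q for q in questions if not (anchor["tags"] & q["tags"])]
    if positives and negatives:
        pos = random.choice(positives)
        neg = random.choice(negatives)
        train_examples.append(InputExample(texts=[anchor["text"], pos["text"], neg["text"]]))

model = SentenceTransformer("distiluse-base-multilingual-cased")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```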

khalo-sa commented 4 years ago

Hi, and thanks for the quick reply :) Let me try to see if I understand correctly.

> There is also a third option: Extend a well working English model to German using the approach & code described here: https://github.com/UKPLab/sentence-transformers/blob/master/docs/pretrained-models/multilingual-models.md

So the well-working English (teacher) model in this scenario is something like a base transformer language model (LM) such as "bert-base-uncased" that has been fine-tuned on English NLI and/or STS data? I guess this first step is what you report in this paper? Then, as a final step, I would train the model on German-English sentence pairs to extend the English sentence embedding capabilities to German.

Isn't this process exactly what you did to produce the "distiluse-base-multilingual-cased" model, just with more languages than German?

If I were only interested in German sentence embeddings, couldn't I simply take a pretrained German base transformer LM and train it solely on the German STS examples I mentioned before?

> In order to improve the performance for your use case, I would try to improve the quality on your domain

Let me shed some light on the data I'm working with. The challenge is to match user questions to already existing, publicly available FAQ questions.

> For that, you would need some large-scale dataset from which you can infer whether two sentences are similar or not.

I was afraid that this would be necessary, simply because I feel that I won't find such data. But I'm not completely sure which type of dataset you are thinking of. Do you mean a large number (> 10k?) of German-German sentence pairs from my domain, each labelled with a similarity score? Is it enough to have a discrete similarity measure (similar/dissimilar)?

Edit:

I think I have to correct what I said about the supposed German STS dataset. It is actually just a paraphrase dataset, meaning that each sentence pair is considered "similar", but without a score saying to which degree. I assume that this makes it unusable as training data for a sentence transformer, right?

nreimers commented 4 years ago

Regarding your edit: You could use triplet loss for that with a random negative example.
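A minimal sketch of that: each paraphrase pair becomes anchor + positive, and the negative is drawn at random from the other sentences (the pairs below are just the examples you quoted; a random negative can occasionally be a false negative, which is usually acceptable).

```python
import random
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Paraphrase pairs: each pair is known to be "similar", without a graded score
pairs = [
    ("Ich habe die Körper gesehen .", "Ich sah die Leichen ."),
    ("Weißt du , warum du hier bist ?", "Wissen Sie , wieso Sie hier sind ?"),
    # ... many more
]
all_sentences = [s for pair in pairs for s in pair]

train_examples = []
for anchor, positive in pairs:
    negative = random.choice(all_sentences)  # random sentence, assumed to be dissimilar
    train_examples.append(InputExample(texts=[anchor, positive, negative]))

model = SentenceTransformer("distiluse-base-multilingual-cased")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```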

Yes, you can of course train only on German data. But I am not sure whether STS data would be that helpful for your task.
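For completeness, a rough sketch of that German-only route, assuming a German BERT checkpoint (the model name is a placeholder) and STS pairs with scores normalized from 0–5 to [0, 1], keeping in mind the caveat below:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, models, losses

# German base transformer + mean pooling as the sentence embedding model
word_embedding_model = models.Transformer("bert-base-german-cased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# German STS pairs with scores normalized from 0-5 to 0-1
train_examples = [
    InputExample(texts=["Ich habe die Körper gesehen .", "Ich sah die Leichen ."], label=4.0 / 5.0),
    InputExample(texts=["Weißt du , warum du hier bist ?", "Wissen Sie , wieso Sie hier sind ?"], label=3.5 / 5.0),
    # ... many more
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=100)
```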

Often, STS data only covers rather generic sentences, e.g., "A man plays with a ball"; it seldom contains really specific sentences. A sentence embedding model trained on this data can of course not learn how these more specific terms are related to each other, since it has never seen them during training.

But these specific terms are often quite important for finding similar sentences. Domain-specific labeled data would be needed to learn the relations between these specific terms. Sadly, I cannot say exactly how much data you would need, but I think 1000+ pairs.

Similar / dissimilar labels would be sufficient; with those, you could train the model.
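A minimal sketch of training with such binary labels using contrastive loss (the question pairs are made-up placeholders for your FAQ domain):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Binary labels: 1 = similar (user question matches FAQ question), 0 = dissimilar
train_examples = [
    InputExample(texts=["Wie setze ich mein Passwort zurück?",
                        "Ich habe mein Passwort vergessen, was kann ich tun?"], label=1),
    InputExample(texts=["Wie setze ich mein Passwort zurück?",
                        "Welche Zahlungsmethoden werden akzeptiert?"], label=0),
    # ... many more, ideally 1000+ pairs
]

model = SentenceTransformer("distiluse-base-multilingual-cased")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.ContrastiveLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```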