UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Behaviour for non-relevant queries #602

Open datistiquo opened 3 years ago

datistiquo commented 3 years ago

Hey,

I am building an information retrieval system by fine-tuning a siamese BERT model on FAQ data in the form (question, answer, label) and using cosine similarity between query and document. Question and answer are actually similar sentences. This works pretty well. My issue is that the model gives high probability to some non-relevant and out-of-scope queries (for a specific document in my pool). My thinking was that using BERT would help here, because a pretrained language model should reduce this kind of overfitting. I wonder how I can handle this situation? Is it better to train with a triplet loss? Using contrastive loss did a better job.

Is there any advice on making a model more robust so that it gives low probabilities to truly non-relevant, out-of-scope subjects?

But actually I have difficulty imagining and understanding why this happens. If you have the embeddings of both a query and a document and perform cosine similarity or any other operation in this vector space (like clustering), I would assume that all relevant docs are nearby, like in this pretty picture:

[image: illustration of an embedding space in which related documents cluster near each other]

Why, then, does any model trained on a specific FAQ domain always have such issues with queries from another domain or totally out-of-scope inputs like the words "and" and "hello"?

nreimers commented 3 years ago

Hi @datistiquo I think one reason that one-word queries score high can be the averaging.

The query "and" is internally represented as: [CLS] and [SEP]

and "hello" as: [CLS] hello [SEP]

For the averaging of the embeddings, the outputs for CLS and SEP are also used. So this might be the reason why docs have high scores even for an unrelated query, as every entry has the CLS and SEP tokens in its average.

I think the best loss for training retrieval models is: https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss
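A minimal sketch of fine-tuning with this loss, using the standard sentence-transformers training API (the checkpoint name and the FAQ pairs below are just placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder checkpoint; any pretrained transformer can be used here.
model = SentenceTransformer("distilbert-base-uncased")

# Only (query, relevant_answer) pairs are needed. The other answers in the
# same batch automatically serve as negatives.
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "Click 'Forgot password' on the login page."]),
    InputExample(texts=["What are your opening hours?",
                        "We are open Monday to Friday, 9am to 5pm."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=100)
```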

What works well is to combine the retrieval model with a re-ranking model based on a Cross-Encoder: https://www.sbert.net/examples/applications/information-retrieval/README.html

The cross-encoder will filter out such bad, unrelated docs.
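A sketch of that retrieve-then-re-rank setup; the bi-encoder and cross-encoder checkpoints below are just examples, and the docs are made up:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")          # example model
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # example model

docs = ["Click 'Forgot password' on the login page.",
        "We are open Monday to Friday, 9am to 5pm.",
        "Shipping usually takes 3-5 business days."]
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)

query = "How do I reset my password?"
query_emb = bi_encoder.encode(query, convert_to_tensor=True)

# Stage 1: fast candidate retrieval with the bi-encoder
hits = util.semantic_search(query_emb, doc_emb, top_k=3)[0]

# Stage 2: re-score each (query, candidate) pair with the cross-encoder,
# which can push unrelated candidates far down the ranking
pairs = [(query, docs[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
for hit, score in sorted(zip(hits, scores), key=lambda x: x[1], reverse=True):
    print(round(float(score), 3), docs[hit["corpus_id"]])
```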

But note, the cosine similarity score itself is not that meaningful. If the score is 0.5, you cannot conclude from this number alone whether it is high or not. What is always more relevant are the relative scores, i.e. the scores for the other docs.

If you have a query without any matching docs, the retrieval will still find some docs and return some scores. The scale of the score depends on the shape of the vector space. If all vectors are concentrated in one area, the cos sim score will be quite high.
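A small illustration of this (model name again just an example): even an out-of-scope query like "hello" gets a ranked list back, and the absolute numbers are only meaningful relative to each other:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # example model

docs = ["Click 'Forgot password' on the login page.",
        "We are open Monday to Friday, 9am to 5pm.",
        "Shipping usually takes 3-5 business days."]
doc_emb = model.encode(docs, convert_to_tensor=True)

for query in ["How do I reset my password?", "hello"]:
    scores = util.cos_sim(model.encode(query, convert_to_tensor=True), doc_emb)[0]
    # Even "hello" produces a ranking with a top score; whether that score is
    # "high" can only be judged relative to the other docs and other queries.
    print(query, sorted(scores.tolist(), reverse=True))
```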

For more details on this, see: https://www.aclweb.org/anthology/D19-1006/ https://www.aclweb.org/anthology/2020.emnlp-main.733/

datistiquo commented 3 years ago

Ok, I will try to remove these tokens (I already tried removing the CLS token, but this did not improve things significantly).

But note, the cosine similarity score itself is not that meaningful. If the score is 0.5, you cannot conclude from this number alone whether it is high or not. What is always more relevant are the relative scores, i.e. the scores for the other docs.

Yes, for retrieval I always look at the ranking scores. Sadly, non-relevant queries can still have high probabilities for some docs at the first rank. So just having a threshold to declare a query as non-relevant is not working that well.

The scale of the score depends on the shape of the vector space. If all vectors are concentrated in one area, the cos sim score will be quite high.

This reads as if it were impossible to properly train a retrieval model on a specific domain?

I thought training with a loss like contrastive or triplet loss with negative examples would reduce this issue?

I think the best loss for training retrieval models is: https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss

Is this similar to the triplet loss? I generate the negative examples on my own from the data and train with a contrastive or a triplet loss.

nreimers commented 3 years ago

The MultipleNegativesRankingLoss usually works much better than contrastive / triplet loss, as it constructs many triplets within a batch, so the learning signal is maximized.

datistiquo commented 3 years ago

, as it constructs many triplets within a batch, so the learning signal is maximized

That sounds interesting. I will try this.

Could you use this loss when you already have some structure with positive and negative examples?

The scale of the score depends on the shape of the vector space. If all vectors are concentrated in one area, the cos sim score will be quite high.

This reads as if it were impossible to properly train a retrieval model on a specific domain?

I already knew this behaviour from using simple word embedding approaches with NNs. But I thought models like BERT would capture this when trained on a domain, so that they can handle unrelated topics?

Any idea how to handle this (besides using a classifier beforehand)?

datistiquo commented 3 years ago

@nreimers Thank you. I will test this loss.

I saw that you have also created the training part for the bi-encoder, which I have been doing on my own with huggingface until now. So shame on me that I did not use sbert! :)

There you use just one max length for both encoders. Would it be suitable to use a different max length for each? If you have a short query and a longer answer text, then I would like to use a shorter max length for the query embeddings. Does it actually make a difference to train short queries with a much longer max length? I assume the default for the transformers is to ignore the masked values in the attention mechanism?

I saw you do not use any positive-to-negative ratio there, whereas for training the cross-encoder you did. Any reason for this?

Also, the dataset already contains the negative examples.

Negative passages are hard negative examples that were retrieved by lexical search.

What is actually meant by "hard negatives"?

I am very excited to (hopefully) see improvements in my case! :)

nreimers commented 3 years ago

There you use just one max length for both encoders. Would it be suitable to use a different max length for each? If you have a short query and a longer answer text, then I would like to use a shorter max length for the query embeddings.

This is not really needed. The text is padded to the minimal needed length, independent of what you set as max length. Max length just truncates inputs that are too long. If you always have short queries, this parameter will not play a role at all.
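A quick way to check this with a plain HuggingFace tokenizer (the checkpoint is just an example): padding is dynamic per batch, the attention mask marks the pad positions that attention ignores, and max_length only truncates:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

batch = tokenizer(
    ["hello", "how do I reset my password"],
    padding="longest",   # pad only to the longest sequence in this batch
    truncation=True,
    max_length=128,      # would only cut inputs longer than 128 tokens
    return_tensors="pt",
)

print(batch["input_ids"].shape)    # e.g. (2, 8) rather than (2, 128)
print(batch["attention_mask"])     # 0s mark pad positions the model ignores
```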

I saw you do not use any positive-to-negative ratio there, whereas for training the cross-encoder you did. Any reason for this?

The loss (MultipleNegativesRankingLoss) uses all other examples in a batch as negatives.

If you have 32 triplets (query, answer, hard_negative), then all other answers and hard_negatives are used as negatives, so each query has 1 positive and 63 negatives in that batch.
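A sketch of where the 63 comes from, assuming a batch of 32 triplets and a placeholder bi-encoder; MultipleNegativesRankingLoss scores each query against every answer and hard negative in the batch:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distilbert-base-uncased")  # placeholder checkpoint

queries        = [f"query {i}" for i in range(32)]
answers        = [f"answer {i}" for i in range(32)]
hard_negatives = [f"hard negative {i}" for i in range(32)]

q    = model.encode(queries, convert_to_tensor=True)                    # (32, dim)
cand = model.encode(answers + hard_negatives, convert_to_tensor=True)   # (64, dim)

# The loss builds a (32, 64) similarity matrix (cosine similarity by default).
# For query i, column i is its single positive; the other 63 columns
# (31 other answers + 32 hard negatives) act as negatives in a
# cross-entropy objective over each row.
scores = util.cos_sim(q, cand)
print(scores.shape)  # torch.Size([32, 64])
```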

What is actually meant by "hard negatives"?

Hard negatives are negatives that are hard to differentiate from the positive. If you have a query like "how many people live in London", and you choose any other text as your negative, it will likely not talk about London but maybe about oxygen. So for the network it is easy to identify what the correct answer is, i.e. the random negative is easy.

A hard negative will be similar to the correct answer. For example, it talks about the history of London. This forces the model to learn better representations that actually match the query (London + inhabitants).
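A tiny illustration of easy vs. hard negatives as a training triplet (the texts are made up):

```python
from sentence_transformers import InputExample

query    = "How many people live in London?"
positive = "London has a population of roughly 9 million people."

easy_negative = "Oxygen is a chemical element with the symbol O."          # unrelated, trivially rejected
hard_negative = "London was founded by the Romans almost 2000 years ago."  # on-topic, but not an answer

# Training with the hard negative forces the model to represent the actual
# information need (population of London), not just topical overlap with "London".
triplet = InputExample(texts=[query, positive, hard_negative])
```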

datistiquo commented 3 years ago

If you have 32 triplets (query, answer, hard_negative), then all other answers and hard_negatives are used as negatives, so each query has 1 positive and 63 negatives in that batch.

Ok, this explains a lot. Since you also used negatives as input in your bi-encoder example, I thought the loss would only use those negative examples, because I assumed that you just hand over positive examples and it creates the negative ones. Are there any intro examples for this loss? With your explanation above, I think this is the cause of my problem here: https://github.com/UKPLab/sentence-transformers/issues/606

1 positive and 63 negatives in that batch.

How do you get this number? I feel this explains why my training data explodes so massively...

That is actually bad; can you restrict this to just a few negative examples?

nreimers commented 3 years ago

https://arxiv.org/pdf/1705.00652.pdf

There they describe the loss

datistiquo commented 3 years ago

Just out of interest, before I use it: is there any example for any of the triplet losses in your repo? It seems there is none.

datistiquo commented 3 years ago

@nreimers

Is it possible to use the MultipleNegativesRankingLoss with negatives that do not come from the positive examples? I would like to add additional negatives randomly from other sources. I think the only way is to stick with the triplet form, but instead of using hard negatives I supply other negatives?

Is it also possible to somehow customize the ratio of negatives used in each batch? I feel that so many negatives make learning difficult. I had good experience with smaller ratios like 5 (similar to what you did for training the IR cross-encoder).

:)

btw: is it somehow possible to chat about such questions on another platform like gitter?

datistiquo commented 3 years ago

This is not really needed. The text is padded to the minimal needed length, independent of what you set as max length. Max length just truncates inputs that are too long. If you always have short queries, this parameter will not play a role at all.

I was referring to training a bi-encoder where you could potentially have two models, one to encode each side, or use the same model but with different max lengths. Because you of course get many padding values for the short queries. That's also why I asked about the padding values (how they influence the attention)...