UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Unexpected high similarity #14

Open chikubee opened 5 years ago

chikubee commented 5 years ago

I am using the bert-base-nli-stsb-mean-tokens model in an unsupervised fashion to get the similarity between sentences. It performs really well in some cases, but on doing more extensive analysis, I found cases where such a high similarity score makes no sense.

I am trying to figure out why the similarity is so high for sentence pairs that are extremely short or make no sense at all. What is really happening here? Any leads would be helpful.
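For context, the similarities are computed roughly like this (the exact snippet is not in the issue, so treat this as an assumed setup with hypothetical example sentences, not the author's actual code):

```python
# Assumed setup: encode sentences with sentence-transformers and compare them
# with cosine similarity. The example pair below is hypothetical.
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

sentences = ["what is the notice period", "ok"]  # hypothetical short/noisy pair
emb = model.encode(sentences)
print(1 - cosine(emb[0], emb[1]))  # cosine similarity, unexpectedly high for noise
```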

Thanks in advance. For your reference:

[Screenshot attached: Screenshot 2019-08-22 at 6.37.33 PM]
nreimers commented 5 years ago

@chikubee Thanks for pointing this out.

I also noticed that it performs poorly if the domain shift is too large. BERT was trained on nice, clean, and long sentences from Wikipedia, and BERT-NLI was fine-tuned on nice, clean sentences (from the NLI dataset). So during training, neither BERT nor this sentence-embedding version was confronted with noisy data or with short sentences. In those cases, I expect the behaviour to be quite unpredictable, leading to non-sensible embeddings.

For short phrases (up to maybe 5 words), I think the best approach is to use average word embeddings, or even better, word embeddings that were trained on phrases or bigrams. Examples are the Google News word2vec embeddings or these embeddings (haven't tested them): https://www.kaggle.com/s4sarath/word2vec-unigram-bigrams-
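Concretely, the averaging baseline looks roughly like this (a sketch assuming gensim and the pre-trained Google News vectors; not code from this repository):

```python
# Minimal sketch of the "average word embeddings" baseline for short phrases,
# assuming gensim's downloader and the Google News word2vec vectors.
import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")  # large one-time download

def phrase_vector(phrase: str) -> np.ndarray:
    """Average the word vectors of all in-vocabulary words in the phrase."""
    vecs = [w2v[w] for w in phrase.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Hypothetical short phrases for illustration:
print(cosine(phrase_vector("annual salary"), phrase_vector("yearly pay")))
```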

I am currently working on training setups that also train BERT on noisy data, which will hopefully lead to better embeddings for noisy input. But it will take some time until that training setup is finished.

chikubee commented 5 years ago

@nreimers Thanks for the explanation. It works well for long sentences, and to some extent for short sentences too, but for noise like this it fails.

I also expected the cause to be the training on a specific kind and length of data.

In BERT, I can fine-tune the model with a given max sequence length, and it works quite well for classification tasks with short sentences.

Since this model for semantic similarity promises to work well in unsupervised scenarios, it has to be robust to different kinds of input data. Good to know that you are working on the variability and not keeping the model stringent.

Since the variability in my dataset is quite high (the average sentence length is 7, but there are sentences with lengths in the range 10-80), I would not want to go with averaging word2vec embeddings. What else would you suggest?

Thanks again.

chikubee commented 5 years ago

@nreimers Hi, I was wondering if you could help me understand why there is extremely low similarity for some genuinely similar cases. Sentence 1: "I am a salaried person". Sentence 2: "I am a person who gets salary". The similarity between the tokens salaried and salary is extremely low. While Sentence 2 actually matches other sentences in the corpora relevant to salary, Sentence 1 does not. Does it have something to do with the word-pieces, since salaried gets broken into sal ##ari ##ed?
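The word-piece split can be checked directly (a sketch assuming the Hugging Face transformers package; the exact split depends on the vocabulary of the checkpoint you use):

```python
# Inspect how BERT's WordPiece tokenizer splits these words.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("salaried"))  # typically several word-pieces
print(tokenizer.tokenize("salary"))    # typically a single piece
```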

nreimers commented 5 years ago

@chikubee I think that is an interesting observation.

word2vec & Co. were trained based on the distributional hypothesis: words in similar contexts share a similar meaning. Hence, salary and salaried should get quite similar word embeddings there.

For BERT, there are two issues. First, as you mentioned, the word-pieces can be quite different for small variations of a word. Second, it was not trained with the distributional hypothesis in mind, so I would not expect the word embeddings from the BERT output to match well.
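A quick way to check the word2vec side of this claim (a sketch assuming gensim and the Google News vectors; not something run in this thread):

```python
# Compare "salary" and "salaried" with pre-trained word2vec vectors.
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")
if "salary" in w2v and "salaried" in w2v:
    print(w2v.similarity("salary", "salaried"))  # expected to be fairly high
```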

chikubee commented 5 years ago

@nreimers I did not understand why you would not expect them to have similar word embeddings. word2vec also looks at the words surrounding the key term to learn its embedding, and so does BERT: its prediction of a masked word still comes from the other words. Also, I was thinking: is there some other way the word-pieces can be combined to get the right token vector? Maybe that's the place where nobody is scratching. spaCy Transformers released word alignments in this regard, but that's a weighted sum, and it doesn't work either.

nreimers commented 5 years ago

@chikubee It would be interesting to study the behaviour of the word embeddings from BERT with respect to similarity.

There are two different BERT models: one that masks individual word-pieces and one that masks whole words.

If 'salary' is one token and 'sal ##ari ##ed' are three tokens, then in the first case maybe only ##ari is masked while sal and ##ed are not. In the second case, whole-word masking, all 3 pieces of salaried are masked.

The question then is how to compare the similarity between 'salary' and 'sal ##ari ##ed'. Maybe average over the 3 sub-pieces? Or just use the first piece? Or do some max-pooling?

Further, the token ##ed is not exclusive to the word salaried; the embedding for ##ed is used for every word that ends in ##ed. Hence, there is no strict reason why ##ed should be close to the token 'salary'. The same holds for the pieces 'sal' and '##ari': they appear in many different words, so their output can be anywhere in vector space, not necessarily close to 'salary'.

I could imagine that you would need to learn some smart function that maps ['sal', '##ari', '##ed'] to one vector that is close to the vector for 'salary'.

Note that 'salary' is also only one piece in the WordPiece tokenizer, i.e. 'salaryman' will be broken down into ['salary', '##man']. From the token 'salary' alone, we don't know whether the word was 'salary', 'salaryman', 'salarymeningitis', or whatever other word starts with 'salary'. In conclusion, this poses an extremely hard problem for generating word embeddings such that the embeddings for individual words are similar.
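As a rough illustration of the averaging option above, here is one way to pool the word-piece vectors from BERT into a single word vector (a sketch assuming the Hugging Face transformers package; mean-pooling is only one of the possible choices discussed here):

```python
# Mean-pool BERT's word-piece output vectors into one vector per word,
# then compare two such word vectors with cosine similarity.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Mean-pool the output vectors of all word-pieces that make up `word`."""
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]     # (seq_len, hidden_dim)
    pieces = tokenizer.tokenize(word)
    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
    for i in range(len(tokens) - len(pieces) + 1):          # locate the word's pieces
        if tokens[i:i + len(pieces)] == pieces:
            return hidden[i:i + len(pieces)].mean(dim=0)
    raise ValueError(f"{word!r} not found in {sentence!r}")

v1 = word_vector("I am a salaried person", "salaried")
v2 = word_vector("I am a person who gets salary", "salary")
print(torch.cosine_similarity(v1, v2, dim=0).item())
```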

I think ELMo might be the better approach for your task: in ELMo, tokens are kept intact, which avoids many of these problems with the sub-pieces.

chikubee commented 5 years ago

Thanks for the explanation @nreimers. Yeah, you're right about ELMo.

But knowing how rich BERT embeddings can be, I want to make the most out of them. I will try the whole-word-masking model; in that case, at least a word-piece token would be learned in the presence of the other word-pieces of a particular token. I will try to interpret the embeddings with respect to similarity and come back to you if I find something.
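For anyone following along, a whole-word-masking checkpoint is published on the Hugging Face model hub and could be swapped into the pooling sketch above (the model name below is assumed from the public hub, not verified in this thread):

```python
# Assumed whole-word-masking checkpoint; drop-in replacement for bert-base-uncased.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")
model = BertModel.from_pretrained("bert-large-uncased-whole-word-masking")
```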

Thanks again.

nreimers commented 5 years ago

That would be interesting, @chikubee.

Let me know when you find something of interest.

dheerajiiitv commented 4 years ago

> I am using the bert-base-nli-stsb-mean-tokens model in an unsupervised fashion to get the similarity between sentences. It performs really well in some cases, but on doing more extensive analysis, I found cases where such a high similarity score makes no sense.
>
> I am trying to figure out why the similarity is so high for sentence pairs that are extremely short or make no sense at all. What is really happening here? Any leads would be helpful.
>
> Thanks in advance. For your reference:
>
> [Screenshot attached: Screenshot 2019-08-22 at 6.37.33 PM]

@nreimers @chikubee Isn't the reason for the high similarity that both sentences are too short, so the model cannot place them properly in the vector space? The same goes for the test query; I think that is why it gives a high similarity.

devilteo911 commented 1 year ago

Since this is still marked as open, I am bumping the issue again. I am facing a similar problem with a noisy dataset of sentences of variable length. Sometimes I get correct matches between sentences; sometimes I get really high confidence scores (even in the 0.96-0.97 range), but when I look at the paired sentences, I cannot find a single clue as to why they were paired.