allenai / scibert

A BERT model for scientific text.
https://arxiv.org/abs/1903.10676
Apache License 2.0

Context embedding shows anomaly, independent of sentence and token #68

Open · RommeZetaAlpha opened 4 years ago

RommeZetaAlpha commented 4 years ago

Problem

While doing some analysis on the pre-trained SciBERT transformer networks, we found an anomaly in the contextual embedding at index 422. Further tests showed that this anomaly is present regardless of the context, the specific token, or its position. We think that distance metrics in the contextualized embedding space, such as cosine similarity, are heavily dominated by this exploding component.
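As a toy illustration of why (made-up numbers, not model output): a single large shared component can dominate the cosine similarity between two otherwise unrelated vectors.

```python
# Toy illustration (made-up numbers): one huge shared component dominates
# cosine similarity between two otherwise unrelated random vectors.
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(768), rng.standard_normal(768)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b))     # ~0 for independent random vectors

a[422] = b[422] = 50.0  # inject an "exploding" component at index 422
print(cosine(a, b))     # jumps to ~0.77, dominated by the shared component
```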

scibert-scivocab-uncased

To see the actual contextual representation of words, we ran inference on a sentence using the pretrained model, in both a TF and a PyTorch version, to check that the results were similar. Below is a crude representation of the embedding space (word + positional embedding): a linear plot of the 768 components of the embedding for the word linear in the context of a regular sentence. The values are nicely distributed around 0, peaking just a few times above 2 or below -2, which is what one would expect the embedding space to look like.

[image: embedding1]

However, if we look at the contextualized embedding after the last layer of the SciBERT model, we find it has turned into the representation below. The representation "explodes" at a particular index (422), very consistently across tokens and sentences.

[image: embedding2]

Here the token linear is used in the context of the sentence: How are linear regression and gradient descent related, is gradient descent a type of linear regression, and is it similar to ordinary least squares (OLS) and generalized least squares (GLS)?
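For reference, here is a minimal sketch of the kind of inspection we performed, assuming the HuggingFace transformers checkpoint allenai/scibert_scivocab_uncased (our original analysis used separate TF and PyTorch exports of the model, so details may differ):

```python
# Minimal sketch: inspect the input vs. last-layer embedding of one token.
# Assumes the HuggingFace checkpoint allenai/scibert_scivocab_uncased.
import torch
from transformers import AutoModel, AutoTokenizer

name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

sentence = ("How are linear regression and gradient descent related, is gradient "
            "descent a type of linear regression, and is it similar to ordinary "
            "least squares (OLS) and generalized least squares (GLS)?")
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states  # tuple: embedding layer + one entry per layer

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
pos = tokens.index("linear")       # first occurrence of the token
input_emb = hidden[0][0, pos]      # embedding-layer output, shape (768,)
context_emb = hidden[-1][0, pos]   # last-layer output, shape (768,)

peak = context_emb.abs().argmax().item()
print("max |component| in input embedding:", input_emb.abs().max().item())
print(f"contextual embedding peaks at index {peak}: {context_emb[peak].item():.2f}")
```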

scibert-scivocab-cased

We performed the same analysis using the cased version of the model, which gives the following results. Here we have the token difference for the sentence: Computer Vision: What is the difference between HOG and SIFT feature descriptor? The exploding index in this embedding is 421, meaning that the anomaly occurs at a different position.

[image: Captura-de-Pantalla-2019-12-03-a-les-10-13-19]

To compare with the uncased model, we did the same for the token linear in the sentence How are linear regression and gradient descent related, is gradient descent a type of linear regression, and is it similar to ordinary least squares (OLS) and generalized least squares (GLS)? Here we see that the anomaly is again consistent; this also holds for other examples.

[image: Captura-de-Pantalla-2019-12-03-a-les-10-16-39]

From this we see that a similar problem occurs in both the cased and the uncased model, yet at a different index in each, which suggests that something odd is happening in the training process. Again, we have tested this behavior on many other tokens and sentences, and it is consistent across all circumstances; a sketch of the kind of check we ran is below.
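Concretely (a reconstruction of the check, not our exact script; checkpoint names as above):

```python
# Consistency check: for every token in a sentence, print the index of the
# largest-magnitude component of its last-layer embedding. A dominant
# dimension shows up as (almost) the same index for every token.
import torch
from transformers import AutoModel, AutoTokenizer

name = "allenai/scibert_scivocab_uncased"  # or allenai/scibert_scivocab_cased
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

sentence = "Computer Vision: What is the difference between HOG and SIFT feature descriptor?"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    last_layer = model(**inputs).hidden_states[-1][0]  # (seq_len, 768)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, vec in zip(tokens, last_layer):
    print(f"{tok:>15s} -> peak index {vec.abs().argmax().item()}")
```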

Could you tell us whether this behavior is expected, and if so, why? Otherwise, could you explain what causes this behavior and how it can potentially be overcome?

Thanks in advance!

ibeltagy commented 4 years ago

Interesting. I have seen the same pattern while training transformers for another project. I don't know why this is happening, but it doesn't seem to be a bug.

ibeltagy commented 4 years ago

Can you try this for regular BERT and see if you get the same pattern?

RommeZetaAlpha commented 4 years ago

We also tried this for the regular BERT uncased model. This gives us the following results.
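Only the checkpoint name changes in the sketches above (assuming the standard HuggingFace name bert-base-uncased); everything else stays the same:

```python
# Same analysis, vanilla BERT (assumed HuggingFace checkpoint name).
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()
```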

The first image shows the word difference in the context Computer Vision: What is the difference between HOG and SIFT feature descriptor?. This gives a clear peak at index 308. This happens for other tokens as well, but not as consistently as in SciBERT.

[image: difference-bert]

Another example: the word linear in the context How are linear regression and gradient descent related, is gradient descent a type of linear regression, and is it similar to ordinary least squares (OLS) and generalized least squares (GLS)? produces an output much closer to what one would expect.

[image: linear-bert]

Besides the results for BERT, we found some tokens for which the pattern doesn't occur when using SciBERT, for example the symbols "),?" and the token The.

[image: the-scibert]

Conclusion

We see a pattern in both BERT and SciBERT, but in BERT it is not as strong, so we expect it to have less impact on the similarity metrics/distances. Not only is it less consistent, BERT also shows many more peaks, which makes any single consistent peak matter less.

Do you have any idea what this could be caused by and what the possible implications are?

Thanks!

RommeZetaAlpha commented 4 years ago

Have you had a chance to look at this?

ibeltagy commented 4 years ago

As I said, I don't think this is a bug; it is just how the model decided to represent your tokens. As for the similarity measures, maybe normalizing the vectors somehow will prevent this dimension from dominating the score.
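For example (just one possible way to do it, untested, not a vetted fix): standardize each dimension across a sample of contextual embeddings before computing cosine similarity, so no single dimension can dominate.

```python
# Sketch of per-dimension standardization (an untested suggestion): z-score
# each dimension across a sample of contextual embeddings so a single large
# dimension no longer dominates cosine similarity.
import numpy as np

def standardize(embs, eps=1e-8):
    """embs: array of shape (n_vectors, dim); returns per-dimension z-scores."""
    mu = embs.mean(axis=0)
    sigma = embs.std(axis=0)
    return (embs - mu) / (sigma + eps)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Usage: collect embeddings for many tokens into a matrix, standardize once,
# then compare rows of the standardized matrix with cosine().
```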