
What Does BERT Look At? #68

Closed YeonwooSung closed 4 months ago

YeonwooSung commented 4 months ago

paper

Abstract

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention.

In short

If you visualize the attention maps of each BERT layer, you will find that heads in the first few layers mostly attend to the [CLS] token, while heads in the later layers mostly attend to the [SEP] token.

Most papers note that all the other tokens attend to the [CLS] token during self-attention, so the output representation at the [CLS] position ends up encoding the semantics of the whole input sentence.
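
A minimal sketch of how to check this pattern yourself, assuming HuggingFace transformers and the `bert-base-uncased` checkpoint (both my choice for illustration, not specified in the thread): request the attention tensors and measure how much attention mass each layer puts on [CLS] and [SEP].

```python
# Sketch: per-layer attention mass directed at [CLS] and [SEP].
# Model name and example sentence are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

sentence = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(sentence, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
cls_idx, sep_idx = tokens.index("[CLS]"), tokens.index("[SEP]")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq_len, seq_len) tensor per layer
for layer_idx, attn in enumerate(outputs.attentions):
    attn = attn[0]                              # (heads, seq_len, seq_len)
    to_cls = attn[:, :, cls_idx].mean().item()  # avg attention mass on [CLS]
    to_sep = attn[:, :, sep_idx].mean().item()  # avg attention mass on [SEP]
    print(f"layer {layer_idx:2d}: ->[CLS] {to_cls:.3f}  ->[SEP] {to_sep:.3f}")
```

If the pattern described above holds, the `->[CLS]` column should dominate in the early layers and the `->[SEP]` column in the later ones.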

YeonwooSung commented 4 months ago

In BERT, the input text is fed in as [CLS] ... [SEP]. During self-attention, all the other tokens attend to the [CLS] token, so in the output the representation at the [CLS] position ends up implicitly encoding the meaning of the entire sentence.
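
A minimal sketch of using the [CLS] position this way, again assuming HuggingFace transformers and `bert-base-uncased`: take the final hidden state at position 0 ([CLS]) as a sentence vector. Without fine-tuning this is only a rough sentence representation; in practice BERT is fine-tuned with a classification head on top of [CLS].

```python
# Sketch: use the final [CLS] hidden state as a sentence vector and
# compare two sentences with cosine similarity (illustrative only).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def cls_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    return hidden[0, 0]                             # position 0 is [CLS]

a = cls_embedding("The cat sat on the mat.")
b = cls_embedding("A cat is sitting on a mat.")
print(torch.cosine_similarity(a, b, dim=0).item())
```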

As for [SEP], however, there is still no clear answer (as of March 2024) as to exactly why the later layers attend to it so heavily.

YeonwooSung commented 4 months ago

In the paper TOWARD UNDERSTANDING TRANSFORMER BASED SELF SUPERVISED MODEL, the authors state that the main reason for the high attention to [SEP] is a "no-op".

According to that paper, when an attention head finds no token that is particularly important for the sentence embedding, it simply dumps its attention on the [SEP] token, effectively performing no operation.
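
One rough way to sanity-check the no-op reading (my own sketch, not an experiment from either paper): zero out the value vectors at the [SEP] position via forward hooks, so any attention placed on [SEP] mixes in nothing, and see how little the other tokens' final representations change. The module path `encoder.layer[i].attention.self.value` assumes the standard HuggingFace `BertModel` layout.

```python
# Sketch: if attending to [SEP] is a no-op, ablating [SEP]'s value vectors
# should barely change the other tokens' final representations.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.",
                   return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
sep_idx = tokens.index("[SEP]")

with torch.no_grad():
    baseline = model(**inputs).last_hidden_state[0]   # (seq_len, hidden)

def zero_sep_values(module, hook_inputs, output):
    # output: (batch, seq_len, hidden) from the value projection;
    # wiping the [SEP] row means attention on [SEP] mixes in nothing.
    output = output.clone()
    output[:, sep_idx, :] = 0.0
    return output

handles = [layer.attention.self.value.register_forward_hook(zero_sep_values)
           for layer in model.encoder.layer]
with torch.no_grad():
    ablated = model(**inputs).last_hidden_state[0]
for h in handles:
    h.remove()

# Compare per-token representations, excluding [SEP] itself.
keep = [i for i in range(len(tokens)) if i != sep_idx]
sim = torch.cosine_similarity(baseline[keep], ablated[keep], dim=-1)
print("mean cosine similarity after ablating [SEP] values:", sim.mean().item())
```

If the no-op interpretation holds, the mean similarity should stay close to 1 despite the heavy attention that the later layers place on [SEP].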