The Introduction of the paper contains the following paragraph.
> However, the artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy. Moreover, since the predicted tokens are masked in the input, BERT is not able to model the joint probability using the product rule as in AR language modeling. In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified as high-order, long-range dependency is prevalent in natural language.
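
For reference, here is a rough sketch of the two objectives this paragraph contrasts. The notation follows Section 2.1 of the XLNet paper as I understand it ($\hat{\mathbf{x}}$ for the corrupted input, $\bar{\mathbf{x}}$ for the set of masked tokens, $m_t$ for a mask indicator), so treat it as an illustration rather than an exact transcription:

```latex
% AR language modeling: the joint probability factorizes exactly
% by the product rule over the forward ordering.
\max_{\theta}\; \log p_{\theta}(\mathbf{x})
  = \sum_{t=1}^{T} \log p_{\theta}\!\left(x_t \mid \mathbf{x}_{<t}\right)

% BERT-style masked LM: the masked tokens \bar{x} are each predicted
% from the corrupted input \hat{x} alone, so the equality only holds
% approximately (m_t = 1 if x_t is masked, 0 otherwise).
\max_{\theta}\; \log p_{\theta}\!\left(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}\right)
  \approx \sum_{t=1}^{T} m_t \, \log p_{\theta}\!\left(x_t \mid \hat{\mathbf{x}}\right)
```

The $\approx$ in the second objective is exactly the independence assumption the paragraph points out: each masked token is predicted from the corrupted input alone, without conditioning on the other masked tokens.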
❓ Question
📄 References