NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference
Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina
Year of Publication
2019
Summary
This paper presents a new language representation model named BERT: Bidirectional Encoder Representations from Transformers. At publication it was the state of the art for many natural language understanding tasks (GLUE) and for question answering (SQuAD). Unlike the well-known GPT models, which use causal language modelling (CLM), BERT introduces masked language modelling (MLM). MLM corrupts 15% of the input tokens: of those, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% are kept as is. BERT also uses a next sentence prediction (NSP) pre-training task that significantly improves performance on question answering. In their ablation study, the authors showed that BERT is useful both in the fine-tuning approach and in the feature-based approach (where the pre-trained layers are frozen and newly added layers are trained on top of them).
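The 15% / 80-10-10 MLM corruption described above can be sketched as follows. This is a minimal token-level illustration, not the paper's implementation: the token list, vocabulary, and function name are invented for the example, and real BERT code operates on WordPiece ids with special tokens excluded from masking.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption (illustrative sketch).

    Selects ~15% of positions as prediction targets; of those,
    80% become [MASK], 10% become a random vocabulary token,
    and 10% keep the original token.
    Returns (corrupted tokens, {position: original token}).
    """
    rng = rng or random.Random(0)
    corrupted = list(tokens)
    labels = {}  # positions the model must predict, with their targets
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # else: 10% keep the token unchanged
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged matters because [MASK] never appears at fine-tuning time; the model must still produce useful representations for ordinary tokens.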
Contributions of The Paper
They demonstrate the usefulness of bidirectionality in pre-training. To enable bidirectionality, they use the MLM task during pre-training instead of CLM (as used in GPT).
They showed that good pre-training is sufficient for many downstream tasks, eliminating the need for heavily engineered task-specific architectures.
They established new state-of-the-art results for 11 NLP tasks (results that have since been surpassed).
Link to The Paper
https://arxiv.org/abs/1810.04805
Comments
No response