NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference
Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina
Year of Publication
2019
Summary
This paper presents a new language representation model named BERT: Bidirectional Encoder Representations from Transformers. At publication it was the state of the art for many natural language understanding tasks (GLUE) and for question answering (SQuAD). Unlike the well-known GPT models, which use causal language modelling (CLM), BERT introduces masked language modelling (MLM). MLM corrupts 15% of the input tokens: of those, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% are kept as is. BERT also uses a next sentence prediction (NSP) pre-training task that significantly improves performance on question answering. In their ablation study, the authors showed that BERT is useful both in the fine-tuning approach and in the feature-based approach (where the pre-trained layers are frozen and newly added layers are trained on top of them).
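The 15% / 80-10-10 MLM corruption described above can be sketched as follows. This is a minimal token-level illustration, not the paper's implementation: the token list, vocabulary, and function name are invented for the example, and real BERT code operates on WordPiece ids with special tokens excluded from masking.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption (illustrative sketch).

    Selects ~15% of positions as prediction targets; of those,
    80% become [MASK], 10% become a random vocabulary token,
    and 10% keep the original token.
    Returns (corrupted tokens, {position: original token}).
    """
    rng = rng or random.Random(0)
    corrupted = list(tokens)
    labels = {}  # positions the model must predict, with their targets
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # else: 10% keep the token unchanged
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged matters because [MASK] never appears at fine-tuning time; the model must still produce useful representations for ordinary tokens.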
Contributions of The Paper
They demonstrate the usefulness of bidirectionality in pre-training. To enable bidirectionality, they use the MLM task during pre-training instead of CLM (as used in GPT).
They showed that good pre-training is sufficient for many downstream tasks, eliminating the need for heavily engineered task-specific architectures.
They established new state-of-the-art results for 11 NLP tasks (results that have since been surpassed).
Link to The Paper
https://arxiv.org/abs/1810.04805
Comments
No response