eg-nlp-community / nlp-reading-group


[15/03/2020] 6pm GMT+2 - A Primer in BERTology: What we know about how BERT works #3

Closed hadyelsahar closed 4 years ago

hadyelsahar commented 4 years ago

Name:

A Primer in BERTology: What we know about how BERT works

URL: https://arxiv.org/abs/2002.12327

Abstract:

Transformer-based models are now widely used in NLP, but we still do not understand a lot about their inner workings. This paper describes what is known to date about the famous BERT model (Devlin et al., 2019), synthesizing over 40 analysis studies. We also provide an overview of the proposed modifications to the model and its training regime. We then outline the directions for further research.

Join us through hangouts: https://hangouts.google.com/group/kUxBAunjGittAkBUA

Omarito2412 commented 4 years ago

This paper is a survey of the research on interpreting BERT and how it works. BERT has been analyzed along many directions, among them syntactic knowledge, semantic knowledge, use as a knowledge base, and more.

The cool thing about the paper is how people found patterns and formed hypotheses about the way BERT works; although these hypotheses are not always supported by the strongest evidence, they seem reasonable.

It can be considered a good resource for understanding BERT and how it works. For future work, I think there should be more surveys on why BERT fails or where BERT falls short.

hadyelsahar commented 4 years ago

An interesting paper collecting previous BERT work; it serves as a hub for all the work adapting BERT over the past year. From the paper I find three intriguing questions:
1- If the middle layers are already effective at solving downstream tasks, what is the use of adding more layers?
2- Models don't make use of all their parameters, which can be demonstrated by pruning. While this is a problem in most DL models, how can one find better architectures and training methods that force the model to make use of all available parameters? (A small pruning sketch follows this list.)
3- What new datasets / tasks can one create to test BERT's shortcomings?
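
To make point 2 concrete, here is a minimal sketch, not from the paper, assuming the Hugging Face transformers library and bert-base-uncased: it prunes a few attention heads and checks how much the parameter count drops, which is the kind of removal the pruning studies surveyed in the paper perform before re-measuring downstream accuracy.

```python
# Minimal head-pruning sketch (assumes `transformers` and `torch` are installed).
# Illustrates the parameter-redundancy argument in point 2 above.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
params_before = sum(p.numel() for p in model.parameters())

# Remove heads 0-2 from layer 0 and head 5 from layer 11.
model.prune_heads({0: [0, 1, 2], 11: [5]})

params_after = sum(p.numel() for p in model.parameters())
print(f"parameters before: {params_before:,}  after: {params_after:,}")
# The surveyed studies then re-evaluate the pruned model on downstream tasks.
```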

Examples of work discussing #3:

Probing Natural Language Inference Models through Semantic Fragments https://arxiv.org/pdf/1909.07521.pdf

BERT is Not a Knowledge Base (Yet): Factual Knowledge vs. Name-Based Reasoning in Unsupervised QA https://arxiv.org/abs/1911.03681

Also mentioned during the discussion: the NLP's Clever Hans blogpost https://bheinzerling.github.io/post/clever-hans/

Ragabov commented 4 years ago

That was a very interesting read. I agree with @hadyelsahar that this should be the second paper someone reads when getting introduced to BERT, right after the original BERT paper.

The inquiries I found most intriguing were as follows:

1- It's mentioned that there could be significant redundancy in the model's learned parameters; one hypothesis I found very interesting is that this is a result of applying dropout to the attention layers (a hypothetical config sketch for testing this appears after this list).

2- Although some semantic and syntactic knowledge is somewhat encoded in the model's parameters, it's shown that the model isn't really utilizing this knowledge to its best.

3- The survey manages to shine a spotlight on shortcomings of BERT like the ones mentioned in the two points above. Yet what is the best direction to move forward? Surely just training bigger models doesn't solve these deficiencies.
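
On the dropout hypothesis in point 1, here is a hypothetical sketch, not from the paper or the survey, of how one might set up a comparison: two otherwise identical BERT configurations that differ only in attention dropout, using the Hugging Face transformers configuration API. The expensive part, pretraining both and comparing how much each can be pruned, is only indicated by a comment.

```python
# Hypothetical setup for testing whether attention dropout drives redundancy
# (assumes `transformers` and `torch` are installed; models start untrained).
from transformers import BertConfig, BertForMaskedLM

baseline_cfg = BertConfig()  # default attention_probs_dropout_prob=0.1
no_attn_dropout_cfg = BertConfig(attention_probs_dropout_prob=0.0)

baseline_model = BertForMaskedLM(baseline_cfg)
no_attn_dropout_model = BertForMaskedLM(no_attn_dropout_cfg)

# ...identical pretraining for both, then head pruning on each and a comparison
# of how much can be removed without hurting downstream accuracy...
```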

I do believe that the next breakthrough in NLP will come from better training datasets and tasks. By training the model on more complex tasks (enabled by better datasets), one could hope that the shortcuts available for the model to take would be fewer than those available on a trivial learning task.

The ideal situation would be a huge dataset that allows multi-task learning of several complex objectives (e.g. ones that require logical reasoning), which would hopefully curb the model's ability to "hack" the solution.

ibrahimsharaf commented 4 years ago

A great survey paper to get up to speed with BERT. What mostly grabbed my attention is the conflicting evidence on which layers work best for syntactic and semantic tasks (section 5.2, BERT layers): some papers conclude that syntactic information appears in the earlier layers while high-level semantic features appear at the higher layers, whereas other papers suggest that the middle layers perform best.
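
The layer-wise comparisons behind that conflicting evidence are usually obtained by probing: fitting a small classifier on each layer's representations. Below is a minimal toy sketch of that recipe, not the exact setup of any of the cited papers; it assumes transformers and scikit-learn are installed, and the two sentences and labels are placeholders for a real probing dataset.

```python
# Toy layer-probing sketch: fit a linear probe on each layer's [CLS] vector.
# (Assumes `transformers`, `torch`, and `scikit-learn`; data is illustrative only.)
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizer

sentences = ["the cat sat on the mat", "colorless green ideas sleep furiously"]
labels = [0, 1]  # placeholder binary property to probe for

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

with torch.no_grad():
    encoded = tokenizer(sentences, return_tensors="pt", padding=True)
    hidden_states = model(**encoded).hidden_states  # embeddings + one entry per layer

for layer, states in enumerate(hidden_states[1:], start=1):
    features = states[:, 0, :].numpy()  # [CLS] representation per sentence
    probe = LogisticRegression(max_iter=1000).fit(features, labels)
    print(f"layer {layer:2d} probe accuracy: {probe.score(features, labels):.2f}")
```

With a real labelled dataset (and a held-out split), per-layer scores like these are what papers compare when arguing that syntax lives lower and semantics higher, or that the middle layers win.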

Another point is the suggested research directions (section 9.2), especially the first part about building new datasets that require verbal reasoning; a very interesting follow-up blogpost on this matter: https://bheinzerling.github.io/post/clever-hans/