eg-nlp-community / nlp-reading-group


[08/03/2020] [3pm GMT+2] BERT Rediscovers the Classical NLP Pipeline #2

Closed hadyelsahar closed 4 years ago

hadyelsahar commented 4 years ago

Name BERT Rediscovers the Classical NLP Pipeline

Link https://arxiv.org/abs/1905.05950

Abstract:

Pre-trained text encoders have rapidly advanced the state of the art on many NLP tasks. We focus on one such model, BERT, and aim to quantify where linguistic information is captured within the network. We find that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference. Qualitative analysis reveals that the model can and often does adjust this pipeline dynamically, revising lower-level decisions on the basis of disambiguating information from higher-level representations.

Join using the link: https://hangouts.google.com/group/nBpy71nXxHxAovBw5

hadyelsahar commented 4 years ago

This is an ACL short paper that seems to be a follow-up to this ICLR paper by the same authors: https://openreview.net/pdf?id=SJzSgnRcKX

As omar and ibrahim said, the paper doesn't show surprising results: tasks such as POS tagging are solved better by the first layers, while more semantic, global information is handled by the higher layers.

I find Figure 1 less intuitive to understand than Figure 2.

One comment on Figure 2: if layer 1 does all the heavy lifting for POS tagging, as shown by the differential scores, why doesn't it also get the majority of the weight in the mixing scores?
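
For reference, here is how I understand the two per-layer summaries in Figure 2 (a rough sketch with made-up numbers, not the paper's code or results): the mixing weights are an ELMo-style learned softmax over layers, and the expected layer is computed from the differential (cumulative) probing scores.

```python
import numpy as np

# Toy illustration of the two per-layer summaries in Figure 2.
# All numbers below are invented for illustration only.

def scalar_mix_weights(raw_params):
    """ELMo-style scalar mix: softmax over per-layer parameters."""
    exp = np.exp(raw_params - raw_params.max())
    return exp / exp.sum()

def center_of_gravity(mix_weights):
    """Weighted average layer index under the mixing weights."""
    layers = np.arange(len(mix_weights))
    return float((layers * mix_weights).sum())

def expected_layer(cumulative_scores):
    """Expected layer from differential scores:
    delta_l = score(layers 0..l) - score(layers 0..l-1)."""
    deltas = np.diff(cumulative_scores, prepend=0.0)
    layers = np.arange(len(cumulative_scores))
    return float((layers * deltas).sum() / deltas.sum())

# Hypothetical POS-tagging numbers: layer 1 contributes almost all of the
# score gain, yet the learned mix can still spread weight over later layers
# because they also carry the information (it just isn't *new* information).
cumulative = np.array([0.50, 0.95, 0.96, 0.965, 0.97])  # probing score using layers 0..l
raw = np.array([0.1, 0.4, 0.5, 0.6, 0.6])                # learned per-layer mix parameters

print("expected layer   :", expected_layer(cumulative))
print("center of gravity:", center_of_gravity(scalar_mix_weights(raw)))
```

If that reading is right, the two summaries can legitimately disagree: the differential score only rewards new information added at a layer, while the learned mix only needs each layer to still carry the information.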


Although the paper reports results on BERT-base and BERT-large, would those results be consistent if we trained, say, 10 different BERTs with different initializations? I find it a bit self-centred to assume that BERT learns the same features that we humans consider crucial to language. Is that actually the case, or are we reading tea leaves?


Slides from Cho's talk at EMNLP 2019: https://drive.google.com/file/d/1HGzv6n9hAj-GL63POUZCO6nCrIHF9y35/view

ibrahimsharaf commented 4 years ago

I found the paper intuitive and fairly easy to get through. The finding that BERT's lower layers are responsible for solving syntactic tasks (POS, dependencies, constituencies) while higher layers handle semantic tasks (relations, coreference) is not surprising, although the claim still needed to be demonstrated, and that is what the paper does.
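
To make the layer-wise claim concrete, here is a minimal sketch of the general per-layer probing idea (not the paper's edge-probing code; it assumes the `transformers` and `scikit-learn` packages, and the sentence and POS labels are toy placeholders): freeze BERT, take each layer's hidden states, and fit a small linear probe per layer to see where a task becomes solvable.

```python
# Minimal per-layer probing sketch (not the paper's edge-probing setup).
# Assumes `torch`, `transformers`, and `scikit-learn`; the sentence and
# POS labels are toy placeholders, so the scores only show the mechanics.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

words = "the quick brown fox jumps over the lazy dog".split()
pos_tags = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN"]

inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
hidden_states = outputs.hidden_states  # embeddings + one tensor per transformer layer

# Use the first sub-token of each word as its representation.
word_ids = inputs.word_ids()
first_subtoken = [word_ids.index(i) for i in range(len(words))]

for layer, states in enumerate(hidden_states):
    X = states[0, first_subtoken].numpy()
    probe = LogisticRegression(max_iter=1000).fit(X, pos_tags)
    # Training accuracy on the toy sentence; with real data you would
    # hold out a test set and compare probe scores across layers.
    print(layer, probe.score(X, pos_tags))
```

As far as I can tell, the paper's cumulative scoring restricts the scalar mix to layers 0..l instead of probing single layers, but the intuition of "where does the task become solvable" is the same.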

The only downside is that Figures 1 and 2 are not very easy to understand. In Figure 1 I didn't see the point of the blue bars (the center-of-gravity layer weights), and in Figure 2, as @hadyelsahar mentioned, I didn't understand the blue bars (the mixing weights).