HyunkuKwon opened this issue 4 years ago
A Primer in BERTology
The paper suggests that redundant heads and layers in BERT may result from attention dropout. However, I am quite confused about why this would increase redundancy rather than decrease it. Could we then avoid attention dropout altogether, or use less of it? What would be the tradeoff of doing so? As an alternative to BERT compression (or any compression applied after the model is trained), would it be possible to do some dynamic selection of redundant heads during training?
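To make that last question concrete, here is a minimal sketch, entirely my own assumption rather than anything from the paper, of what learning to switch heads off during training could look like. The `head_gate` parameter is hypothetical; real head-pruning work uses more principled relaxations (e.g., L0 penalties on the gates).

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Toy multi-head self-attention with a learnable gate per head.

    Heads whose sigmoid(gate) is driven toward 0 during training are
    effectively pruned, a rough stand-in for "dynamic head selection".
    """
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.head_gate = nn.Parameter(torch.zeros(n_heads))  # hypothetical gate

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)  # each: (B, heads, T, d_head)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = torch.sigmoid(self.head_gate).view(1, -1, 1, 1) * (attn @ v)
        return self.out(heads.transpose(1, 2).reshape(B, T, D))

# A sparsity penalty on torch.sigmoid(head_gate) added to the training loss
# would push redundant heads toward zero while the model trains.
layer = GatedSelfAttention()
print(layer(torch.randn(2, 10, 64)).shape)      # torch.Size([2, 10, 64])
```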
Also related to that, should we want units in BERT to encode non-overlapping/unique information? A friend of mine is working on getting each unit in a VAE to encode unique information, so I wonder whether this is also desirable/preferable in BERT.
Finally, do you think it would be possible and beneficial to additionally train BERT at the character level (on top of the subword/WordPiece level it uses now)? I have read papers on simple generative models that are trained at the character level and still perform as well as (if not better than) models trained at the word level.
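On the character-level idea, the main tradeoff I can see is sequence length; a quick illustration (the WordPiece side assumes the Hugging Face `transformers` library and the `bert-base-uncased` tokenizer):

```python
from transformers import AutoTokenizer  # assumed to be installed

sentence = "Pre-training of deep bidirectional transformers for language understanding"

subwords = AutoTokenizer.from_pretrained("bert-base-uncased").tokenize(sentence)
characters = list(sentence)

print(len(subwords), subwords)   # a handful of WordPiece units
print(len(characters))           # every character becomes a token, so sequences are
                                 # several times longer, and self-attention cost,
                                 # which is quadratic in length, grows accordingly
```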
For Rogers, Kovaleva, & Rumshisky (2020):
Although highly technical and hard for me to understand, I think the general tone of the paper is that BERT itself cannot yet be comprehended fully and is undergoing a lot of investigation. This reminds me of recent discussions in academia about AI as the target of research as well as the end product of research - in other words, the behaviour of AI should be studied as if it were a being, since even the creators of such machines cannot fully understand what is going on. (If anybody is interested, the Nature article "Machine Behaviour" is a very fun read.)
This brings me to the question of using highly complicated neural networks in the realm of social science. I feel like it is okay to just use whatever performs best in engineering or CS kinds of fields, but in social science the emphasis is on understanding what is happening rather than just knowing that something is happening. To hide behind a Giant's back (which fits the situation better than the common expression), Hopkins & King (2010) said that "Policy makers or computer scientists may be interested in finding the needle in the haystack (...) but social scientists are more commonly interested in characterizing the haystack." (p. 230).
Given this, I wonder how we can "sell" the use of BERT, RNNs, or any other complicated neural networks in social science settings. Even the exemplary reading for this week seems to sit in a highly computational space rather than a social science one.
For “Bert: Pre-training of deep bidirectional transformers for language understanding.”
It is a truly technical paper, and my first concern is about the worthiness of developing a mechanism based on bidirectionality. The motivation for the idea is quite clear, but it may require double the workload compared with the unidirectional approach. Also, many models that process text from just one direction perform reasonably well, even quite well. Therefore, I wonder whether the improvement from bidirectionality is truly worth the cost?
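For reference, here is my own toy sketch (not from the paper) of the mechanical difference: a unidirectional, left-to-right model masks out future positions in its attention, while a bidirectional model like BERT lets every token attend to the whole sequence and instead relies on masked-token prediction during pre-training.

```python
import torch

T = 5                                    # toy sequence length
scores = torch.randn(T, T)               # raw attention scores, one row per query token

# Unidirectional (GPT-style): token i may only attend to positions <= i.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
uni_attn = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)

# Bidirectional (BERT-style): no mask, every token sees the full sentence.
bi_attn = torch.softmax(scores, dim=-1)

print(uni_attn[2])   # zero weight on positions 3 and 4
print(bi_attn[2])    # non-zero weight everywhere
```

As far as I understand, the mask itself adds essentially no compute; the extra cost Devlin et al. mention comes from the masked-LM objective, which only predicts about 15% of the tokens per example and therefore needs more pre-training steps to converge.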
Another question is about the "self-attention mechanism" mentioned in the paper. I wonder how it works during BERT's processing, and what it is in detail? From my understanding, it refers to "attention" paid to certain words/characters in the sentence by giving different weights to the words, but there is not much content on this mechanism in the paper. I tried to sketch my current understanding below - is this roughly right?
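A minimal sketch of single-head scaled dot-product attention, as introduced in the Transformer paper that BERT builds on (my own toy NumPy version, corrections welcome):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X          : (T, d_model) token embeddings for one sentence
    Wq, Wk, Wv : learned projection matrices (d_model, d_k)
    Returns (T, d_k): each output row is a weighted mix of all value vectors,
    where the weights are exactly the "attention" each token pays to the others.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (T, T): token i's attention over token j
    return weights @ V

# toy usage: 4 tokens, 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)      # (4, 8)
```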
For Bert: Pre-training of deep bidirectional transformers for language understanding
I tend to think the focus of deep learning is learning representations of sentences, and the success of BERT shows that large-scale data is very important in NLP. I also noticed that pre-training is very important in BERT. However, the computation cost of BERT is still high. I wonder whether there are any solutions to this problem, or whether we can only rely on supercomputers?
For Bert: Pre-training of deep bidirectional transformers for language understanding
BERT is a really fascinating NLP technology, which shows that a "deep" model pre-trained on unlabelled data can significantly improve downstream accuracy.
The paper discusses continuing to explore the potential of BERT on unlabelled data, not limited to content analysis, so I wonder which areas would be suitable for applying such an algorithm.
Another concern is that BERT, as a "costly" algorithm, seems to place high requirements on hardware, so it can hardly be run on personal computers. (Although I have never tried BERT myself, I implemented an LSTM in one of my projects and the program ate a lot of memory.) So I would like to know whether this obstacle hinders large-scale analysis with BERT on personal machines?
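For what it's worth, one workaround I have seen suggested is to use a distilled variant for feature extraction on CPU. A hedged sketch, assuming the Hugging Face `transformers` library and PyTorch are installed (DistilBERT is reported to keep roughly 97% of BERT's performance with about 40% fewer parameters):

```python
# pip install transformers torch   (assumed environment)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

sentences = ["BERT is memory hungry.", "Distilled models are lighter."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():                            # no gradients -> much lower memory
    hidden = model(**inputs).last_hidden_state   # (batch, seq_len, 768)

# mean-pool over tokens to get one vector per sentence
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_vecs = (hidden * mask).sum(1) / mask.sum(1)
print(sentence_vecs.shape)                       # torch.Size([2, 768])
```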
For A Primer in BERTology: What we know about how BERT works.
From this paper it seems that BERT is still being experimented with, and a lot of puzzles are emerging along the way. My confusion is why overparametrization of BERT may even ruin the results. (Though I know this is also a problem under investigation, I just can't find an intuitive interpretation. From my shallow understanding of machine learning, even when overfitting becomes considerable, the in-sample fit should still be good.) Similar to @wanitchayap, I am also wondering what the role of attention dropout in BERT is.
For Applications: Natural Language Processing
Given deep learning's demand for high GPU performance, what are some of the common services that social scientists use to perform large-scale natural language processing?
It seems that BERT is an embedding model with very good performance. I wonder whether this has anything to do with its training corpus (the entire English Wikipedia plus a books corpus)? And why do most studies still use GloVe or Word2Vec?
After reading Goodfellow's chapter on applications to natural language processing, I am interested in the two different strategies of reinforcement learning mentioned in the recommendation-system section: exploration and exploitation. My general question is: since the authors suggest that exploration is more suitable for unsupervised learning, how can we make sure that we are applying the right strategy and learning the right pattern? If I am recommending a new course to a medium-level learner, what kind of feedback is needed for a better recommendation (i.e., what kind of new data should we deliberately seek out to collect)?
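Here is the toy sketch I had in mind to make the exploration/exploitation tradeoff concrete, an epsilon-greedy recommender; the course names and the reward simulation are entirely made up for illustration:

```python
import random

courses = ["intro_stats", "intermediate_ml", "advanced_nlp"]
estimated_value = {c: 0.0 for c in courses}   # running estimate of completion rate
counts = {c: 0 for c in courses}
epsilon = 0.1                                  # fraction of the time we explore

def recommend():
    # Exploration: occasionally try a course we are still uncertain about,
    # so we keep collecting data on options that currently look worse.
    if random.random() < epsilon:
        return random.choice(courses)
    # Exploitation: otherwise recommend the course with the best estimate so far.
    return max(courses, key=estimated_value.get)

def update(course, reward):
    # Incremental mean update from observed feedback (e.g., did the learner finish?)
    counts[course] += 1
    estimated_value[course] += (reward - estimated_value[course]) / counts[course]

# toy loop: pretend the medium-level learner likes intermediate_ml most
true_rate = {"intro_stats": 0.3, "intermediate_ml": 0.7, "advanced_nlp": 0.4}
for _ in range(1000):
    c = recommend()
    update(c, 1.0 if random.random() < true_rate[c] else 0.0)

print(estimated_value)   # should rank intermediate_ml highest
```

In this framing, the feedback needed is the reward signal (completion, rating, click), and the data we deliberately seek out is exactly the occasional exploratory recommendation of options whose value we are still unsure about.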
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “Bert: Pre-training of deep bidirectional transformers for language understanding.”
BERT's two ways of applying pre-trained representations, feature-based and fine-tuning, allow task-specific attention to be carried into the model. This question surfaced as I was trying to understand the design of BERT: "A distinctive feature of BERT is its unified architecture across different tasks." Why is it so important to keep a consistent architecture across tasks?
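My rough reading, sketched below with a toy encoder of my own (not the paper's code), is that a unified architecture lets the same pre-trained encoder be reused unchanged, with only a small task-specific head swapped on top, so almost all parameters transfer across tasks:

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the pre-trained BERT encoder (same weights for every task)."""
    def __init__(self, d_model=64):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, x):                   # x: (batch, seq_len, d_model)
        return self.layer(x)

encoder = TinyEncoder()                     # pre-trained once, shared across tasks

# Only the heads differ per task; the encoder architecture (and its
# pre-trained weights) stay the same, which is what makes fine-tuning cheap.
classification_head = nn.Linear(64, 2)      # e.g., sentiment: read off the first token
tagging_head = nn.Linear(64, 9)             # e.g., NER: one label per token

x = torch.randn(8, 16, 64)                  # toy batch of already-embedded sentences
h = encoder(x)
sentence_logits = classification_head(h[:, 0])    # (8, 2)
token_logits = tagging_head(h)                    # (8, 16, 9)
```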
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv:1810.04805. (Also check out: “Deep contextualized word representations” by Peters et al from AllenAI and the University of Washington).
I tried my best to understand this quite technical paper. The bidirectional approach the authors propose is interesting. Although the bidirectional model may be much more accurate than the unidirectional one, I am wondering whether it is really cost-effective. Also, to make the question more general, are there languages that are better served by unidirectional models while others need bidirectional ones? I am very much looking forward to learning more about the application of NLP to different languages.
It seems that the newer NLP models, such as BERT, pride themselves on not being task-specific. However, when applying them to social science content, tasks do play a role because the underlying data-generating processes are different. Is this something we should consider when applying these models in social science research?
Post questions here for one or more of our fundamentals readings:
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. Chapter 12.4 “Applications: Natural Language Processing.” MIT press: 456-473.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv:1810.04805. (Also check out: “Deep contextualized word representations” by Peters et al from AllenAI and the University of Washington).
Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. 2020. “A Primer in BERTology: What we know about how BERT works.” arXiv:2002.12327.