HyunkuKwon opened 3 years ago
These papers helped me better understand what's going on under the hood in BERT (and why it takes such a long time to run on my laptop!). Rogers et al. (2020) de-mystified for me some of the black-box haziness surrounding BERT. I'm interested in hearing about applications of BERT in public policy. How could BERT be leveraged, for example, to flag students most in need of educational support?
In the first week, you mentioned that SVD on a word embedding can outperform an LDA topic model. I think that may have been a reference to "discourse atoms." Could you provide some detail on this, or other ways in which neural networks can perform topic modeling?
Also, are discourse atoms available in gensim or another accessible library?
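(I'm not aware of a gensim implementation of discourse atoms, but the underlying idea, sparse dictionary learning over the word-embedding matrix as in Arora et al.'s "atoms of discourse," can be sketched with scikit-learn. The toy random vectors below stand in for real embeddings; names and parameters are illustrative only.)

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy stand-in for a word-embedding matrix: 200 "words" in 50 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 50))

# Discourse atoms are recovered by sparse dictionary learning:
# each word vector is approximated as a sparse combination of a
# small set of shared "atom" directions, which play the role of topics.
learner = DictionaryLearning(n_components=10, transform_n_nonzero_coefs=3,
                             max_iter=20, random_state=0)
codes = learner.fit_transform(embeddings)  # sparse word-to-atom loadings
atoms = learner.components_                # atom directions (10 x 50)

print(codes.shape, atoms.shape)
```

With real embeddings, the nearest-neighbor words of each atom direction are inspected the way one inspects top words of an LDA topic.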
The authors of the paper argue that the main limitation of existing techniques is that the standard language model is unidirectional, which limits the choice of architectures that can be used during pre-training. In the paper, the authors improve on architecture-based tuning by proposing BERT, Bidirectional Encoder Representations from Transformers.
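(The bidirectionality point can be made concrete: BERT's masked-language-model objective hides some tokens and predicts them from context on both sides, whereas a left-to-right LM can only condition on the left. A minimal sketch of the masking step; the 15% mask rate follows the paper, but the function itself is illustrative, not from any library.)

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Hide ~15% of tokens; the model must recover each hidden token
    using BOTH its left and right context, unlike a left-to-right LM."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)   # prediction target at this position
        else:
            masked.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return masked, targets

sentence = "the man went to the store to buy milk".split()
masked, targets = mask_tokens(sentence)
print(masked)
```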
I’ve heard a lot about BERT but had not seen many details about it before. Apparently, BERT does very well in many respects, but, like many other NLP models, it demands substantial computational resources. So, instead of buying more computational equipment, are there effective algorithms that can greatly optimize the computation? Or are there promising ideas, even if not yet formalized, that could enable more efficient implementations of BERT?
Could we infuse pragmatic inferences into the self-attention heads? Some pragmatic inferences are hard even for humans, but others might be easy enough to learn from a context corpus. I wonder whether BERT models could eventually pick up simple pragmatic inferences, given a huge corpus (e.g., Wikipedia + news databases + some social media text, for English BERT).
1) BERT and other transformer-based models have dominated NLP benchmarks for the last two years. In which areas of social content analysis do you see RNNs as still competitive, or even holding an edge, if at all?
2) What are some transformer models that can get around BERT's 512-token limit? This seems especially important since a lot of social content analysis involves large corpora.
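(One workaround that needs no new architecture is sliding overlapping windows over the token sequence and pooling the per-window outputs; models like Longformer and BigBird instead modify the attention pattern itself to handle long inputs directly. A minimal sketch of the windowing idea; the function name and parameters are illustrative.)

```python
def sliding_windows(token_ids, max_len=512, stride=256):
    """Split a long token sequence into overlapping windows so each
    chunk fits BERT's 512-token limit; the overlap preserves context
    around the window boundaries."""
    windows = []
    start = 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows

doc = list(range(1200))  # stand-in for a tokenized long document
chunks = sliding_windows(doc)
print([len(c) for c in chunks])
```

Each chunk would then be encoded separately and the document representation obtained by pooling (e.g., averaging) the chunk embeddings.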
Where do we go from here? In other words, is BERT (or at least the architecture of deep learning it represents) the future of machine learning/computational sociolinguistics, with only minor refinements for the foreseeable future, or does it represent just one step into a whole class of possibly even better deep learning algorithms?
Could you clarify more about the pre-training and fine-tuning? And also, could you introduce more on how such unsupervised learning can be used for the topic modelling?
It's interesting to see an overview paper of the BERT model. As the review points out, BERT cannot reason about the relationship between properties and affordances; it can, however, guess at that relationship. I'm not sure how important reasoning ability really is in this area. Stereotyped associations are usually what the model falls back on to infer such relationships; is there any new way to deal with this problem? Or are there other directions we could focus on to address this shortcoming?
In reading 1, Goodfellow et al., the authors still focused on RNNs, while at present RNNs have been almost completely displaced by transformer-based architectures such as BERT. That happened in only about four years; the research cycle in AI and deep learning iterates really fast. What do you think is the key feature of the transformer architecture that enables it to beat the traditional RNN? Thanks!
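(One commonly cited answer: self-attention relates every pair of positions in a single matrix operation, so the whole sequence is processed in parallel and long-range dependencies are one step apart, rather than being threaded through a recurrent state as in an RNN. A minimal NumPy sketch of scaled dot-product self-attention, with toy weights and dimensions:)

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])             # all pairwise similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over keys
    return weights @ V                                 # context-mixed values

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                  # 6 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)
```

An RNN would need six sequential steps to let the last token see the first; here every token attends to every other in one shot, which is also what makes transformers so parallelizable on GPUs.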
Is it generally true that BERT requires more computational power than other methods?
This week we learned about a series of deep learning models such as biLMs, ELMo, and BERT. They work well on some prediction tasks and may work less well on others. How should we choose among them? Also, I am a little concerned about the lack of interpretability of the hidden layers in deep learning methods. I think the more complex the model is, the harder it is to interpret. What do you think about this question?
BERT is slow to train. Are there any alternatives? Or is BERT always the better choice, given its excellent performance?
Given BERT's computational cost, how widely accessible is this method beyond academic research institutions and tech companies?
Are there any other concerns with BERT besides its slow speed? How should we choose among the various deep learning models?
Given deep learning's demand for high GPU performance, what are some services or applications that social scientists use to perform large-scale natural language processing?
It was interesting to read some efforts from Rogers et al. to understand the “why” of how BERT works. For social science research, what kinds of “why” questions are important to answer, and in what cases can prediction be useful on its own?
As several other people mentioned, one of the biggest limitations seems to be performance. In general, do you think the limitations on the hardware side are crippling innovation on the software side?
Despite its claimed performance, is BERT (or any other giant transformer-based pre-trained network) actually useful in industry or for productivity purposes? If not, those claims to fame about accuracy and benchmarks really don't mean much.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. Chapter 12.4 “Applications: Natural Language Processing.” MIT press: 456-473.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv:1810.04805. (Also check out: “Deep contextualized word representations” by Peters et al. from AllenAI and the University of Washington.)
Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. 2020. “A Primer in BERTology: What We Know About How BERT Works.” arXiv:2002.12327.