HyunkuKwon opened this issue 4 years ago
I just love GPT-2's unicorn story so much that I had to share it with many friends (@nwrim 👋) when I first learned about GPT-2 ❤️ It is definitely fascinating to see how well a machine learning model with such a simple training objective can do, given enough data and computing power.
However, this also leads me back to the question of whether AI can really achieve human-like performance. GPT-2 processed far more text than any human could read in a lifetime, and it encountered a diversity of topics and content much greater than what an average human would encounter. But even with all of these data and computing resources, it still can't beat humans on every test.
How can we make GPT-2 take better advantage of the available resources? Can we really say that all we need is a general, unsupervised language model, when GPT-2 obviously has a much greater advantage in data and model complexity than other language models? How much do we expect GPT-2's performance to improve as we increase the number of parameters and the amount of training data? Many computational linguists are concerned about the grounding problem in ML models. Is this what GPT-2 lacks? Does this mean that even with so much data and model complexity, it is still not enough to solve the grounding problem? If so, then even with more data and parameters, I don't think GPT-2 would be able to reach a human's level. What is missing that could solve the grounding problem?
I cannot believe I am only learning about GPT-2 now. I have been missing out on some good memes. My question is an applied one, and a radical one at that: to what extent are language models capable of innovation? They might just be typing stuff that makes grammatical sense, but can they, on rare occasions, generate profound and original ideas, in the spirit of monkeys typing Shakespeare? Of course, one follow-up question is whether we have a measure of originality and an automatic way to test the originality of newly generated text. I had a bit of fun playing around with this idea in GPT-2. I was hoping GPT-2 would tell me the cure for coronavirus, but it went political 0 to 100 real quick ...
"The cure for COVID-19 is no cure at all. It just makes it worse.
Climate reporters can blame Fox for their lack of diligence, but that just begs the question of why the people who report on and cover global warming would be so concerned about it. Perhaps because they know that there is no cure.
'Don't cover a story you know is not true,' was the tagline of a former ExxonMobil (XOM) official who resigned last year after pushing for a huge expansion of fossil fuel production."
Thanks, GPT-2. Also, some observations: I ran the model with the same opening several times, but the output is different every time. We would think that "COVID-19" and "coronavirus" are interchangeable and describe exactly the same thing, but from what I observe, the outputs using "COVID-19" tend to be more scientific, whereas the outputs using "coronavirus" tend to sound more like articles in the mainstream media. This is most likely because documents from the scientific community are more likely to use COVID-19, while coronavirus is the term for a general audience. Finally, "a cure" and "the cure" also gave vastly different results. The former leads to reasonable outputs that mostly point out that no cure exists at the moment, while the latter generates BS like the passage I quoted. My prior tells me that this is because the phrase "the cure for COVID-19" can hardly be found in any serious reporting on vaccine development; it is nonetheless an eye-catching phrase that probably works well as click-bait.
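For anyone who wants to reproduce this kind of prompt experiment, here is a minimal sketch using the Hugging Face transformers library and the small public GPT-2 checkpoint; this is my own setup and an assumption about the interface, not necessarily what was used above, and the prompts are just the variations I tried.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompts = [
    "The cure for COVID-19 is",
    "A cure for COVID-19 is",
    "The cure for coronavirus is",
]

for prompt in prompts:
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # Sampling (do_sample=True) is why the same opening gives a different
    # continuation on every run; greedy decoding would be deterministic.
    output = model.generate(
        input_ids,
        max_length=60,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True), "\n")
```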
As a fan of great writers, and as a person with an almost irrational belief in the human capacity for playing with language, I am always skeptical of whether machines could learn language to the level we have.
Specifically, what I think is truly amazing about the human capacity for language is that not only can we generate sentences that make sense grammatically and semantically, but we can also manipulate the topic or purpose of the discourse and add subtle nuances to what we speak or write.
An arbitrary example: say a newspaper reporter J wants to write an article about a public figure K that makes K look bad, but J cannot explicitly criticize K in harsh words. In this case, J can include carefully calibrated anecdotes that make K look bad and give the article a general tone of skepticism.
I feel that even state-of-the-art techniques are not close to actively creating texts with a pre-defined topic ("degrading K") or a pre-set tone ("secretly criticizing"). The GPT-2 article mentions that the model is able to mimic the tone of the input it is given, but I do not think that is the same dimension as what I am talking about.
Have there been any advances on this in text-generation research beyond "just train the model on a different dataset with the desired tone"? I am very interested in learning more about this!
It is exciting to read the stories generated by GPT-2. My first question concerns how so many different kinds of tasks are handled by just one simple model trained on the corpus. Are there any tradeoffs during model training and prediction to handle the different forms of tasks? Or is it possible that the model is actually overfitted, but the problem is "covered" by the extremely large corpus (40GB)? In other words, when the training text is large enough, overfitting may not be a severe problem, because the text covers almost every possible input and the model approaches a deterministic one. If true, this may to some degree explain how a single model deals with multiple tasks.
Another concern is about its policy implications. As mentioned in the article, GPT-2 can do unsupervised translation between languages, while in my opinion successful translation requires mastery of the different languages, and translation is still a difficult task even for humans. When doing translation, is GPT-2 also trained on languages other than English, and does its performance differ across them? Also, it is a bit worrying to see the potential malicious uses of the model, which makes me wonder: when it is applied in the social sciences, or when it helps make policy decisions, to what extent should we rely on its outcomes? How much should we "trust" the results of the model? That ultimately depends on our trust in the model itself.
I think using learning methods (supervised or unsupervised, deep learning or reinforcement learning) to analyze and even "write" literature would be a huge step forward in NLP, since the corpora and the writing styles of specific literary works vary. I just want to know more about the difference between RNNs and plain NNs. My basic understanding is that an RNN has an extra layer of memory, which introduces more parameters and transitions for dealing with the inputs. Under which occasions should we use an RNN versus a plain NN? Or should we just use both and compare the results so that we can pick one?
Given that there are five common activation functions for neurons (step, sigmoid, tanh, ReLU, and Leaky ReLU), I saw somebody point out that a big drawback of RNNs is vanishing gradients (see https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0). Why would data scientists adopt the tanh function in the hidden layers instead of functions like ReLU that are less prone to this? (A toy sketch of the vanishing-gradient effect follows below.)
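To make the vanishing-gradient point concrete, here is a toy numpy sketch (my own illustration, not from the readings): in a recurrence h_t = tanh(W h_{t-1}), the backpropagated gradient is multiplied at every step by diag(1 - h_t^2) W, and since the tanh-derivative factors are at most 1, the gradient norm shrinks rapidly over long sequences. One common reason tanh is still preferred inside the recurrence is that it is bounded, so the hidden state itself cannot blow up over many steps, whereas an unbounded ReLU recurrence can explode.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps = 32, 50
W = rng.normal(scale=0.1, size=(hidden, hidden))  # small recurrent weight matrix

# Forward pass: unroll the recurrence h_t = tanh(W h_{t-1})
hs = [rng.normal(size=hidden)]
for _ in range(steps):
    hs.append(np.tanh(W @ hs[-1]))

# Backward pass: the gradient with respect to earlier hidden states is
# multiplied by diag(1 - h_t^2) W at every step, so its norm decays
# geometrically as we go back in time.
grad = np.ones(hidden)
for t in range(steps, 0, -1):
    grad = W.T @ ((1 - hs[t] ** 2) * grad)
    if t % 10 == 0:
        print(f"t = {t:2d}   gradient norm = {np.linalg.norm(grad):.3e}")
```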
I also see some people pointing out that, in practice, an RNN is limited to looking back only a few steps, even though in theory it could make use of information in arbitrarily long sequences. If that is the case, what makes it better than Word2Vec for natural language processing? Under what circumstances should each of the two models be preferred?
The blog provides very interesting examples, such as generating Shakespeare's writing, Markdown text, LaTeX code, and so on. My questions are somewhat general.
First, I'd like to know more about the difference between LSTMs and RNNs, and in what kinds of situations one gains an edge over the other; the author mentions that there is a caveat to note but does not go into the details and ultimately uses the terms interchangeably. (A small sketch contrasting the two is below.)
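Here is a minimal PyTorch sketch (my own, not from the blog post) contrasting the two on the same input. Architecturally, an LSTM is an RNN whose hidden update is replaced by a gated cell, so it carries an extra cell state that lets it preserve information over longer spans; that is roughly the distinction behind the caveat.

```python
import torch
import torch.nn as nn

batch, seq_len, n_in, n_hidden = 2, 20, 8, 16
x = torch.randn(batch, seq_len, n_in)

rnn = nn.RNN(n_in, n_hidden, batch_first=True)    # h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)
lstm = nn.LSTM(n_in, n_hidden, batch_first=True)  # adds a gated cell state c_t

rnn_out, h_n = rnn(x)             # hidden state at every step, plus final hidden state
lstm_out, (h_n, c_n) = lstm(x)    # LSTM also returns the final cell state

print(rnn_out.shape, lstm_out.shape)  # both: (batch, seq_len, n_hidden)
```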
Second, in a recent political science colloquium I attended, a working paper employed BERT and a CNN to perform joint image-text classification of tweets, and I think this is a rare application of BERT in academia compared to its much wider use in industry. (Of course, please correct me if I am wrong.) So I am wondering what features make scholars less inclined toward BERT compared to RNNs?
For Deep Learning
I am always confused about how we can properly interpret the hidden layers in a neural network model. This paper and the second blog post give me some insight into how we should treat the layers in an RNN model (a probability tuner for each input character or word), but I am still not sure how the layers interact with each other (or whether they are independent). In addition, I have no idea how we could explain the layers in a CNN model. The paper claims that the layers can be interpreted as different learned features, but how could we use human language to describe those machine-learned features? (A small sketch of how to at least inspect such features is below.)
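One concrete, if limited, way to "look inside" is simply to pull out a layer's activations and, for a CNN, its learned filters, and then inspect or visualize them. Below is a minimal PyTorch sketch using a forward hook; the tiny model is hypothetical, not from the paper. It does not translate the features into human language, but it is the usual starting point for feature-visualization work.

```python
import torch
import torch.nn as nn

# A tiny hypothetical CNN, just to have layers to inspect.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # layer 0: 8 learned filters
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 28 * 28, 10),
)

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model[0].register_forward_hook(save_activation("conv1"))

x = torch.randn(1, 1, 28, 28)        # e.g. one MNIST-sized image
_ = model(x)

print(activations["conv1"].shape)    # (1, 8, 28, 28): the 8 feature maps for this input
print(model[0].weight.shape)         # (8, 1, 3, 3): the learned filters themselves
```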
For The Unreasonable Effectiveness of Recurrent Neural Networks
I found it amazing that an RNN can learn to create seemingly reasonable output after enough iterations of training. However, I think most texts generated by an RNN still don't have any real-world meaning. For example, as the author stated, after 2000 iterations an RNN model produced this:
"Why do what that day," replied Natasha, and wishing to himself the fact the princess, Princess Mary was easier, fed in had oftened him. Pierre aking his soul came to the packs and drove up his father-in-law women.
What is the real-world application of such training? In the field of text generation, what do we aim to achieve with RNNs?
It seems that the text generation quality of GPT-2 is astonishingly good. Is there any metric to evaluate unsupervised learning output?
Super fascinating read, but I think I'm still confused about how RNNs work at a high level. To frame a specific question: in the blog post there is a figure that shows input chars -> input layer -> hidden layer -> output layer -> target chars. What exactly is this hidden layer, and what is it doing?
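Roughly, the hidden layer in that figure is just a vector h that gets updated at every character from the previous h and the current (one-hot encoded) input character; the output layer then reads next-character scores off h. Here is a numpy sketch following the step function outlined in Karpathy's post (the dimensions are made up for illustration):

```python
import numpy as np

vocab, hidden = 65, 100                         # e.g. 65 distinct characters
W_xh = np.random.randn(hidden, vocab) * 0.01    # input layer:  char -> hidden
W_hh = np.random.randn(hidden, hidden) * 0.01   # hidden layer: the recurrent "memory"
W_hy = np.random.randn(vocab, hidden) * 0.01    # output layer: hidden -> char scores

h = np.zeros(hidden)                            # hidden state, starts at zero

def step(char_index):
    """Consume one character (as an index) and return scores for the next one."""
    global h
    x = np.zeros(vocab)
    x[char_index] = 1.0                         # one-hot encoding of the character
    h = np.tanh(W_xh @ x + W_hh @ h)            # hidden layer mixes the input with memory
    return W_hy @ h                             # unnormalized scores for the next character

scores = step(3)                                # feed in character #3
probs = np.exp(scores) / np.exp(scores).sum()   # softmax -> next-character distribution
```

So the hidden layer is the model's running summary of everything it has read so far; training adjusts the three weight matrices so that the scores put high probability on the character that actually comes next.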
The RNN model and GPT-2 remind me of the story generators that we play with. While using them, I was thinking that they could be used to generate political bots or spam emails. I'm glad to see the authors of GPT-2 bringing up the ethical issues.
My question is for Deep Learning: in the conclusion, the authors predict that unsupervised learning will "become far more important in the longer term." What progress has been made in unsupervised learning? Is there any specific advantage of unsupervised learning over supervised learning, apart from its closer resemblance to the "nature of human learning"?
Interesting introduction! My questions are about the application of these deep learning methods:
Karpathy, Andrej. 2015. “The Unreasonable Effectiveness of Recurrent Neural Networks”. Blogpost.
It is interesting to see this blog implement an RNN to generate text character by character. Although the author highlights the effectiveness of RNN text generation, how can we evaluate whether the outputs are successful? Another question is about applying RNNs at a larger scale: can a trained RNN model be applied at a more macro level, where it might be more efficient and helpful for macro-level issues?
Even the smaller version of GPT-2 is powerful enough to compose interesting prose, and it's easy to see how the more powerful versions could write convincing and potentially malicious fake news. I'm curious how we could prevent this or build some sort of checks-and-balances system for the model, beyond only releasing smaller versions. For example, could some sort of built-in, hidden pattern be embedded that, if tested, would reveal that a text was generated by GPT-2 as opposed to being real?
Also, is there a potential worry that the output is too formulaic? Based on playing around a bit with the available demos, it is quite amazing how the model pieces information together to create a narrative, but the results could potentially be too formulaic. I understand that this is just an illustration and that we would use the methodology to build other applications, so perhaps this is not that important a concern.
One of the malicious uses of GPT-2 is generating fake content on social media, but this is already a prominent problem. Since machine-generated language still differs from human language, is it possible to use GPT-2 itself to quantify how likely a piece of online content is to be fake? (A rough sketch of one such approach is below.)
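One rough version of this idea (a sketch, not a validated detector) is to score a text by GPT-2's own perplexity: machine-generated text tends to look "too predictable" to the model compared with human writing, which is the intuition behind likelihood-based detection tools. A minimal sketch with the Hugging Face transformers library:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """GPT-2's perplexity on a text; lower = more predictable to the model."""
    input_ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the average
        # next-token cross-entropy loss over the sequence.
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

print(perplexity("The cure for COVID-19 is no cure at all. It just makes it worse."))
print(perplexity("My grandmother's recipe never survived the move across the ocean."))
```

Perplexity alone is a weak signal (short texts and topic effects confound it), so in practice one would pair it with a supervised classifier, but it illustrates how the generator itself can help flag its own output.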
RNNs seem to work well on sequence data, and it is also mentioned that non-sequential data can be trained in a sequential format. Is there any benefit to doing so?
LeCun, Yann, Yoshua Bengio & Geoffrey Hinton. 2015. “Deep Learning.” Nature 521: 436-444.
Karpathy, Andrej. 2015. “The Unreasonable Effectiveness of Recurrent Neural Networks”. Blogpost.
OpenAI. 2019. “Better Language Models and Their Implications”. Blogpost.