Computational-Content-Analysis-2020 / Readings-Responses

Repository for organising "exemplary" readings and posting responses.

Deep Classification, Embedding & Text Generation - OpenAI 2019 #38

Open jamesallenevans opened 4 years ago

jamesallenevans commented 4 years ago

OpenAI. 2019. “Better Language Models and Their Implications”. Blogpost.

ziwnchen commented 4 years ago

This is truly amazing! I am struck by how far current technology has progressed toward a general language model, or even further, artificial general intelligence. As the blog mentions in its final section, this kind of general language model will have a huge societal impact (e.g., the automation of malicious content generation). It reminds me of the recent news about Joseph Redmon stopping his CV research because of its potential negative social impact.
Is there any way we could estimate the societal consequences of a potential AGI model? Is it possible to build certain "rules" into the model design to prevent it from generating malicious content? One possible example is the effort to implement moral standards in machine intelligence: the Moral Machine experiment.

lkcao commented 4 years ago

This study, based on unsupervised learning over a huge corpus, builds an impressive AI that can handle multiple general tasks well. My questions are as follows: (1) How can a single network handle so many different tasks? (Is it the case that they change the form of the output layer?) (2) Can we say that the accuracy comes from the complex architecture of the algorithm plus the huge corpus, and hence that they built a quite accurate AI at the cost of computational efficiency? (3) How can we call the algorithm 'unsupervised', given that deep learning is usually trained on real texts? (In the text, the authors say the model should estimate p(output | input, task), so I guess they use output information when generating the probability distribution.)
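On question (1): the blog and paper describe specifying the task inside the input text itself, e.g. inducing translation by conditioning on "english sentence = french sentence" pairs. A minimal sketch of that idea, assuming the HuggingFace transformers library and the small public gpt2 checkpoint (which handles such prompts only crudely compared to the 1.5B model in the post):

```python
# pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small public checkpoint

# The "task" lives in the prompt itself, so a single network can be read as
# estimating p(output | input, task) with no task-specific output layer.
prompt = ("good morning = bonjour\n"
          "thank you very much = merci beaucoup\n"
          "cheese =")
out = generator(prompt, max_new_tokens=5, do_sample=False)  # greedy, reproducible
print(out[0]["generated_text"])
```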

arun-131293 commented 4 years ago

As impressive as the results may seem, needing 40 GB of high-quality/low-noise text data for a language model that still performs below par relative to humans is indicative of a fundamental difference in language learning. Additionally, as the authors point out, the model is only effective on certain topics, and even then only after several tries per generation, since the randomness involved means some outputs for the same prompt are better than others. In children, the language-learning architecture is not initialized with a random state, whereas neural networks begin from a random initial state. There have been ideas about importing a human-like initial state into neural networks, as a paper from Science details, so that the learning process is not purely data-driven but is constrained by the initial state, just as in children during their language-sensitive age.

Have there been such efforts in computational social science to import an initial state that represents a sociological/economic/linguistic/anthropological theory, which would constrain the model away from purely learning descriptive patterns from data toward a model informed by that theory, while also making the learning more efficient in terms of data requirements (both in scale and quality)? How does transfer learning with GPT-2 compare to this?

arun-131293 commented 4 years ago

> This study, based on unsupervised learning over a huge corpus, builds an impressive AI that can handle multiple general tasks well. My questions are as follows: (1) How can a single network handle so many different tasks? (Is it the case that they change the form of the output layer?) (2) Can we say that the accuracy comes from the complex architecture of the algorithm plus the huge corpus, and hence that they built a quite accurate AI at the cost of computational efficiency? (3) How can we call the algorithm 'unsupervised', given that deep learning is usually trained on real texts? (In the text, the authors say the model should estimate p(output | input, task), so I guess they use output information when generating the probability distribution.)

  1. Because what we are learning is not only a language model that can generate text given some initial text, but also (a) contextual word embeddings that take into account the context the word appears in (some architectures can use both left and right context), so homonyms like "pole" ("she pole vaults" vs. "he is a Pole") and "coke" ("drank" vs. "snorted") get different embeddings depending on usage; and (b) vector(s) that capture the semantics of the entire text/sentence. So you can feed in your text and get back both the corresponding word embeddings and a state that captures the "meaning" of the whole text. Most NLP tasks' performance seems to improve when such information is used. (See the sketch after this list.)
  2. Unsupervised because we are using non-coded data (text as it was written, with no other information added).
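To make the contextual-embedding point in (a) concrete, a minimal sketch assuming the HuggingFace transformers library and the public gpt2 checkpoint (note that GPT-2 itself only uses left context):

```python
# pip install transformers torch
import torch
from transformers import GPT2Model, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def last_word_vector(sentence: str) -> torch.Tensor:
    """Contextual vector of the final token in the sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).last_hidden_state[0, -1]

drink = last_word_vector("After the run she drank a cold coke")
drug = last_word_vector("At the party he snorted some coke")

# Same surface form, different contexts -> different vectors
print(torch.cosine_similarity(drink, drug, dim=0).item())
```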

jsmono commented 4 years ago

It was shocking to see the samples GPT-2 generated, as they even included fiction and essays, both of which would seem to require substantial knowledge and creativity that only human beings possess. I'm wondering how long it takes the program to generate texts like these and how challenging it is to build a similar model. On the other hand, the authors suggest that GPT-2 can translate between languages, which confused me a little. Translation requires mastering two languages; since the current samples are only in English, how can GPT-2's translation be applied to another language? Even if there were two versions of GPT-2 based on different languages, how could they be combined to build meaningful and accurate translations of the texts?

ccsuehara commented 4 years ago

I see that this is the short version of one of the optional readings of the week. My question is whether the unsupervised model is powerful in its predictions because of the method itself or because of the data it was built on. Also, I see that the evaluation metrics they chose to present vary by dataset (accuracy in some cases, perplexity in others). Overall, it seems too good to be true, but then the implications are very real (like the fake-paper-generator example).

katykoenig commented 4 years ago

I appreciate that this work is very mindful of the abuse that could result from such a powerful model. On a specific note, if impersonation is now more feasible using these algorithms, do they need to be trained on a specific person's speech transcriptions to pick up their accent, colloquialisms, filler words, etc.?

On a different note, the post uses perplexity as a measure of the strength of the model, which I haven't seen before. Could you speak more about this? When is it an appropriate metric with which to judge a model (as opposed to accuracy, precision, recall, etc.)?
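For reference, perplexity is the exponentiated average negative log-likelihood per token: roughly, how "surprised" the model is by held-out text, where lower is better. It suits generative language models, which output a probability for every next token rather than a single class label. A minimal sketch of computing it, assuming the HuggingFace transformers library and the public gpt2 checkpoint:

```python
# pip install transformers torch
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the average cross-entropy
    # of predicting each next token from its left context.
    out = model(enc.input_ids, labels=enc.input_ids)

perplexity = torch.exp(out.loss).item()  # PPL = exp(mean NLL per token)
print(f"Perplexity: {perplexity:.1f}")   # lower = less "surprised" by the text
```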

bjcliang-uchi commented 4 years ago

As many others have pointed out, I am quite impressed by how aware this team is of the political and broader social impact of this algorithm. So please pardon me, as a former public policy student, for asking how OpenAI works with industry and policymakers to make the algorithm a positive force for social good, and how to understand this policy framework.

deblnia commented 4 years ago

GPT-2 is really fun. I've had friends generate cover letters and make twitter bots using the tool, and I've read at least some research that puts it into practice, but I'm also struck by the awareness this team has of the political and social implications of their work. I'm not sure how to ask questions about the model itself, as it hasn't been released. How do academics generally deal with a model that qualitatively seems to work, but can't be directly evaluated?

laurenjli commented 4 years ago

I appreciate the authors' discussion of the malicious uses of this technology and the steps they are taking to mitigate them. How can one go about handling or managing biases that are inherent in text but that shouldn't surface in an application's output?

di-Tong commented 4 years ago

I wonder if you could specify a bit more how exactly GPT-2 is trained, as it seems much more powerful than an RNN in handling multiple prediction tasks, yet the article does not say much about the underlying processes that make this happen. What are the parameters? How does one choose how many to use? What does the input structure look like? And what is the relation between this model and a neural network?
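The blog defers the details to the technical paper, but as a rough pointer: GPT-2 is a Transformer neural network, and its size is fixed by a handful of hyperparameters (number of layers, attention heads, embedding width, vocabulary size). A sketch inspecting them, assuming the HuggingFace transformers library and the small public gpt2 checkpoint (about 124M parameters, not the 1.5B model from the post):

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config.from_pretrained("gpt2")
print(config.n_layer, config.n_head, config.n_embd, config.vocab_size)
# 12 transformer blocks, 12 attention heads, 768-dim embeddings, 50257 tokens

model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # ~124M trainable parameters
```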

wunicoleshuhui commented 4 years ago

I did find the authors' discussion of the political and social implications of GPT-2 (and other similar models) quite informative and alarming, especially the issue of impersonating actual people online. From an ethical standpoint, I'm wondering how well such models could accomplish these malicious actions at present, and whether there are tools or other models that can detect whether certain content is machine-generated.

luxin-tian commented 4 years ago

I am also curious about how a general model can almost perfectly handle so many different kinds of tasks. Is there any trade-off, like the bias-variance trade-off, in models like this? Suppose the training corpus has many distinctive features that general language lacks: can the model still be good at general tasks like those in this blog? Is there anything like "overfitting" that can result in weak predictive power because too much "noise" was learned from the training corpus?

sunying2018 commented 4 years ago

It is amazing to see the power of the GPT-2 model, which can achieve so many tasks. As the blog notes, this happens "all without task-specific training", and I am really curious how the model achieves that, even granting its 1.5 billion parameters. Besides, among this huge number of parameters, how can we evaluate their efficiency? Another question, similar to @katykoenig's: I noticed a fairly new evaluation metric, "perplexity". Why do they use this specific metric rather than other, more common metrics?

chun-hu commented 4 years ago

This is quite impressive work, and I appreciate the authors' awareness of potential malicious political and societal impacts. I'm especially struck by the number of parameters (1.5 billion) used in training. Will that be computationally expensive and inefficient? Also, I'm wondering whether the algorithm could be used to predict or even create internet language, since the training data comes from millions of webpages.

skanthan95 commented 4 years ago

What does "perplexity" mean in the evaluation table? The authors don't seem to have explicitly operationalized it in the article. Additionally, how does one go about comparing previous record scores to present performance? More specifically: for the LAMBADA dataset, the previous record was 99 and the present performance is 8.6, where a decrease in score is an improvement. But where does an 8.6 stand in relation to a 99? What does this gradient represent, notch by notch?
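One standard way to read that gradient (this is perplexity's usual definition, not something the post spells out): perplexity is the exponentiated average negative log-likelihood per token,

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\right),$$

so it behaves like an effective branching factor. A score of 99 means the model is, on average, as uncertain as if it were choosing uniformly among 99 candidate next words; 8.6 means choosing among roughly eight or nine. The scale is multiplicative, not linear notch by notch.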

rkcatipon commented 4 years ago

I was really impressed with how GPT-2 handled language ambiguity, such as in the common-sense reading test and the proper identification of the referent of "it". From the linked technical paper on the model, I was also surprised to see that the training set included multilingual sentences, such as:

“I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool].”

and that the model did quite well at translation. As translation is often more art than science, my question is how much of the semantic nuance is lost in machine translation, and is there some way to quantify that loss?
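One common (if imperfect) way to quantify translation quality is an n-gram overlap score such as BLEU, which the underlying paper reports for GPT-2's French-English translations. A minimal sketch using NLTK; the sentence pair here is made up for illustration:

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "je ne suis pas un imbecile".split()  # human translation (tokenised)
candidate = "je ne suis pas imbecile".split()     # machine output (tokenised)

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.2f}")  # 1.0 = exact n-gram match with the reference
```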

YanjieZhou commented 4 years ago

From my experience with big data, I see its striking advantages but also its computational expense, or even waste of resources. Considering that the model is so large that its parameter count reaches the billions, are there other approaches that cost fewer computational resources but can produce similar results, so that the process can be more replicable?

adarshmathew commented 4 years ago

These are great results, and I can see their application in products, given the focus on text prediction. I'm a little unsure about how we'd use this to interrogate corpora of text. To illustrate, the 'Geometry of Culture' paper by Kozlowski, Taddy and Evans constructed scales/dimensions with word embeddings to understand how we perceive cultural terms. How could one use GPT-2 to do the same?

The larger question goes back to the inherent trade-off between the interpretability and performance of these models: they're great at prediction and generalize well, but without understanding their conception of language (and how that differs from our conception and use of language), what would be the best way to leverage these methods for sociological research?
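One conceivable adaptation, sketched under heavy assumptions: take GPT-2's contextual vectors for a word, build an antonym-pair axis in the spirit of Kozlowski et al., and project cultural terms onto it. This assumes the HuggingFace transformers library and the public gpt2 checkpoint; the word-position matching below is deliberately naive and for illustration only:

```python
# pip install transformers torch
import torch
from transformers import GPT2Model, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Average GPT-2's contextual vectors for the sub-tokens of `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    ids = set(tokenizer(" " + word).input_ids)      # sub-token ids of the word
    positions = [i for i, t in enumerate(enc.input_ids[0].tolist()) if t in ids]
    return hidden[positions].mean(dim=0)

# An affluence axis built from an antonym pair, a la 'Geometry of Culture'
rich = word_vector("He is a rich man.", "rich")
poor = word_vector("He is a poor man.", "poor")
axis = rich - poor

# Project a cultural term onto the axis via cosine similarity
golf = word_vector("They spent Sunday playing golf.", "golf")
print(torch.cosine_similarity(golf, axis, dim=0).item())
```

Unlike the static embeddings in the original paper, these vectors change with the sentence supplied, so any such scale would have to average over many contexts.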

sanittawan commented 4 years ago

I have a similar concern to @laurenjli's after reading the article, especially when putting it in the context of the Caliskan et al. reading. It seems that GPT-2 is really capable of imitating humans; however, I wonder how researchers are tackling the issue of biases in these models.

meowtiann commented 4 years ago

This is word-level prediction, compared to the single-character prediction in the other article, and it does a far better job of producing meaningful sentences and even plots.

I can clearly see the 'malicious' uses of this method that made the developers so cautious. Politicians, governments, or activist groups could use text generators to flood social media with a click. In fact, this has happened on Chinese social media quite often in the past decade: when one forum decided to attack another, its members were mobilized to flood the target with prewritten texts from fake accounts, and because posting is faster than deleting, the target forum would be full of inappropriate content for a long time. Now that social media is less a forum-based community and more an updates-and-comments community, instead of turning off comment sections or deleting dissenting voices in them, it becomes more likely that comment sections will be altered at mass scale with a natural language generator like this.

Another use is app store reviews. Some apps' ratings are full of nonsensical five-star reviews. The language generators in use right now do not produce meaningful sentences, so I can still detect them; but with a tool at this level, every app in the store could have five stars with plausible comments. That would kill many rating systems.

kdaej commented 4 years ago

This article briefly mentions possible misuse of the language model to generate texts. I wonder what legal issues or social problems such technology could cause. Also, it seems inevitable that people will eventually gain access to such models even though the authors decided not to share their code. How could we address the related issues?

VivianQian19 commented 4 years ago

I’m also struck by how closely the content GPT-2 generates, conditioned on a given prompt, mimics the quality of human text, as well as by the model’s performance on domain-specific language modeling tasks and on language tasks such as reading comprehension. It is true that this kind of large general language model can be beneficial for applications such as dialogue agents, AI writing assistants, and better speech recognition systems; at the same time, however, I’m concerned about the implications of malicious applications of such models if they fall into the wrong hands. I think this article gives a great starting point for thinking about the moral implications of AI. This also seems to be an issue in the legal field, where laws and regulations are constantly evolving to adapt to advances in technology. I wonder whether there are studies that examine these questions, and how such studies are designed.

cytwill commented 4 years ago

It is amazing to see the effort put into training on such big corpora. It is no wonder the model performs well on many text-related tasks, since it is trained on so many texts and with so many parameters. However, what I would like to see is whether it is yet possible for deep learning models to use a relatively simple structure and a smaller corpus to complete content-related tasks with high accuracy. Ordinary users and studies generally have neither so much text nor the computational capacity, so a "small" but functional model might be more helpful in those cases.