jamesallenevans opened 4 years ago
It is particularly fun to see the examples of using RNNs to generate "fake" Shakespeare, Wikipedia articles, Linux source code, etc. However, it also strikes me that RNNs seem to be better at identifying and replicating "formats" (e.g., the spacing between monologues, the layout of formulas) than content (e.g., omitting proofs, generating code and graphs that do not make sense). Moreover, as a black-box deep learning model, it is hard to interpret what kind of rules or patterns the RNN learns in application.
Under these circumstances, how do we convince a social science audience that the RNN is a suitable method? I noticed that this week's exemplary articles are very technical. Does that mean the application of RNNs to social science research is still limited? If not, are there any examples of using RNNs to solve problems of social interest?
My question is similar to @ziwnchen's. I notice that while RNNs produce some interesting and stylish texts, their meanings are often quite ambiguous and cannot be interpreted very clearly. Could you provide some articles that apply RNNs in social science studies?
Karpathy's post shows both the advantages and disadvantages of neural nets. While it is true that adding a memory component to a neural network improves its ability to learn patterns in temporal/sequential data, it also increases the opacity of the inner workings of the learning (training) procedure, as @ziwnchen has alluded to.
Karpathy's idea to show the excitations of certain neurons is fine, but it can lead to a common misunderstanding which is important to avoid when trying to interpret black-box systems. There is a distinction between showing that there is some reaction (neuron excitation) to a certain kind of input (URLs) and showing that that specific input is actually being used to make a decision. The conflation of the two is a common problem that famously occurred in cognitive psychology during the study of another black box: the mind.
Learning from the mistakes made in cognitive psychology, computer scientists trying to interpret neural nets have recently begun to "silence" the specific neurons that get excited by certain inputs and study the difference this "silencing" creates in the output (as opposed to just looking at neuron excitation patterns). In Karpathy's case, that means ablating away the neurons that get excited when URLs are encountered, to see if doing so makes it difficult for the language model to deal with URLs effectively. This method increases the interpretability of a neural network as we begin to understand the function of an individual neuron (though a neuron's behavior is itself a function of many parameters, often called weights). Doesn't such a result point to an inverse relation between the number of neurons/parameters and interpretability? As we go from word embeddings to more complex architectures like RNNs, to even more complex ones like ELMo and BERT (which we are seeing this week), are we moving away from interpretability for the sake of better descriptive models that can capture more and more complex patterns? How do we balance these two factors (memorizing more patterns vs. interpretability), which seem to be in tension with each other?
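To make the "silencing" idea concrete, here is a minimal sketch (my own toy illustration, not Karpathy's code): a tiny vanilla RNN with random weights, where we zero out one hidden unit at every step and compare the next-character distribution with and without the ablation. The dimensions, weights, and choice of unit are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 8                       # toy vocabulary size and hidden size
Wxh = rng.normal(0, 0.1, (H, V))  # input-to-hidden weights (random, for illustration)
Whh = rng.normal(0, 0.1, (H, H))  # hidden-to-hidden weights
Why = rng.normal(0, 0.1, (V, H))  # hidden-to-output weights

def run(seq, ablate_unit=None):
    """Run the toy RNN over a sequence of symbol ids; optionally zero one hidden unit."""
    h = np.zeros(H)
    for t in seq:
        x = np.zeros(V)
        x[t] = 1.0                     # one-hot input
        h = np.tanh(Wxh @ x + Whh @ h)
        if ablate_unit is not None:
            h[ablate_unit] = 0.0       # "silence" this neuron at every step
    logits = Why @ h
    return np.exp(logits) / np.exp(logits).sum()   # softmax over next symbols

seq = [0, 1, 2, 3]
print(run(seq))                   # normal next-symbol distribution
print(run(seq, ablate_unit=3))    # distribution with one neuron silenced
```

The interpretability claim then rests on whether the ablated model systematically fails on the inputs that neuron responded to, rather than on the excitation pattern alone.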
The applications in this post were fun, and while the author provided some commentary on the success of the text generated by the RNN, how do we systematically evaluate the outputs of a neural network? This is less of a problem for classification tasks, since we can evaluate RNNs using standard supervised learning metrics. For text generation, of course, we can read the output and judge whether it makes sense to human beings, but that is not scalable.
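One standard, scalable complement to human judgment (a common metric, not something the post itself reports) is per-character perplexity on held-out text: how surprised the model is by text it has never seen. A minimal sketch, assuming we already have the probability the trained model assigned to each true next character:

```python
import numpy as np

# Hypothetical probabilities the model assigned to each true next character
# of a held-out passage (one number per character position).
probs_of_true_chars = np.array([0.31, 0.05, 0.62, 0.44, 0.12])

cross_entropy = -np.mean(np.log(probs_of_true_chars))  # average negative log-likelihood
perplexity = np.exp(cross_entropy)                      # lower perplexity = better model
print(cross_entropy, perplexity)
```

This measures predictive fit rather than whether the generated text is "good", so it does not fully replace human evaluation, but it scales.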
I'd like to know a bit more about the difference between RNNs and plain NNs. As @arun-131293's post mentions, an RNN adds a memory component. What does doing so mean? In particular, the post uses an LSTM (long short-term memory, according to the post); are there other kinds? The explanation of how RNNs work resembles how NNs work so much that I might not be getting the (possibly subtle?) differences. Thanks.
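As I read Karpathy's post, the "memory component" is essentially one extra term in the hidden-layer update: the new hidden state depends on the previous hidden state as well as the current input. A minimal sketch of that single step (dimensions made up for illustration; the weights would come from training):

```python
import numpy as np

H, V = 8, 5                   # hidden size and vocabulary size (illustrative)
W_xh = np.zeros((H, V))       # input -> hidden
W_hh = np.zeros((H, H))       # previous hidden -> hidden (this is the "memory")
W_hy = np.zeros((V, H))       # hidden -> output

def step(x, h):
    # A feedforward net would compute np.tanh(W_xh @ x) and forget the past;
    # the recurrent net also mixes in the previous hidden state via W_hh.
    h = np.tanh(W_xh @ x + W_hh @ h)
    y = W_hy @ h              # unnormalized scores for the next character
    return y, h
```

LSTMs and GRUs are more elaborate versions of this same recurrence, with gates that control what is kept in memory.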
I was initially intrigued by the section “Understanding what’s going on,” specifically the subsection “The evolution of samples while training”. While RNNs/LSTMs are often criticized for not being decipherable by humans, here seemed to be a way to get at how the model does it. Karpathy concludes: “The picture that emerges is that the model first discovers the general word-space structure and then rapidly starts to learn the words; First starting with the short words and then eventually the longer ones. Topics and themes that span multiple words (and in general longer-term dependencies) start to emerge only much later.” But is this not imposing our human intuition where doing so may not be warranted? Is it not just as possible that the algorithm discovers topics and themes early on, but is not yet able to express them in sequences of strings that make sense to humans?
I am wondering about an RNN's sensitivity to the length of sequences and the dependency among sequences. As the author says, to train an RNN as a Paul Graham generator:
We’ll train with batches of 100 examples and truncated backpropagation through time of length 100 characters.
This means that, first, the RNN cannot be trained on arbitrarily long sequences, and second, the training data is often obtained by truncating the original sequences, which cuts dependencies that span the truncation boundaries. How should we understand these limitations?
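For concreteness, here is a minimal sketch (my own illustration, not the author's code) of how truncated backpropagation through time is usually set up: the text is cut into chunks of 100 characters, gradients flow only within a chunk, but the hidden state value is carried across chunks so some longer-range context survives.

```python
def chunks(text, seq_len=100):
    """Yield consecutive (inputs, next-character targets) chunks of seq_len characters."""
    for i in range(0, len(text) - seq_len, seq_len):
        yield text[i:i + seq_len], text[i + 1:i + seq_len + 1]

# Hypothetical training loop; `model` and `detach` are stand-ins for whatever
# framework actually computes the loss and gradients over one chunk:
#
# h = model.initial_state()
# for inputs, targets in chunks(corpus, seq_len=100):
#     loss, h = model.forward_backward(inputs, targets, h)  # gradients stop at the chunk edge
#     h = detach(h)  # keep the hidden state's value, drop its gradient history
```

So the model still "sees" context beyond 100 characters through the carried hidden state, but it is never explicitly trained on dependencies longer than the truncation window.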
The application to writing LaTeX was fun and reminded me of Mathgen in particular, though I don't think Mathgen's implementation uses RNNs.
My question is about the Turing completeness of RNNs. I've understood Turing completeness to mean computer-like, i.e., that something Turing complete can perform any computation. The author says:
In fact, it is known that RNNs are Turing-Complete in the sense that they can to simulate arbitrary programs (with proper weights). But similar to universal approximation theorems for neural nets you shouldn’t read too much into this. In fact, forget I said anything.
Can we elaborate on universal approximation theorems for neural nets, and why we shouldn't read too much into this? Having a Turing-complete algorithm (i.e., one that could generate code) seems really revolutionary.
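For reference, the classic universal approximation result (Cybenko 1989, Hornik 1991) says roughly that a single hidden layer with enough units can approximate any continuous function on a compact set arbitrarily well. Crucially, it is an existence statement: it says nothing about whether training can actually find those weights, or how many units are needed, which is one reason not to read too much into it.

```latex
% For any continuous f on a compact set K \subset \mathbb{R}^d, any \varepsilon > 0,
% and a suitable activation \sigma, there exist N and weights a_i, b_i \in \mathbb{R},
% w_i \in \mathbb{R}^d such that
\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} a_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon
```

The Turing-completeness result for RNNs has the same flavor: it assumes idealized conditions (e.g., unbounded precision and hand-picked weights), so it does not mean a trained RNN behaves like a general-purpose computer in practice.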
It is very interesting that this article implements an RNN to generate text character by character, that is, a character-level language model. I am thinking about the possibility of implementing an RNN at the phrase level, or at an even more macro level such as the sentence. Compared with a character-level model trained on character sequences of length k, which one would be more efficient and more accurate?
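One concrete trade-off between the two choices is vocabulary size versus sequence length, which a quick count illustrates. This is just my own sketch of the trade-off (with "input.txt" standing in for any training corpus, e.g. the Shakespeare file from the post), not something computed in the article:

```python
text = open("input.txt").read()

char_vocab = sorted(set(text))            # character level: a small, fixed symbol set,
char_positions = len(text)                # but very long sequences to model

word_vocab = sorted(set(text.split()))    # word/phrase level: far shorter sequences,
word_positions = len(text.split())        # but a much larger (open-ended) vocabulary

print(len(char_vocab), char_positions)
print(len(word_vocab), word_positions)
```

Character-level models have tiny output layers and can spell out unseen words, but must learn long-range dependencies; word-level models make each prediction more meaningful but struggle with rare and out-of-vocabulary words.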
I somewhat agree with @ziwnchen that "RNN seems to be better at identifying and replicating 'formats'", but I believe the limitation lies more in the size of the input than in the algorithm itself. I believe that once we have a large enough input, the machine could produce something that better mimics human language.
I remember seeing a piece of news that a Japanese A.I. program wrote a short novel, and it passed the first round of screening for a national literary prize. And that was in 2016, four years ago.
Replying to the question above about whether the model really discovers word-space structure first and topics later: there is an easy way to empirically test this. You can take the model at an early stage (after all, a model is just a bunch of weights stored in matrices) and test it on NLP tasks that require general word-space structure (like sentence segmentation), and then take the model at later stages and test it on NLP tasks that require long-term dependencies (like translation). If an early model is good at sentence segmentation but not at translation or the other tasks that require long-range dependencies, that adds tremendous support to Karpathy's hypothesis. It's not a perfect experiment, for a bunch of reasons, but it provides an approach to empirically test whether the information gained by this black box arrives in the order he suggests.
Incidentally, this is similar to the approach used to test what kind of information is stored in each of the four output layers of BERT, which we will see this week.
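A rough sketch of that checkpoint-probing protocol (my own illustration; the loader and the two task evaluations are stubs standing in for whatever one would actually run):

```python
# Hypothetical probing experiment: compare what early vs. late checkpoints can do.

def load_model(path):
    return path        # stub: a real version would restore the saved weight matrices

def eval_segmentation(model):
    return 0.0         # stub: score on a short-range structure task

def eval_translation(model):
    return 0.0         # stub: score on a long-range dependency task

checkpoints = ["model_iter_100.npz", "model_iter_2000.npz"]   # made-up file names

results = {p: {"segmentation": eval_segmentation(load_model(p)),
               "translation":  eval_translation(load_model(p))}
           for p in checkpoints}

# If early checkpoints do well on segmentation but poorly on translation, while later
# ones improve mainly on translation, that supports the claimed order of learning.
print(results)
```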
Given that the RNN seems to be better at identifying and replicating "formats" (as mentioned by @ziwnchen), I wonder if it is more useful for classifying genres rather than topics in real applications. Besides, the author seems to argue that RNNs are especially useful in the arena of computer vision compared to common NNs (maybe I have got this wrong). I wonder why that is the case.
The reading mentions words and characters as outputs. Can these models also output phrases or full sentences (e.g., like Gmail's Smart Compose)?
This post was published a while ago. I'm wondering whether we now have an intuitive understanding of why RNNs work so well.
Is it possible for an RNN to be biased or racist if the corpus it was trained on was biased? How do you de-bias an RNN model in that case?
The author mentions that it takes 2,000 iterations of training a model on War and Peace to get sensible output. What does it mean to train a model 2,000 times on the same sample, and how are the outputs of each iteration combined? I think this may be a fundamental part of the RNN that I'm not quite clear on.
It is really interesting that we can generate Shakespeare and Wikipedia pages using an RNN. I am wondering how a specific confidence number is assigned to each character in different layers.
we see that in the first time step when the RNN saw the character “h” it assigned confidence of 1.0 to the next letter being “h”, 2.2 to letter “e”, -3.0 to “l”, and 4.1 to “o”.
How is this confidence number calculated? What does this confidence number mean?
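As I read the post, these numbers are the unnormalized scores (logits) produced by the output layer; a softmax turns them into probabilities over the next character, and training pushes the probability of the correct next character up. A quick check with the numbers from the quote:

```python
import numpy as np

# Unnormalized scores the RNN assigned to the next letter being h, e, l, o.
logits = np.array([1.0, 2.2, -3.0, 4.1])

probs = np.exp(logits) / np.exp(logits).sum()   # softmax
print(dict(zip("helo", probs.round(3))))        # "o" gets most of the probability mass
```

So a "confidence" of 4.1 is not a probability itself; it only becomes one relative to the other scores after the softmax.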
It is intriguing to see that we can imitate and generate text in an existing style, such as Shakespeare's. I'm also wondering how difficult it would be to assess how accurate the generated text is. Since the RNN is a black box, how could we know what is actually working?
I am quite interested in applying RNNs to computer vision, specifically in classifying images, which might help us analyze the overwhelming number of memes online. However, how many images (and of what size) are typically needed in these approaches, and how many training iterations would be required, given that a large number is already required for text?
There are a few interesting examples in this blog post, such as generating Shakespeare's language, Markdown text, LaTeX code, and programming code. I am not so familiar with how the trained model generates the samples. To generate a piece of "fake" Shakespeare or a LaTeX document, what are the inputs? Or are these samples generated entirely by the model? For a recurrent neural network, is it the case that at least the first character needs to be specified to start the recursive steps? If so, how do the examples in this blog post start generating the whole sample, and when does generation end (does it terminate after a final character)? Furthermore, if the process is automated after specifying the starting input, will the result be the same across repeated trials with the same initial input? (PS: I think the other readings this week answer some of my questions.)
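As I understand the post, generation works by seeding the network with a character (or a short primer), sampling the next character from the softmax distribution, and feeding that sample back in as the next input; because each step samples rather than always taking the most likely character, repeated runs with the same seed generally give different text. A minimal sketch of that loop, where `step` is a hypothetical one-step forward function of an already trained model:

```python
import numpy as np

rng = np.random.default_rng()

def sample_text(step, h0, seed_char, char_to_ix, ix_to_char, length=200, temperature=1.0):
    """Autoregressive sampling: feed each sampled character back in as the next input."""
    h, ix = h0, char_to_ix[seed_char]
    out = [seed_char]
    for _ in range(length):                 # stop after a fixed length (or at a chosen end symbol)
        logits, h = step(ix, h)             # hypothetical: one forward step of the trained RNN
        p = np.exp(logits / temperature)
        p /= p.sum()                        # softmax over the vocabulary
        ix = rng.choice(len(p), p=p)        # sample, so repeated runs differ
        out.append(ix_to_char[ix])
    return "".join(out)
```

Lowering the temperature makes the sampling more deterministic and conservative; raising it makes the output more varied but also more error-prone.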
One of the setbacks of RNNs is:
... they memorize sequences extremely well, but they don’t necessarily always show convincing signs of generalizing in the correct way.
As someone with little background in modeling and mathematics, it seemed to me at first that a soft attention scheme for memory addressing should reduce this weakness in generalization, at least to some extent. Why are RNNs not suitable for extending their structure to novel situations? Is it an inherent limitation that they are good at memorizing sequences but not hierarchical structures?
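For reference, the soft attention scheme Karpathy alludes to replaces a hard memory lookup with a differentiable weighted average: the network reads a little bit from every memory slot, with weights given by a softmax. In the standard formulation (my summary, not the post's notation):

```latex
% Attention weights over memory slots m_1, ..., m_N, given a query vector q:
\alpha_i = \frac{\exp\!\big(\mathrm{score}(q, m_i)\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{score}(q, m_j)\big)},
\qquad
r = \sum_{i=1}^{N} \alpha_i \, m_i
% r is the "soft read": fully differentiable, so it can be trained end-to-end with backpropagation.
```

Soft addressing makes the memory trainable, but it does not by itself give the model an explicit notion of hierarchical structure, which may be why memorizing sequences remains easier than generalizing compositionally.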
I want to hear a bit more about the specific differences between LSTMs and vanilla RNNs and in what situations you might use one over the other; the author mentions that there are differences to note, but doesn't go into much detail and ultimately uses the terms interchangeably. Beyond machine translation and speech recognition, what are some social science applications for these approaches?
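The short version, as I understand it: a vanilla RNN updates its hidden state with a single tanh layer, while an LSTM adds a cell state plus input, forget, and output gates that control what gets written, kept, and read, which makes long-range dependencies easier to learn and mitigates vanishing gradients. The standard equations (textbook form, not taken from the post):

```latex
% Vanilla RNN:
h_t = \tanh\!\big(W_{xh} x_t + W_{hh} h_{t-1}\big)

% LSTM:
f_t = \sigma\!\big(W_f [h_{t-1}, x_t] + b_f\big)              % forget gate
i_t = \sigma\!\big(W_i [h_{t-1}, x_t] + b_i\big)              % input gate
\tilde{c}_t = \tanh\!\big(W_c [h_{t-1}, x_t] + b_c\big)       % candidate cell update
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t               % new cell state
o_t = \sigma\!\big(W_o [h_{t-1}, x_t] + b_o\big)              % output gate
h_t = o_t \odot \tanh(c_t)                                    % new hidden state
```

In practice people default to the LSTM (or GRU) whenever the sequences have dependencies longer than a few steps, which is why the terms often get used interchangeably in applied writing.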
This was a fun read! I especially enjoyed seeing the author's application of RNNs to literary works. But like @katykoenig mentions, how would one systematically evaluate these outputs? The author stopped iterating when the output met his human expectations. So while there seems to be scientific interest in studying and understanding the unsupervised outputs of neural networks, we're still judging the value of these outputs by human standards.
More broadly, the article seems to reflect a growing trend in machine learning where there is value in non-human-interpretable outputs and predictions. As a social science student, I think it's reasonable, and maybe even necessary, to adhere to human-interpretable standards. But my question is: am I operating with a limited view of the potential of neural nets?
I am really interested in this research on how to produce fake Shakespeare using black-box algorithms. Although this method seems to put more emphasis on format than content, I am wondering if there will be improvements in its applications, like producing real poems, and using its evaluation mechanism to polish other machine-generated poems.
I really enjoyed the "fake Shakespeare" generated by the RNN. I wonder how we can apply these algorithms to social science questions. Also, how do we identify the drawbacks or limitations of the algorithms for certain tasks, since the models are difficult to interpret?
I have heard that some previous research related to neural networks can imitate famous works such as music or poems. Some people worry that if this kind of method one day becomes mature enough, many poets or musicians might lose their jobs, because AI can produce similar works without further labor costs. Another issue is that the patterns the AI actually learns are hard to extract; we can see the result but barely the processes or details. So are there any improvements so far that reveal insights into the learning process of neural networks?
The reading is great at explaining how RNNs work. My understanding of existing machine learning models is that they are not good at picking up sarcasm or nuance. I am wondering if RNNs would be promising for recognizing these textual characteristics.
Last quarter in Computation and Identification of Cultural Patterns, we tried this method on Taylor Swift's song lyrics. We didn't get meaningful sentences, maybe due to the limited number of layers, but the patterns of Taylor Swift were there. The output really needs to be evaluated carefully by people, which is a bit frustrating, especially for such an interesting method.
I was wondering if exemplary texts could be generated by a recurrent neural network. In psychology research studies, participants are asked to read stories with different narratives to see what words, phrases, or emotional states are related to neural activity. Because it is difficult for researchers to come up with objective exemplary stories, could a neural network help with this?
It is fun to read the texts generated by the RNN model, and I'm impressed by the model's "creativity" (e.g., in terms of generating non-existent URLs). As others have mentioned, RNN models seem to be good at imitating the "structure" while the content often does not make much sense. I'm wondering how models like this can be applied to social science research.
This article clearly explains the basic logic of RNNs and the idea of taking previous information into account in a deep neural network when analyzing textual information. However, I am wondering how long the "context window" usually is for an RNN, and why that range of window length is used.
Karpathy, Andrej. 2015. "The Unreasonable Effectiveness of Recurrent Neural Networks." Blog post. http://karpathy.github.io/2015/05/21/rnn-effectiveness/