Computational-Content-Analysis-2020 / Readings-Responses

Repository for organising "exemplary" readings and posting responses.

Deep Classification, Embedding & Text Generation - Jurafsky & Martin (9) 2019 #42

Open jamesallenevans opened 4 years ago

jamesallenevans commented 4 years ago

Jurafsky, Daniel and James H. Martin. 2015. Speech and Language Processing. Chapter 9 (“Sequence Processing with Recurrent Neural Networks”).

lkcao commented 4 years ago

I am not sure, but I have heard it claimed that deep learning research is mainly about tuning hyperparameters, and that this is especially true of RNNs because there are so many hyperparameters to tune. Is that true? If so, how can we combine deep learning with social science, since their aims are nearly opposite (accurate prediction without any interpretation vs. the pursuit of explanation, with disdain for pure data mining)?

arun-131293 commented 4 years ago

> I am not sure, but I have heard it claimed that deep learning research is mainly about tuning hyperparameters, and that this is especially true of RNNs because there are so many hyperparameters to tune. Is that true? If so, how can we combine deep learning with social science, since their aims are nearly opposite (accurate prediction without any interpretation vs. the pursuit of explanation, with disdain for pure data mining)?

Yes, there is a tension between the two, as I suggested in my post under the "Unreasonable Effectiveness of Neural Networks" listing. The reason is that interpretability of neural networks is currently a matter of understanding what individual neurons do, and the more neurons there are, the harder that is (it used to be based on looking at the representations/outputs a network creates). This is not just because the number of neurons to examine becomes too high, but also because in neural networks, just as in DNA, a single neuron may not be assigned to a single task (e.g., URL detection); rather, a group of neurons may interact to perform one task, and a single neuron may play different roles in different tasks.

Therefore, when we increase the number of parameters (which usually corresponds to an increase in the number of neurons), we may be increasing the "distributed-ness" of each task: more and more neurons become involved in a single task, and a single neuron becomes involved in more and more tasks (e.g., URL detection, end-of-sentence detection, and POS detection) while mattering less to each task on its own. Removing a single neuron to observe the effect on the output, which in turn helps us understand that neuron's original function, is therefore harder in a larger network, since the output may barely change at all, or may change a little along multiple dimensions.
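The "lesioning" idea above can be sketched in a few lines of NumPy. Everything here is hypothetical (a random toy network, not a trained one); the point is only the procedure: zero out one hidden unit and measure how much the output distribution moves.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy one-layer network: y = softmax(V @ tanh(W @ x)), 8 hidden neurons.
W = rng.normal(size=(8, 4))   # hidden weights
V = rng.normal(size=(3, 8))   # output weights (3 classes)
x = rng.normal(size=4)        # one input

def forward(W, V, x, ablate=None):
    h = np.tanh(W @ x)
    if ablate is not None:
        h[ablate] = 0.0       # "lesion" one neuron
    z = V @ h
    e = np.exp(z - z.max())
    return e / e.sum()

base = forward(W, V, x)
# Effect of removing each neuron on the output; in larger, more
# distributed networks each individual effect tends to be smaller.
effects = [np.abs(forward(W, V, x, ablate=i) - base).sum() for i in range(8)]
```

In a real interpretability study one would average such effects over many inputs and tasks, which is exactly where the distributed-ness problem bites: one neuron's ablation can nudge several unrelated outputs at once.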

katykoenig commented 4 years ago

The chapter's examples of RNNs explicitly use only the hidden layer from the previous step: a hidden layer h{t} is computed by multiplying the previous hidden layer h{t-1} by its weight matrix, adding the input multiplied by its weight matrix, and passing the sum through an activation function. I am curious whether it is ever common/beneficial to have longer cycles in the RNN (e.g., using h{t-2} to calculate h{t})? Additionally, I am wondering how such cycles would affect the choice of activation function (if at all)?
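The update being described can be sketched directly (a minimal NumPy sketch; the shapes and variable names are mine, not the book's). Note how only h_{t-1} enters each step:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(size=(5, 5))   # recurrent weights, applied to h_{t-1}
W = rng.normal(size=(5, 3))   # input weights, applied to x_t

def rnn_step(h_prev, x_t):
    # h_t = g(U h_{t-1} + W x_t); only the immediately previous
    # hidden state is used, never h_{t-2} or earlier directly.
    return np.tanh(U @ h_prev + W @ x_t)

h = np.zeros(5)
for x_t in rng.normal(size=(4, 3)):   # a length-4 input sequence
    h = rnn_step(h, x_t)
```

Information from h_{t-2} does reach h_{t}, but only indirectly through h_{t-1}, which is part of why gated architectures (LSTM/GRU) were introduced.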

ckoerner648 commented 4 years ago

I was intrigued by Jurafsky and Martin’s statement that sentiment analysis disregards the sequence of a text. Following the findings of psychologists reported in Daniel Kahneman’s “Thinking Fast and Slow,” it is reasonable to assume that readers do not come away from a text with the sentiment that the average sentence created in them, but that their sentiment follows the peak-end rule, i.e., that the sentence that created the most intense emotion and the last sentence of the text drive most of the outcome. It would be interesting to run an experiment that asks readers how they feel about a text and compare their answers to the results of sentiment analyses that either ignore or take into account the peak-end rule.
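The proposed comparison is easy to sketch with made-up per-sentence sentiment scores (all numbers hypothetical): an order-insensitive average versus a peak-end summary of the same sequence.

```python
# Hypothetical per-sentence sentiment scores for one text, in reading order
scores = [0.1, -0.2, 0.9, 0.0, -0.1]

mean_sentiment = sum(scores) / len(scores)   # order-insensitive average
peak = max(scores, key=abs)                  # most emotionally intense sentence
peak_end = (peak + scores[-1]) / 2           # peak-end rule: peak + last sentence
```

For this toy text the average (0.14) and the peak-end score (0.4) diverge noticeably, which is exactly the gap the proposed reader experiment would probe.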

deblnia commented 4 years ago

Like @ckoerner648, I'd like a clarification on the temporality of text. They seem to reject sentiment analysis because it's a bag-of-words method (i.e., there is no inherent sequencing). But if we were to use bi-grams or tri-grams or some kind of co-occurrence preservation, wouldn't sentiment analysis be just as temporal? Am I just missing the point they're trying to make regarding temporality?

ccsuehara commented 4 years ago

I am also intrigued by how temporality is managed in these models: "the network needs to learn to forget information that is no longer needed and to remember information required for decisions still to come." Since memory is a crucial part of RNNs, how do they manage it?

laurenjli commented 4 years ago

The chapter states that "The forget gate computes a weighted sum of the previous state’s hidden layer and the current input and passes that through a sigmoid. This mask is then multiplied by the context vector to remove the information from context that is no longer required." Similar to others above, I'm confused about how models know how to remove what is "not needed" anymore.
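One way to see it: the model does not "know" what is unneeded; the forget gate is just a trained, per-dimension mask in (0, 1), and its weights are adjusted during training so that down-weighting certain context dimensions reduces the loss. A minimal NumPy sketch of just the forget-gate step the quote describes (random weights standing in for trained ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
U_f = rng.normal(size=(4, 4))   # forget-gate weights for h_{t-1}
W_f = rng.normal(size=(4, 3))   # forget-gate weights for x_t

h_prev = rng.normal(size=4)     # previous hidden state
c_prev = rng.normal(size=4)     # previous context (cell) state
x_t = rng.normal(size=3)        # current input

# Weighted sum of previous hidden layer and current input, through a sigmoid:
f_t = sigmoid(U_f @ h_prev + W_f @ x_t)
# Elementwise multiply by the context vector: dimensions where f_t is near 0
# are attenuated ("forgotten"), dimensions near 1 pass through.
c_masked = f_t * c_prev
```

So "removing what is not needed" is never a symbolic decision; it is gradient descent discovering gate weights for which forgetting certain dimensions, in certain input contexts, improves prediction.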

HaoxuanXu commented 4 years ago

It's interesting to see that RNNs use a temporal structure to learn the text. Is the forget gate robust enough for different languages, which may have different lengths of context within or across sentences?

di-Tong commented 4 years ago

What are the criteria for deciding the optimal number of layers for stacked RNNs? I understand it is contingent upon the nature of the application and the training data, but how? Setting aside efficiency concerns, is a larger number of layers associated with better model performance?

heathercchen commented 4 years ago

I am wondering whether the length of the input window affects the output. If it does, which seems obvious, what kind of influence does it have? Also, in my opinion, if we increase the length of the input window, we capture more information and a more general sense of what a sentence wants to convey; does that mean it will improve the output?

luxin-tian commented 4 years ago

The "Generation with Neural Language Models" section answers how the auto-regressive generation starts and terminates.

> - To begin, sample the first word in the output from the softmax distribution that results from using the beginning of sentence marker, `<s>`, as the first input.
> - Use the word embedding for that first word as the input to the network at the next time step, and then sample the next word in the same fashion.
> - Continue generating until the end of sentence marker, `</s>`, is sampled or a fixed length limit is reached.

But this only describes the process of generating a sentence, which starts and ends with sentence markers. I wonder, for those examples of generating a markdown/LaTeX document, how the process gets initiated and when it terminates. That is, is there also a marker element that can be sampled from the distribution to start/terminate the auto-regressive generating process?
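The three steps quoted above can be sketched as a loop. This is a toy illustration, not the book's code: a five-word vocabulary and a random stand-in for the trained network's softmax, just to show the role of the `<s>`/`</s>` markers and the length limit.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["<s>", "</s>", "the", "cat", "sat"]
EOS = vocab.index("</s>")

def next_word_dist(tokens):
    # Stand-in for the network's softmax over the vocabulary,
    # conditioned on everything generated so far.
    logits = rng.normal(size=len(vocab))
    logits[0] = -1e9                      # never re-emit <s>
    e = np.exp(logits - logits.max())
    return e / e.sum()

tokens = [0]                              # start from the <s> marker
for _ in range(20):                       # fixed length limit
    w = rng.choice(len(vocab), p=next_word_dist(tokens))
    tokens.append(int(w))
    if w == EOS:                          # stop when </s> is sampled
        break
generated = [vocab[t] for t in tokens]
```

For document-level generation the same mechanism applies if the training corpus wraps whole documents in begin/end tokens; otherwise generation simply runs until the fixed length limit in the loop above.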

wunicoleshuhui commented 4 years ago

Since this chapter discussed many aspects of and applications for RNNs, I felt it was somewhat difficult to keep track of the features described. For example, is end-to-end training a feature of most uses of RNNs, or is it specific to sequence classification? How do we keep track of which uses of RNNs share which features?

bjcliang-uchi commented 4 years ago

Is there any useful information we can extract from the embeddings? If so, how, and which are more helpful, the encoder or the decoder embeddings?

sunying2018 commented 4 years ago

I have a question about stacked RNNs. Since this is a step-by-step algorithm and we do not know about the next RNN layer while computing the current one, will the order of the different RNN layers matter to the performance of the model globally?
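One way to see why order matters: in a stacked RNN, the entire output sequence of layer k is the input sequence of layer k+1, so the layers are composed, not interchangeable. A sketch, assuming the simple tanh cell from earlier in the chapter and two layers with different (made-up) sizes:

```python
import numpy as np

rng = np.random.default_rng(4)

def run_rnn(U, W, xs):
    # Run one simple RNN layer over a whole sequence; return every h_t.
    h = np.zeros(U.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(U @ h + W @ x_t)
        out.append(h)
    return np.array(out)

xs = rng.normal(size=(6, 3))   # length-6 input sequence, 3-dim inputs
# Layer 1: 5 hidden units; layer 2: 4 hidden units, fed by layer 1's outputs
U1, W1 = rng.normal(size=(5, 5)), rng.normal(size=(5, 3))
U2, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 5))
h1 = run_rnn(U1, W1, xs)       # (6, 5): layer 1's full output sequence
h2 = run_rnn(U2, W2, h1)       # (6, 4): layer 2 sees only layer 1's outputs
```

Since h2 is a function of h1 but not vice versa, swapping the layers gives a genuinely different model; in practice all layers are still trained jointly by backpropagating through the whole stack.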

rkcatipon commented 4 years ago

> The chapter states that "The forget gate computes a weighted sum of the previous state’s hidden layer and the current input and passes that through a sigmoid. This mask is then multiplied by the context vector to remove the information from context that is no longer required." Similar to others above, I'm confused about how models know how to remove what is "not needed" anymore.

I also had this question, because the Karpathy article did not offer a systematic way to evaluate output, and I wanted to learn more about how information is evaluated during the generation process.

YanjieZhou commented 4 years ago

The tradeoff between the interpretability and the accuracy of the model has been a critical issue in deep learning, but conversely, I wonder whether it is truly necessary to interpret the results when using deep learning. Conveying the information hidden behind the models is always difficult, requiring plenty of graphs and constrained by time; and whenever external parameters change, new models and new interpretations follow, which is very costly.

tzkli commented 4 years ago

It seems we have some intuitions about why some neural network approaches work better than others on certain tasks, for example by striking a finer balance between retaining and trimming information. But these models are still highly uninterpretable. I guess at this stage they are more useful for uncovering "unobservables" than for explaining particular phenomena?

alakira commented 4 years ago

After the attention mechanism, and then the Transformer, outperformed RNNs on natural language tasks, what are the remaining strengths of RNNs, especially for solving natural language problems?

ziwnchen commented 4 years ago

I'm wondering about the current application of RNNs in language modeling. I heard that almost all advanced language models that perform well on GLUE (General Language Understanding Evaluation) are self-supervised language models (e.g., BERT). I'm a little confused about the current status quo of language modeling and how RNNs are being developed now.

sanittawan commented 4 years ago

In section 9.2.3 on sequence classification, the authors mention that RNNs can also be used for document-level topic classification. I am not sure if my understanding is correct, but this method differs from topic modeling in that RNNs require a fixed set of labels. Is this understanding correct?

kdaej commented 4 years ago

It seems that deep networks behave well when there is enough data to train on. Nevertheless, some languages do not work as well as English does. I wonder what might make some languages harder for the machine to learn.

VivianQian19 commented 4 years ago

The chapter gives a very detailed introduction to different types of RNN. Some are simple, such as simple recurrent networks, and some are more complex and also more powerful, such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). The article touches a bit on the differences between LSTM and GRU, such as that the former's additional parameters drive up the cost of training, while the GRU reduces the number of gates to two, i.e., the reset and update gates. I wonder which of these two RNNs is more often used in application, and in what contexts?
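The two-gate structure mentioned above can be made concrete. A minimal NumPy sketch of a single GRU step (random weights, my own variable names); note there is no separate cell state and only the reset gate r and update gate z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
d_h, d_x = 4, 3
# Only two gates (reset r, update z) plus the candidate-state weights
U_r, W_r = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x))
U_z, W_z = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x))
U_h, W_h = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x))

def gru_step(h_prev, x_t):
    r = sigmoid(U_r @ h_prev + W_r @ x_t)            # reset gate
    z = sigmoid(U_z @ h_prev + W_z @ x_t)            # update gate
    h_tilde = np.tanh(U_h @ (r * h_prev) + W_h @ x_t)  # candidate state
    # Interpolate between old state and candidate: the update gate does
    # the work of both the LSTM's forget and input gates.
    return (1 - z) * h_prev + z * h_tilde

h = gru_step(np.zeros(d_h), rng.normal(size=d_x))
```

Counting the weight matrices shows the cost difference the chapter alludes to: three matrix pairs here versus four for an LSTM (forget, input, output, and candidate), which is one common practical reason to reach for a GRU first when data or compute is limited.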

cytwill commented 4 years ago

I have some similar questions about the application of neural networks in language modeling: are there any benchmarks or qualified ideas to guide us in choosing among different neural networks? Based on their different structures, these types of networks might be suited to different types of corpora or word embeddings. So how can we judge which one to use, rather than trying them all?