Computational-Content-Analysis-2020 / Readings-Responses

Repository for organising "exemplary" readings and posting responses.

Deep Classification, Embedding & Text Generation - Jurafsky & Martin (10) 2019 #43


jamesallenevans commented 4 years ago

Jurafsky, Daniel and James H. Martin. 2019. Speech and Language Processing. Chapter 10, “Encoder-Decoder Models, Attention, and Contextual Embeddings.”

lkcao commented 4 years ago

I am curious about how to build a network. From reading this piece, I have the impression that the same task can be completed by a simple RNN, an RNN with LSTM/GRU units, an RNN with attention, an RNN with LSTM/GRU and attention, etc. The only difference may be in their performance, but performance cannot be an objective criterion, because the same network structure may perform differently on different tasks, and we do not yet have an 'optimal' performance benchmark in deep learning. How should we build a network in a research scenario? In industry I guess they build it by intuition and budget, but those standards may not be accepted in academic study.

katykoenig commented 4 years ago

This chapter touches on beam search, stating that it combines breadth-first search with heuristic filtering. While the specification of the algorithm behind beam search was useful, for actual application, how would we choose the beam width?

Additionally, I understand that completed paths are returned, but I am still confused about how these paths join together to make sense to a human reader: we see in fig. 10.5 (p. 197) that when a sequence is completed, the search along that path ends, but these completed paths are not passed to future layers. Does this mean that what is returned from the algorithm is a list of sequences that can be substituted for one another, or does it return something more like a paragraph of sequences that should have meaning as a whole?
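For concreteness, here is a minimal Python sketch of beam search (my own toy example, not the textbook's pseudocode; `next_token_scores` is a made-up stand-in for the decoder). It shows how the beam width caps the frontier and how the algorithm returns a ranked list of alternative complete sequences rather than a paragraph:

```python
# Hypothetical toy scorer: log-probabilities of the next token given a prefix.
# In a real model these scores would come from the decoder network.
def next_token_scores(prefix):
    return {"a": -0.4, "b": -1.2, "<eos>": -1.0}

def beam_search(beam_width=3, max_len=5):
    frontier = [([], 0.0)]          # incomplete hypotheses: (tokens, log-prob)
    completed = []                  # hypotheses that ended with <eos>
    while frontier and len(completed) < beam_width:
        candidates = []
        for tokens, score in frontier:
            for tok, logp in next_token_scores(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        # keep only the beam_width best partial hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        frontier = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == "<eos>" or len(tokens) >= max_len:
                completed.append((tokens, score))
            else:
                frontier.append((tokens, score))
    # a ranked list of alternative candidate sequences, not one long paragraph
    return sorted(completed, key=lambda c: c[1], reverse=True)

print(beam_search())
```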

deblnia commented 4 years ago

I'm a little confused about how the attention mechanism overcomes the deficiencies of the other approaches to context. How does a fixed-length context vector that dynamically takes into account information from the entire encoder state at each step of decoding make a marked improvement over the previously mentioned methods?
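A rough numpy sketch of the difference (made-up shapes, plain dot-product scoring, nothing learned): instead of handing the decoder only the encoder's final hidden state, attention recomputes the context vector at every decoding step as a weighted sum over all encoder states, with weights that depend on the current decoder state.

```python
import numpy as np

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(6, 8))     # 6 source positions, hidden size 8
dec_state = rng.normal(size=8)           # decoder hidden state at one step

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(dec_state, enc_states):
    scores = enc_states @ dec_state      # one score per source position
    weights = softmax(scores)            # attention distribution
    return weights @ enc_states          # weighted sum: the context vector

# The fixed-length bottleneck: the same final encoder state at every step.
static_context = enc_states[-1]

# With attention, the context is rebuilt from ALL encoder states
# each time the decoder state changes.
dynamic_context = attention_context(dec_state, enc_states)
print(static_context.shape, dynamic_context.shape)   # both (8,), built differently
```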

laurenjli commented 4 years ago

Encoders are used to provide a contextualized representation for the decoder and are often good for summarization and captioning. I wonder if they could also be used for topic modeling?

ccsuehara commented 4 years ago

To be completely honest, I haven't yet grasped the general idea of encoder-decoder models, and it would be really helpful for me if you could go over the main features of this model, how it works, and what we can do with it, depending on the purposes we want to achieve.

jsmono commented 4 years ago

This article was a bit hard for me to absorb, so my questions mainly stem from my lack of knowledge of the field. At the end, the authors list a few applications of the method, such as summarizing content or simplifying sentences. I'm wondering if there is a way to learn how these applications apply the method to produce meaningful results. It could probably help me better understand the mechanism the authors discuss here.

vahuja92 commented 4 years ago

The chapter states that RNN models can better incorporate the context of documents because they use all past words, along with the current word, to predict the next word. In contrast, N-gram models and the sliding-window approach only take into account the current N-gram or window. Could using larger N-grams or sliding windows have a similar effect if doing so were computationally feasible?
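A toy numpy illustration of the structural difference (random weights, no training, dimensions made up): the RNN's hidden state is a recursive function of every previous word, while a window model only ever sees the last N inputs, so enlarging N grows the input size and parameter count rather than giving unbounded context.

```python
import numpy as np

rng = np.random.default_rng(1)
embeds = rng.normal(size=(10, 4))      # a "sentence" of 10 word vectors, dim 4
W_h, W_x = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

# RNN: h_t depends on x_t and h_{t-1}, hence (recursively) on ALL past words.
h = np.zeros(4)
for x in embeds:
    h = np.tanh(W_h @ h + W_x @ x)

# Sliding window (N=3): the predictor only ever sees the last 3 word vectors;
# everything earlier is simply dropped.
window = embeds[-3:].reshape(-1)

print(h.shape, window.shape)
```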

heathercchen commented 4 years ago

I am also wondering about the applications of the encoder-decoder network mentioned by the authors in section 10.4. The authors mention that this method can be used for image captioning. How does it work? For summarization and simplification of texts, we use sequences of text to predict sequences of characters, which are of the same unit. But for image captioning, the aim is to use pixels to predict characters. Can you explain the methods or intuitions behind that?
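One common setup, sketched with made-up numpy shapes (not the chapter's exact architecture): a convolutional network maps the pixels to a fixed-length feature vector, and that vector is projected into the decoder's initial hidden state (or attended over), after which caption generation is the same text-decoding problem as summarization.

```python
import numpy as np

rng = np.random.default_rng(2)
image_features = rng.normal(size=512)        # pretend output of a CNN encoder
W_init = rng.normal(size=(8, 512)) * 0.01    # projection into decoder hidden size

# The image is encoded once; from here on the decoder is an ordinary
# language model conditioned on that vector, emitting one token at a time.
h = np.tanh(W_init @ image_features)         # decoder's initial hidden state
print(h.shape)                               # (8,) -- same interface a text encoder would give
```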

di-Tong commented 4 years ago

The applications of encoder-decoder models mentioned in this chapter (machine translation, question answering, image captioning, etc.) do not seem to be usual tasks that are essential to answering social science questions. I wonder if you could discuss the potential of these models to address new questions that could not be answered before, or to address common social science inquiries better than the methods currently used to answer them.

yirouf commented 4 years ago

I have similar questions regarding the application of encoder-decoder models. Say, in its function of summarizing texts/content, how would such an application aid the field of social science in general in generating interesting results (and interesting interpretations of those results)?

luxin-tian commented 4 years ago

It has been mentioned many times in the readings that "question answering" can be realized as an application of the RNN and its extension, the encoder-decoder model. I wonder how this works, since the first step for a trained model to answer a question is to understand the question. I can somewhat intuitively imagine how a model can extract the key information from text data, but I wonder how it can understand the question so as to respond with the corresponding information?

chun-hu commented 4 years ago

The chapter states that encoders usually use stacked architectures of network layers, where the output states from the top layer of the stack are taken as the final representation. I'm having a hard time understanding and visualizing the stack of layers: how do these layers build on each other?
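A minimal sketch of "stacking" with toy numpy RNN cells (random weights, hypothetical sizes): each layer reads the full sequence of hidden states produced by the layer below it, and only the top layer's states are handed on as the final representation.

```python
import numpy as np

def rnn_layer(inputs, dim=4, seed=0):
    """Run one simple RNN over a sequence and return ALL of its hidden states."""
    r = np.random.default_rng(seed)
    W_h, W_x = r.normal(size=(dim, dim)), r.normal(size=(dim, inputs.shape[1]))
    h, outputs = np.zeros(dim), []
    for x in inputs:
        h = np.tanh(W_h @ h + W_x @ x)
        outputs.append(h)
    return np.stack(outputs)

sentence = np.random.default_rng(3).normal(size=(10, 4))  # 10 word embeddings, dim 4

layer1 = rnn_layer(sentence, seed=1)     # bottom layer reads the embeddings
layer2 = rnn_layer(layer1, seed=2)       # next layer reads layer 1's hidden states
layer3 = rnn_layer(layer2, seed=3)       # the top layer's states are the
                                         # "final representation" passed onward
print(layer3.shape)                      # (10, 4): one vector per input position
```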

bjcliang-uchi commented 4 years ago

I have used RNN LSTMs before, and in practice my question is: is it possible to "teach" the algorithm some broader context, such as certain emphases in the weights and/or the connections among sentences and paragraphs?

sunying2018 commented 4 years ago

In the beam search decoding part, I am confused about the stop condition of the while loop. What does "while frontier contains incomplete paths" mean, and how can we identify it? The algorithm only keeps track of complete_paths, and I do not see the related incomplete paths being tracked anywhere.
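My reading of the bookkeeping (a hedged guess, with hypothetical variable names): a path is "incomplete" until it emits the end-of-sentence token; once it does, it is moved out of the frontier into complete_paths, so the frontier only ever holds incomplete paths and the condition reduces to "while the frontier is non-empty".

```python
import random

random.seed(0)
frontier = [["<s>"]]            # incomplete paths still being extended
complete_paths = []             # paths that have already emitted </s>

while frontier:                 # i.e. "frontier contains incomplete paths"
    path = frontier.pop()
    next_tok = random.choice(["word", "</s>"])   # stand-in for the decoder's choice
    new_path = path + [next_tok]
    if next_tok == "</s>" or len(new_path) > 6:
        complete_paths.append(new_path)   # finished paths leave the frontier,
    else:                                 # so the frontier only ever holds
        frontier.append(new_path)         # incomplete ones
print(complete_paths)
```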

rkcatipon commented 4 years ago

This may be something we cover next week, but I'd be curious to see if we could apply encoder-decoder models to generative adversarial networks such as deepfakes. The authors mentioned the application of image captioning, and it made me wonder.

YanjieZhou commented 4 years ago

It is widely accepted that it is hard to interpret the multiple layers of a neural network, but I think it is more meaningful for us to understand the input end of the algorithm than to cope with the annoying and almost uninterpretable neurons.

alakira commented 4 years ago

I am still confused about how to calculate the weights (or attention) from the encoder-decoder, which are then used for tuning the other parameters in the architecture.
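A hedged sketch of the usual recipe (toy numpy, bilinear "general" scoring as one possible choice): the attention weights themselves are not stored parameters; they are recomputed at every step from the current decoder state and the encoder states via a score function and a softmax, and any trainable parameters (e.g. the bilinear matrix here) are learned by ordinary backpropagation along with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(4)
enc_states = rng.normal(size=(6, 8))   # all encoder hidden states
dec_state = rng.normal(size=8)         # decoder state at the current step
W_a = rng.normal(size=(8, 8)) * 0.1    # the only *learned* attention parameter here

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = enc_states @ (W_a @ dec_state)   # one bilinear score per source position
alphas = softmax(scores)                  # attention weights: recomputed, not stored
context = alphas @ enc_states             # fed into the decoder's next state/output
print(alphas.round(2), context.shape)
```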

ziwnchen commented 4 years ago

As @clk16 mentioned, there are different deep language model approaches for similar tasks. However, their theoretical intuitions do have some differences. For example, apart from the RNN, which generates an aggregated hidden state, the main intuition of the autoencoder is its theoretical basis in the latent variable model. That is, when using the encoder-decoder model, people try to identify important features that explain the semantic phenomena. But again, in terms of social science applications, interpretability is still a major problem. I'm wondering about potential approaches to deal with that problem.

sanittawan commented 4 years ago

I am in the same situation as @ccsuehara because I don't feel I have a very good understanding of the encoder-decoder model either. Besides the model in general, I also have a specific question. Under section 10.2, the authors mention that Simple RNNs, LSTMs, GRUs, convolutional networks, and transformer networks can be used as an encoder. Is there a rule for selecting which model to use for which context?

kdaej commented 4 years ago

Would it be more efficient to have multiple layers of hierarchical encoders? Many sentences include multiple clauses, with only one main clause. When text data are fed into the neural network model, should these clauses be distinguished or treated equally?

VivianQian19 commented 4 years ago

I’m not entirely sure I understand the concept of attention, but it seems it is considered a more effective approach to context than bi-RNNs because it can dynamically update the hidden states during decoding: the context vector is recomputed at each decoding step, whereas in bi-RNNs the context is a static vector. So the goal of the attention mechanism is to have a fixed-length context vector that “takes into account information from the entire encoder state that is dynamically updated to reflect the needs of the decoder at each step of decoding”. In the orienting reading, “The Unreasonable Effectiveness of Recurrent Neural Networks”, the author also mentions soft attention vs. hard attention. I wonder what the advantages of using a soft attention vs. a hard attention approach are?
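A rough illustration of the distinction as I understand it (made-up weights, not from this chapter): soft attention takes a weighted average over every encoder position, which is smooth and trainable with ordinary backpropagation, whereas hard attention samples a single position, which gives a sparser summary but is not differentiable and needs reinforcement-style estimators to train.

```python
import numpy as np

rng = np.random.default_rng(5)
enc_states = rng.normal(size=(6, 8))
weights = np.array([0.05, 0.10, 0.50, 0.20, 0.10, 0.05])   # made-up attention weights

soft_context = weights @ enc_states        # soft: blend of ALL positions (differentiable)

picked = rng.choice(len(enc_states), p=weights)
hard_context = enc_states[picked]          # hard: one sampled position (not differentiable)

print(soft_context.shape, hard_context.shape, picked)
```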

cindychu commented 4 years ago

The encoder-decoder network is a very interesting and creative algorithm design that makes use of the contextualized information fed into it. However, I am still wondering how it can be used in translation, since the symbols (one language in the encoder, the other in the decoder?) are so different; and I am also wondering how long the encoder should be to better capture the contextualized information.

cytwill commented 4 years ago

This chapter introduced me to encoder and decoder networks, which I had not heard of before. I am a little confused about the idea of "contextualized information": from the description, I think it is something like a type of embedding, which can affect the specific meaning of the words. So it seems very likely that this method could be useful for generating word embeddings, but it does not appear to have been used much in this field. I am wondering what kinds of drawbacks limit this possibility?