UChicago-Computational-Content-Analysis / Readings-Responses-2023


7. Accounting for Context - fundamental #17

JunsolKim opened this issue 2 years ago

JunsolKim commented 2 years ago

Post questions here for this week's fundamental readings: J. Evans and B. Desikan. 2022. “Deep Learning?” and “Deep Neural Network Models of Text”, Thinking with Deep Learning, chapters 1 and 9.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. 2017. “Attention Is All You Need.”

Pryzant, Reid, Dallas Card, Dan Jurafsky, Victor Veitch, Dhanya Sridhar. 2021. “Causal Effects of Linguistic Properties”.

konratp commented 2 years ago

In "Attention Is All You Need", the authors claim that "Most competitive neural sequence transduction models have an encoder-decoder structure" (p.2). It's hard for me to imagine how such a model oculd ever work without relying on such encoder-decoder structures? Are there successful examples where it was a strength of the paper not to rely on encoder-decoder structures?

pranathiiyer commented 2 years ago

I've often read that ReLU is the most widely used activation function for certain reasons. However, I'm not sure I completely understand how different activation functions affect the output of a specific node. The first chapter of Thinking with Deep Learning touches on this, but I think I'd understand better if it were broken down a little more with some examples!
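
To make this concrete, here is a minimal numpy sketch (the values are arbitrary, purely for illustration) of how the same pre-activation value z = w·x + b at a node is transformed by three common activations:

```python
import numpy as np

# The same pre-activation values z = w.x + b, passed through different activations
z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])

relu    = np.maximum(0, z)          # zero for negative inputs, identity otherwise
sigmoid = 1 / (1 + np.exp(-z))      # squashes everything into (0, 1)
tanh    = np.tanh(z)                # squashes everything into (-1, 1)

for name, out in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh)]:
    print(name, np.round(out, 3))
```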

ValAlvernUChic commented 2 years ago

I'm curious about how these architectures can handle polysemous words (assuming it's not large models like BERT, GPT, GPT-2). If I understand the models correctly, each word is assigned a vector in the embedding layer, but if that's the case then, at least intuitively, it seems that the multiple meanings of polysemous words might be lost. For example, "good": "The pope was a good person" vs. "He was a good guitar player"; the former is a moral evaluation and the latter a skill evaluation.
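
A small sketch of the contrast (assuming the HuggingFace transformers library and bert-base-uncased, just as an illustration): a static embedding layer would return the same vector for "good" in both sentences, while a contextual model produces different vectors depending on the surrounding words.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The pope was a good person", "He was a good guitar player"]
vecs = []
for s in sentences:
    inputs = tok(s, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    # locate the token "good" in this sentence and keep its contextual vector
    idx = inputs.input_ids[0].tolist().index(tok.convert_tokens_to_ids("good"))
    vecs.append(hidden[idx])

# A static embedding (e.g., word2vec) would give identical vectors here;
# BERT's contextual vectors for "good" differ across the two sentences.
cos = torch.nn.functional.cosine_similarity(vecs[0], vecs[1], dim=0)
print(cos.item())  # typically high, but below 1.0
```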

facundosuenzo commented 2 years ago

In Evans and B. Desikan's chapters: 1) What is the relationship between the loss function and the accuracy measure? 2) What are the costs (empirical and practical) of adding layers to your neural model? To what extent does the nature of the data limit those layers? And thus, how flexible are they to expand/contract?

isaduan commented 2 years ago

How can we access different internal representations within the transformer and use them for analysis, if at all? Would appreciate a walk through of those components and what they mean intuitively.
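
One common route (a sketch assuming the HuggingFace transformers library, not necessarily what the chapter has in mind): ask the model to return its hidden states and attention weights, which gives one token-by-dimension matrix per layer and one heads-by-tokens-by-tokens attention matrix per layer.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("Attention is all you need", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, output_attentions=True)

print(len(out.hidden_states))       # embeddings + one per layer (13 for BERT-base)
print(out.hidden_states[-1].shape)  # (1, seq_len, 768): token vectors at the last layer
print(out.attentions[0].shape)      # (1, 12, seq_len, seq_len): per-head attention weights
```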

Jasmine97Huang commented 2 years ago

In Attention Is All You Need, the authors mention that "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions". Such a structure sounds really appropriate and promising for capturing hidden, deep semantic phenomena like sarcasm, slang, polysemy, or metaphor. Can you give more examples of how multi-head attention improves language representations?

Qiuyu-Li commented 2 years ago

This week's fundamental readings are really information-intensive. My question is: Is there any way to learn about the text data by observing the representation instead of the outcomes? For example, I may want to train my model to perform a classification task, but is there any way to learn about the weights of text features?
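
For a non-neural baseline, one place where weights are directly readable is a linear classifier over TF-IDF features. A toy sketch (corpus and labels made up) of inspecting the learned representation rather than just the predictions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus and labels, purely illustrative
texts  = ["great plot and acting", "boring and predictable", "loved the soundtrack", "terrible pacing"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# Pair each feature with its learned weight and sort: the representation itself
# tells you which words push the classifier toward each class.
weights = sorted(zip(vec.get_feature_names_out(), clf.coef_[0]), key=lambda t: t[1])
print(weights[:3])   # features pushing toward class 0
print(weights[-3:])  # features pushing toward class 1
```

For neural models, the analogue would be inspecting the embedding layer's weight matrix or using attribution methods, since individual weights are no longer tied to named features.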

Sirius2713 commented 2 years ago

I have a question about the causal effects paper: in the "Improved Proxy Labels" part, they say they relabel examples whose predicted labels are 0 but whose reader-perceived labels look like 1. But if they use the same classifier they trained to produce the labels, how would they get different results when relabeling?

I would also appreciate more details and explanation of the attention mechanism.

Jiayu-Kang commented 2 years ago

In Evans and B. Desikan's chapters:

  1. What is the relationship between the loss function and the accuracy measure?
  2. What are the costs (empirical and practical) of adding layers to your neural model? To what extent does the nature of the data limit those layers? And thus, how flexible are they to expand/contract?

My understanding is that the loss function produces a measure of distance (so it can be used to optimize the model), while accuracy shows the performance (i.e., how many correct predictions are made out of all predictions), so I guess there isn't necessarily a fixed relationship, and both should be considered when evaluating the model.

For Question 2, I guess overfitting could be a problem? And I'm also curious about what the convention is to decide the number of layers/nodes and how data types affect the number of layers.
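
To make the first point concrete, a tiny made-up illustration: two sets of predictions can have identical accuracy while cross-entropy loss still separates them by how confident the correct predictions are.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])

def cross_entropy(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(y, p):
    return np.mean((p >= 0.5) == y)

# Two sets of predicted probabilities with the same accuracy but different loss:
p_confident = np.array([0.9, 0.1, 0.8, 0.95])
p_hesitant  = np.array([0.6, 0.4, 0.55, 0.51])

for name, p in [("confident", p_confident), ("hesitant", p_hesitant)]:
    print(name, "loss:", round(cross_entropy(y_true, p), 3), "accuracy:", accuracy(y_true, p))
```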

sudhamshow commented 2 years ago

A couple of general questions regarding deep neural networks:

1. Multilayer feedforward neural networks are sometimes deemed universal function approximators (Kurt Hornik et al., 1989). How do deep neural networks achieve this complexity without overfitting?
2. If they can serve as universal approximators, wouldn't that void the 'no free lunch' theorem? A question on the no-free-lunch theorem as well: at what level of task abstraction do the models begin to fail? Is it across completely different sets of tasks (image vs. text vs. audio), or sub-tasks within a genre (text: classification, generation)?
3. The paragraph on loss functions (Evans and Desikan, Chapter 1) also mentions using MAE as a loss function. Does the optimizer use a gradient-descent-like technique? If yes, how is differentiation handled at the minimum? (See the sketch below.)
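
On point 3: MAE (L1) loss is not differentiable exactly at zero error, and my understanding is that autograd frameworks fall back on a subgradient there. A quick check of what PyTorch does (used here only as an example):

```python
import torch

# MAE (L1) loss is not differentiable exactly at zero error; autograd engines
# handle this with a subgradient. A quick check of what PyTorch does:
pred   = torch.tensor([2.0, 3.0, 5.0], requires_grad=True)
target = torch.tensor([2.0, 1.0, 7.0])   # the first element has exactly zero error

loss = torch.nn.functional.l1_loss(pred, target)  # mean absolute error
loss.backward()

print(loss.item())   # (0 + 2 + 2) / 3
print(pred.grad)     # sign(pred - target)/n per element; the zero-error element gets gradient 0
```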

mikepackard415 commented 2 years ago

I find the content we're exploring here pretty fascinating, but to be honest the huge space of possible setups is a little overwhelming. I guess I'm wondering about accuracy in these models. Are we at a point where two neural network setups might disagree on a question given the same inputs just as reasonably as two humans might disagree? To what extent are these setups black boxes where we kind of lose the ability to understand what is going on under the hood?

GabeNicholson commented 2 years ago

I've often read that ReLU is the most widely used activation function for certain reasons. However, I'm not sure I completely understand how different activation functions affect the output of a specific node. The first chapter of Thinking with Deep Learning touches on this, but I think I'd understand better if it were broken down a little more with some examples!

The thing to note here is that it's the derivative of the activation function that matters most (so long as the activation gives the information that is needed in the first place). With tanh and sigmoid, the derivatives shrink toward zero at the extreme ends of the function, so when you train the model, backpropagation goes wrong and you get vanishing gradients. This is why ReLU works so well: (1) it does the job of being an activation function well (it is super simple), and (2) it has well-behaved derivatives that can be used for backpropagation. So to summarize, it's not so much about what makes an activation function "good"; it's more about avoiding an activation function that has bad derivatives anywhere. Many functions can do the job; it's just that some blow up in bad cases.
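
A quick numeric illustration of that point, comparing the derivatives of the three activations at the same pre-activation values:

```python
import numpy as np

# Derivatives of three activations at the same pre-activation values:
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])

s = 1 / (1 + np.exp(-z))
d_sigmoid = s * (1 - s)              # ~0 for |z| large: the gradient vanishes
d_tanh    = 1 - np.tanh(z) ** 2      # same saturation problem
d_relu    = (z > 0).astype(float)    # 1 for positive inputs, never shrinks

print("sigmoid'", np.round(d_sigmoid, 4))
print("tanh'   ", np.round(d_tanh, 4))
print("relu'   ", d_relu)
```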

hshi420 commented 2 years ago

For pretrained models, can downstream-task performance be improved by pretraining on a different corpus?

Hongkai040 commented 2 years ago

I'm wondering: if both the training set and the test set are randomly sampled, can a model perform better on the test set than on the training set? And is it necessary to pursue zero loss (or as low a loss as possible)?

LuZhang0128 commented 2 years ago

This week's readings, as well as some from previous weeks, made me feel that I need to learn more about the mathematical model behind the code. For instance, in neural networks, different loss functions have different implications. I really want to understand why and how it works, in order to find the best method for my textual data.

sizhenf commented 2 years ago

I'd love to read more about the effect of tones/languages, not just on Amazon reviews.

NaiyuJ commented 2 years ago

If the results are not ideal when we implement a deep learning model, should we change to another model or just adjust the parameters?

kelseywu99 commented 2 years ago

I like how condensed this week's fundamental readings are. Still, I was somewhat confused by the relationship between the three layers and the analytic pipeline. What does the pipeline do to the neural network?

YileC928 commented 2 years ago

The Pryzant et al. (2021) paper introduces a number of useful techniques for estimating the ATE of linguistic properties. I just read a few papers that use ITE to do causal inference on text data, so I'm wondering (which may be a stupid question): when should we use ITE instead of ATE?

chentian418 commented 2 years ago

In Chapter 9: Text Learning with Sequences, I am confused about when to use Long Short-Term Memory (LSTM). I remember that Pham and Shen (2017) from last week's readings utilize LSTM to estimate the propensity score functions. I am also curious about contextual word embeddings: when do we value the contextual meaning of words more than their uniform meaning, such that contextual embeddings are more appropriate in the general social science research regime?

Emily-fyeh commented 2 years ago

I would like to know whether the hidden layers of these models can be dissected and interpreted, in order to help us understand how linguistic features are transformed in downstream tasks.

ttsujikawa commented 2 years ago

From my limited understanding of ML, in the image-processing field it's been said that pre-processing, especially annotation, is a really important factor in raising the quality of training. I think this is meant to put weight on the relevant pixel data of images and to ignore unnecessary parts of images. Is there any way to do such a job on textual data, to enable the machine to focus on important parts and ignore the rest?

melody1126 commented 2 years ago

For the Vaswani paper on attention, what is the purpose of defining "attention" separately? What is the difference between the different attention mechanisms in terms of what they can achieve and what inputs they take (scaled dot-product vs. multi-head)?
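
A minimal numpy sketch following the paper's definitions (random matrices stand in for learned parameters): scaled dot-product attention is the single operation softmax(QKᵀ/√d_k)V, and multi-head attention runs several of these in parallel over different learned projections of the same inputs, then concatenates the results.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # how much each position attends to every other position
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_k = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))          # token representations

# Multi-head attention: each head projects X into its own (smaller) subspace,
# attends there, and the head outputs are concatenated.
heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v))

multi_head_output = np.concatenate(heads, axis=-1)  # (seq_len, d_model); the paper applies a final linear layer after this
print(multi_head_output.shape)
```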