UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter


6. Large Language Models (LLMs) to Predict and Simulate Language - fundamental #22

Open lkcao opened 6 months ago

lkcao commented 6 months ago

Post questions here for this week's fundamental readings:

J. Evans and B. Desikan. 2022. “Deep Learning?” and “Deep Neural Network Models of Text,” Thinking with Deep Learning, chapters 1 and 9

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. 2017. “Attention Is All You Need.”

Pryzant, Reid, Dallas Card, Dan Jurafsky, Victor Veitch, Dhanya Sridhar. 2021. “Causal Effects of Linguistic Properties”.

XiaotongCui commented 5 months ago

This is a real stepping-stone paper! I have a general inquiry regarding the training of transformers (and other deep learning models such as neural networks). How do researchers systematically determine the number of layers in these models, decide on the activation functions (e.g., choosing between linear and ReLU), and decide where to apply functions like softmax? I often find myself selecting these settings more or less at random. Is there a structured approach to designing model architectures that ensures a more systematic and informed decision-making process?
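To make the question concrete, here is a minimal sketch (in PyTorch, with sizes I chose arbitrarily for illustration) of the kind of knobs I mean: depth and width are hyperparameters, ReLU sits between hidden layers, and softmax is conventionally applied only to the final logits (or folded into the loss):

```python
import torch
import torch.nn as nn

class SmallClassifier(nn.Module):
    """A tiny feed-forward classifier illustrating typical design knobs:
    the number of layers, hidden width, and activation are hyperparameters;
    softmax is applied only to the final logits (or handled inside
    nn.CrossEntropyLoss during training)."""
    def __init__(self, input_dim=300, hidden_dim=128, n_layers=2, n_classes=5):
        super().__init__()
        layers = []
        dim = input_dim
        for _ in range(n_layers):                               # depth is a tunable choice
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]   # ReLU between hidden layers
            dim = hidden_dim
        layers.append(nn.Linear(dim, n_classes))                # raw logits, no activation here
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = SmallClassifier()
logits = model(torch.randn(4, 300))       # a batch of 4 fake document vectors
probs = torch.softmax(logits, dim=-1)     # softmax at the output, not in hidden layers
```

My impression is that, rather than choosing these settings randomly, people usually start from a published architecture for a similar task and tune depth and width against validation performance, but I would love to hear whether there is something more principled.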

Vindmn1234 commented 5 months ago

Now that we have all become accustomed to the transformer architecture, looking back at the paper "Attention Is All You Need," it is still amazing what a brilliant invention the self-attention mechanism is. For NLP tasks prior to the transformer, LSTM networks were considered among the state-of-the-art models. Transformers offered a new approach based on self-attention, allowing models to weigh the importance of different words within a sentence or document without the sequential-processing limitations of RNNs and LSTMs. I once heard Mu Li, a well-known Chinese AI scientist at Amazon, describe transformer-based large language models as "大力出奇迹" ("great effort brings miraculous results"): so far, as long as we increase the number of parameters, make the model more complex, and train it on more data, its performance keeps improving; the limits of this architecture have not been reached and seem unlikely to be reached in the near future.

Moreover, I think the most significant advancement brought forth by this architecture is its role as a unifying force between natural language processing (NLP) and computer vision (CV). Historically, these two domains evolved mostly in parallel, with NLP leaning on LSTM models and CV relying on CNNs and GANs (generative adversarial networks). The advent of the transformer has changed this dynamic, allowing both fields to benefit from a shared architecture, which means breakthroughs in one area can be rapidly transplanted into the other. For example, the principles of the transformer have been adapted to Vision Transformers (ViT) in CV, and the evolution from BERT to Masked Autoencoders (MAE) follows the same pattern. This cross-pollination paves the way for significant advances in multimodal models, like OpenAI's CLIP and DALL·E, demonstrating the vast potential of this integrated approach.

bucketteOfIvy commented 5 months ago

TextCause seems to be a really powerful method which, if I'm understanding it correctly, gives us a way to estimate the causal effect of a given linguistic property $T$, provided we have a good way to measure it, a way to estimate other unrelated linguistic properties $Z$*, and a sense of the covariates $C$. However, I'm not entirely clear on what role the covariates $C$ play in the model, or even how they are selected (a tentative sketch of my current reading follows after the footnote). While this might be a causal inference question more broadly, how should we think about the role of the covariates $C$ in TextCause, and (given that intuition) how should we select them for use in our own analyses?

*which seems to often come for essentially free when working on text data, since we have the corpus $W$ in which those properties [usually] reside
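To make my tentative reading concrete, I am assuming $C$ enters through the standard back-door adjustment that adjustment-based estimators build on, roughly

$$\text{ATE} = \mathbb{E}_C\big[\,\mathbb{E}[Y \mid T = 1, C] - \mathbb{E}[Y \mid T = 0, C]\,\big],$$

which would mean $C$ needs to block the non-linguistic back-door paths between the treatment property and the outcome (something like product category in the Amazon example). If that reading is right, selecting $C$ is a substantive domain decision rather than something the algorithm can automate, but I would like to confirm this.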

sborislo commented 5 months ago

TextCause seems like a creative and effective approach under the right conditions, and it reminds me of the use of instruments in quantitative marketing research to account for the lack of direct measurement. However, TextCause's necessary assumption that the treatment effect sign be homogeneous across possible texts seems like a pretty strong assumption in many cases. In the Amazon review example, for instance, if the comparison were between 5-star ratings and 4-star ratings instead, would 4-star ratings not be preferred for certain kinds of digital music products, since 5-star ratings might be seen as suspicious? This may seem like a rather niche example, but I don't think it is.

In cases where the treatment effect sign is not homogeneous, is there a way to get around this assumption?

yuzhouw313 commented 5 months ago

In "Attention Is All You Need," Vaswani et al. mainly experimented with English-to-German machine translation tasks to demonstrate the power of self-attention using the encoder and decoder structure. Considering the transformer model's exemplary achievements in these language translation tasks, I'm intrigued to see its broader implications for other NLP tasks. Therefore, my question is: can the transformer architecture be considered a universal solution for NLP challenges as it addresses the issue of long-range dependency and high computing power requirement, or are there specific tasks for which it might not be suitable?

volt-1 commented 5 months ago

"Attention is All You Need" has been a game changer in NLP and many other fields. It enabled smaller infrastructure kits to do the work of entire data centers. The importance of this paper to AI cannot be overstated. I personally don't have any technical question about Transformer architecture.

Seven years later, the migration of top AI talent, including all of the paper's authors (Vaswani, Shazeer, and the rest), to startups or tech giants (Character.ai, Inflection.ai, Adept.ai, etc.) is an interesting trend. It highlights a significant shift in the AI landscape, where industrial players are increasingly dominating the field. This raises concerns about the widening gap between academia and industry, as noted by Fei-Fei Li at Stanford. The challenge academic institutions face in competing with the resources of large tech companies, especially in training large-scale language models, is indeed formidable.

My question is: How can academia adapt to the trend of talent migration to the industry? What strategies could be employed to ensure that the flow of knowledge and innovation remains bidirectional between academia and industry?

YucanLei commented 5 months ago

The paper contributes to our understanding of how linguistic properties, specifically politeness in complaints, can influence customer-company interactions and response times. A similar pattern has appeared with GPT, where some people believe that if you speak to GPT more politely, you are more likely to be chosen to join the beta test. My question is:

How do cultural and contextual factors influence the effectiveness of polite language in complaints and its impact on response times from companies?

Japanese culture, for example, is widely considered perhaps the most polite, yet some suggest that politeness works differently there: polite language that appears kind to people elsewhere might actually be perceived as condescending in Japan. So how should cultural factors be considered here?

alejandrosarria0296 commented 5 months ago

The basic structure of deep neural networks explored in the first chapter of Thinking with Deep Learning shows inputs, layers, and an output. Are these layers necessarily sequential? Are there architectures in which a submodel or an entire layer is used recursively?
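As a concrete version of what I mean, here is a small, purely hypothetical sketch of a non-strictly-sequential design: a single block whose weights are shared and which is applied recursively several times (I believe some published models share weights across depth in a similar spirit, though the block below is just my own illustration):

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """One shared layer applied n_steps times: the same weights are reused,
    so 'depth' comes from repetition rather than from stacking distinct layers."""
    def __init__(self, dim=64, n_steps=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.n_steps = n_steps

    def forward(self, x):
        for _ in range(self.n_steps):
            x = x + self.shared(x)   # residual connection around the reused layer
        return x

out = RecursiveBlock()(torch.randn(2, 64))   # the same block applied four times
```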

h-karyn commented 5 months ago

The paper “Causal Effects of Linguistic Properties” introduces TEXTCAUSE, an algorithm leveraging distant supervision and BERT for estimating causal effects of linguistic properties. However, how does TEXTCAUSE specifically handle the inherent ambiguity and context-specific nature of language, especially when linguistic properties might have different connotations in various contexts?

yunfeiavawang commented 4 months ago

The paper "Causal Effects of Linguistic Properties" by Pryzant et al. explores the challenge of estimating causal effects of linguistic features using observational data. The paper mentions the challenge of confounding variables and the need for assumptions in causal inference. What are some specific scenarios where these assumptions might not hold, and how would that impact the validity of TEXTCAUSE's estimates?

yueqil2 commented 4 months ago

"Causal Effects of Linguistic Properties" seems to take a unbelievable challenge: estimation the causal effects of latent linguistic properties from observational data. I wonder how other social scientists comment on this approach and how to evaluate its applicability.

cty20010831 commented 4 months ago

I think this paper has significant practical implications for different industries. For instance, I am thinking that it could be applied in business settings (e.g., examining the extent to which a positive product review increases sales). Hence, I am wondering what the caveats are of applying this academic product to real-life business settings?

naivetoad commented 4 months ago

"Attention is All You Need" introduces the Transformer, a novel neural network architecture for sequence transduction based entirely on attention mechanisms. How does the Transformer's reliance on attention mechanisms alone affect its ability to model longer sequences or sequences with more complex dependencies?

runlinw0525 commented 4 months ago

Regarding "Attention is All You Need", my question is: given the Transformer model's transformative impact on machine translation tasks, as demonstrated in the paper, how might it revolutionize other areas of AI, such as natural language understanding and content generation? Its ability to handle long-range dependencies more effectively than previous architectures is particularly noteworthy.

donatellafelice commented 4 months ago

I have a really general question that came to mind when reading the opening chapter of Thinking with Deep Learning, specifically about all our options: "For this reason, options form the skeleton of each chapter in this book; and where the choice is not entirely clear about which to choose, we encourage you to (1) experiment, (2) buy a collection of multi-sided dice to enhance the experience (and increase the variety) of your random selection (see Figure 1-1), and (3) design new neural networks to help you think through critical decisions, as we begin to introduce in chapter 3. We invite you to exercise the flexibility of choice between options in deep learning models because for some challenging tasks, you will need to explore them all!"

I ask, then, for those of us who do not have the luxury of a TA or office hours with an expert in the field: is there some centralized codebook or agreed-on place that collects and recounts all the novel and interesting experiments and deployments that people have published? When you experiment, it is often better to mimic first. Is there a codebook that can help us understand which model has been used on which type of dataset previously? It seems like something an LLM itself would be the perfect tool for...

ethanjkoz commented 4 months ago

Reading through the Vaswani et al. 2017 paper, I have admittedly gotten somewhat lost. I understand the power that transformers offer in terms of computational efficiency and such, but I am struggling with the section on the model architecture. As I understand it, an attention function maps a query and a set of key-value pairs to an output. However, I get lost when the paper discusses scaled dot-product and multi-head attention. Also, I was wondering about their usage of "self-attention": how does this differ from the normal attention function?
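To check my own reading of the architecture section: the scaled dot-product attention the paper defines is $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{\top}/\sqrt{d_k})\,V$, and "self-attention" seems to simply be the case where $Q$, $K$, and $V$ are all derived from the same sequence. A minimal sketch of that understanding (not the paper's reference code):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al. (2017)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # how well each query matches each key
    weights = F.softmax(scores, dim=-1)              # attention weights over positions
    return weights @ V                               # weighted sum of the values

# Self-attention: queries, keys, and values all come from the same sequence X.
X = torch.randn(1, 10, 64)                           # (batch, sequence length, model dim)
out = scaled_dot_product_attention(X, X, X)          # simplified: no learned projections here
```

Multi-head attention, as I understand it, just runs several such attentions in parallel on learned lower-dimensional projections of $Q$, $K$, and $V$ and concatenates the results, so different heads can pick up different kinds of relationships. Is that roughly right?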

joylin0209 commented 4 months ago

I'm interested in the paper by Pryzant et al. on applying fine-tuned language models to tune text in causal inference. How can the capabilities of pre-trained language models like BERT be utilized to enhance the accuracy of causal effect estimates of language attributes?

beilrz commented 4 months ago

TEXTCAUSE could be a very useful tool for social science research. My question is: what are some tasks for which we could utilize the capabilities of TEXTCAUSE? Furthermore, can TEXTCAUSE be extended to NLP and causal inference tasks in other languages and other domains as well?

anzhichen1999 commented 4 months ago

How might the methodologies for adjusting for confounding information in text documents be applied to enhance our understanding of the changes in scientific collaboration and novelty during the COVID-19 pandemic, as highlighted in the 'Pandemics are Catalysts of Scientific Novelty' paper?

QIXIN-ACT commented 4 months ago

Exploring "Attention is All You Need" brings a wave of excitement! However, navigating through the specialized terminology can be quite challenging. In approaching research papers of this nature, is it crucial to fully grasp the underlying mechanisms, or should we aim to understand them to a certain degree? Additionally, TextCause piques my curiosity, yet I question whether quantitative research alone is sufficient for drawing causal inferences without the support of qualitative studies.

icarlous commented 4 months ago

I am interested in comparing human and machine attention. There are linguistic papers analyzing the distribution (over time) of people's attention during the reading process. How can these different kinds of attention inform each other?

chenyt16 commented 4 months ago

The paper introduces the concept of self-attention, which is the foundational architecture of large language models. By utilizing self-attention, the model can achieve a better understanding of the input context. However, the computational cost of self-attention-based models is quite high during training and inference. Is there any way to mitigate this problem?
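For reference, my recollection of the per-layer complexity comparison in the paper (its Table 1) is roughly

$$\text{self-attention: } O(n^2 \cdot d), \qquad \text{recurrent: } O(n \cdot d^2),$$

where $n$ is the sequence length and $d$ is the representation dimension, so the quadratic-in-$n$ term is exactly the cost in question. The paper itself mentions restricted self-attention over a local window of size $r$, with cost $O(r \cdot n \cdot d)$, as one way to trade coverage for efficiency, but I am curious what is used in practice.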

Caojie2001 commented 4 months ago

I tried to understand the interesting and important paper 'Attention is All You Need' through some online resources, but I think I can only interpret a limited part of its overall strategy and implementation details. For example, I still don't really understand how the three tensors Q, K, and V are created in the self-attention mechanism.
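From the resources I looked at, my current understanding is that Q, K, and V are not separate inputs: in self-attention each is produced from the same input sequence by its own learned linear projection. A small sketch of that understanding (the dimensions are my own arbitrary choices), which I would appreciate having checked:

```python
import torch
import torch.nn as nn

d_model = 64
x = torch.randn(1, 10, d_model)                 # (batch, sequence length, model dim)

W_q = nn.Linear(d_model, d_model, bias=False)   # learned projection producing queries
W_k = nn.Linear(d_model, d_model, bias=False)   # learned projection producing keys
W_v = nn.Linear(d_model, d_model, bias=False)   # learned projection producing values

Q, K, V = W_q(x), W_k(x), W_v(x)                # three different "views" of the same tokens
weights = torch.softmax(Q @ K.transpose(-2, -1) / d_model ** 0.5, dim=-1)
out = weights @ V                               # each position is a weighted mix of all values
```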

Twilight233333 commented 4 months ago

The authors' assessment of the causal effect is impressive. I wonder how the authors can clearly identify the causal problem, such as the polite complaint letter, and whether they can control for more variables, such as the mood of the person reading the letter. If you assume that the mood of the letter has an effect, then you would also have to control for the mood of the reader.

Brian-W00 commented 4 months ago

How can we make better use of attention models and deep learning to understand and predict how certain ways of using language affect communication success in different situations?

Marugannwg commented 4 months ago

First time visiting "Attention is All You Need"... It is hard to read the paper without a neural network and deep learning background, and I resorted to videos and many other resources. It feels like the researchers found a way to handle the long-range dependency issue in NLP: previously, everything had to be processed step by step in sequence, but now the decoder can look at (some place??), and the input and output come into the model at the same time and are handled together.

I'm trying to make sense of the figure -- it looks like there are multiple attention blocks involved (at least three visible in the diagram). How do we understand the second attention layer, where it takes the "query" from the output but the "key" vector from the input?
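To make my guess explicit: I think that second block is encoder-decoder ("cross") attention, where the queries come from the decoder's states while the keys and values come from the encoder's output, so each output position asks which input positions are relevant to it. A rough sketch under that assumption (shapes invented for illustration):

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    return F.softmax(Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5, dim=-1) @ V

encoder_output = torch.randn(1, 12, 64)   # representations of the 12 source (input) tokens
decoder_states = torch.randn(1, 7, 64)    # states for the 7 target (output) tokens so far

# Cross-attention: queries from the decoder, keys/values from the encoder,
# so every target position attends over all source positions.
context = attention(decoder_states, encoder_output, encoder_output)
```

Does that match how others read the figure?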

HamsterradYC commented 4 months ago

The paper “Causal Effects of Linguistic Properties” discusses overcoming confounding in text to estimate causal effects more accurately. However, it does not deeply explore the impact of evolving language use over time on these estimates. How might temporal shifts in language and semantics affect the reliability of causal estimates made by TEXTCAUSE, and what strategies could be employed to mitigate such impacts?

JessicaCaishanghai commented 4 months ago

This paper is quite interesting and investigates the causal relationships of different linguistic properties. I always think it's challenging when it comes to language notation. Given the success of the Transformer model in achieving superior performance on machine translation tasks while being more parallelizable and requiring less training time compared to traditional recurrent or convolutional neural network-based models, how might its adoption impact the development and deployment of natural language processing systems in various domains? For example, could it be used to detect the real writers of some poems published anonymously?

Dededon commented 4 months ago

The BERT model has broad implications. I'm curious about the current criticisms of BERT-based models and about the social science research tasks on which BERT-based models perform badly. Should we avoid using BERT-based models in some cases?

erikaz1 commented 4 months ago

If all neurons in a layer have the same activation function, then how do we determine the number of neurons in each layer, and how does each neuron uniquely contribute to the network? (A: Different weights)

How do we determine the amount of redundancy within a network, and how important is redundancy?

floriatea commented 4 months ago

In "Causal Effects of Linguistic Properties", given the significant role of non-verbal cues (tone, pauses, facial expressions) in communication, how could future models integrate these aspects to provide a more holistic understanding of the causal effects of communication strategies?

Carolineyx commented 3 months ago

How does the Transformer model ensure the capture of complex sequential information that recurrent models inherently handle well? Additionally, how does the model's structure support or limit its ability to learn and generalize across different languages and tasks beyond machine translation, such as more nuanced linguistic or context-dependent challenges?