From reading "Text Learning with Attention" - is perplexity used during the training process or in feed forward?
I am wondering about current seq2seq applications. Chapter 10 outlines some limitations with translating in sequence. For example, translating the French phrase “le chat noir” to “black cat” presents challenges because the order of words is not consistent, so a literal/sequential translation would output “cat black” instead of “black cat”. To my understanding, the alternative method for translating short sentences is outlined through LSTM methods. Is this what current, successful translation services like Google Translate use? Or do LSTM implementations only work on shorter texts?
A question on next token generation: why does randomly sampling from a probability distribution proportional to the scores (vs. just choosing the token with the highest score) improve model performance?
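A toy sketch contrasting the two decoding strategies in the question, with made-up logits over a tiny vocabulary: greedy decoding always picks the same token and tends to produce repetitive, low-diversity text, whereas sampling in proportion to the scores introduces variety.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["black", "cat", "the", "noir"]          # toy vocabulary
logits = np.array([2.0, 1.5, 0.3, 0.1])          # made-up next-token scores

probs = np.exp(logits) / np.exp(logits).sum()    # softmax over the scores

greedy_token = vocab[int(np.argmax(probs))]               # always the same output
sampled_token = vocab[rng.choice(len(vocab), p=probs)]    # varies run to run
print(greedy_token, sampled_token)
```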
What are the comparative effects of fine-tuning different transformer-based models (such as BERT) on domain-specific tasks like legal document analysis or crime classification, in terms of both accuracy and computational efficiency? In a previous project, I noticed that GPT-4 already achieves high accuracy with self-ask prompt engineering, with an inter-model reliability (treating human coders as one of the models) of 0.916.
I am curious about how different word embedding models (word2vec-CBOW, word2vec-skipgram, FastText, GloVe) perform relative to each other. As we saw in the HW for last week, when trained on a small corpus, word embedding models may reach very different estimates of word similarities. Is the same true for these different models? If you trained them all on the same (large-sized) corpus, how much variation in the embeddings would you expect?
In Chapter 10, the attention mechanism is described as a “retrieval process that calculates dynamic alignment weights,” which incorporates information on relevant parts of your input sequence and can be sent to the decoder for correct output. In this case, can you use attention directly on the encoder inputs without an RNN, since you do not have to backprop the sequence into a hidden, dynamic state?
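A toy numpy sketch of scaled dot-product self-attention applied directly to (random stand-in) input embeddings, with no recurrent state at all; this is essentially what a transformer encoder layer does.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 3, 4
X = rng.normal(size=(seq_len, d))     # stand-in for input token embeddings

Q, K, V = X, X, X                     # self-attention: queries = keys = values
scores = Q @ K.T / np.sqrt(d)         # dynamic alignment weights (pre-softmax)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ V                 # each row is a weighted mix of all inputs
print(weights.round(2))
```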
The article on "Text Learning with Attention" from Chapter 10 discusses advanced text processing techniques using Recurrent Neural Networks (RNNs), attention mechanisms, and transformers in natural language processing tasks like translation, summarization, and question-answering. When it comes to attention mechanism specificity: How do different attention mechanisms (like the ones used in the Bahdanau vs. Luong models) impact the performance and outcome of specific language processing tasks? Also, given the computational demands of models like GPT-3 and BERT, what are some effective strategies to scale these models down for use in more constrained environments without significantly compromising their effectiveness?
In Chapter 9, I am unclear about how we can effectively evaluate the performance of RNNs in tasks beyond traditional metrics, especially considering interpretability and the ability to generalize across different types of sequence learning tasks.
Can word embeddings distinguish between words that are closely related but opposite in meaning? For example, 'happy' and 'sad' seem to cluster closely together in semantic space, presumably because they are often used in similar contexts, but their meanings are opposite.
Is there a method for comparing the performance of various word embedding models? If so, is it common practice to include this comparison in your paper?
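A sketch touching both questions above, assuming gensim is installed and the pretrained "glove-wiki-gigaword-50" vectors can be downloaded: antonyms such as "happy"/"sad" often score as highly similar because they occur in similar contexts, and a common way to compare embedding models is to score each one on the same intrinsic benchmark (here the WordSim-353 human similarity judgments bundled with gensim's test data).

```python
import gensim.downloader as api
from gensim.test.utils import datapath

glove = api.load("glove-wiki-gigaword-50")       # pretrained vectors (download)

print(glove.similarity("happy", "sad"))          # often surprisingly high
print(glove.similarity("happy", "table"))        # unrelated word, for contrast

# The same call would be repeated for each model you want to compare.
pearson, spearman, oov_pct = glove.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(spearman, oov_pct)   # rank correlation with human ratings, % out-of-vocab pairs
```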
From Chapter 9, word embeddings like word2vec capture semantic similarities between words based on their distribution in language. However, they may inherit biases present in the training data, such as gender or cultural biases (certain professions might be associated with specific genders in the embeddings). The reading shows that techniques like debiasing or retrofitting can mitigate these biases. Could we elaborate further on the debiasing techniques mentioned in the text? How effective are these techniques in mitigating biases, and are there any limitations or drawbacks associated with them?
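A highly simplified sketch of one debiasing idea, in the spirit of hard debiasing (Bolukbasi et al.): estimate a bias direction from seed word pairs and project it out of other word vectors. The embeddings below are random toy stand-ins, not real word2vec vectors, so this only illustrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["he", "she", "nurse", "engineer"]}

def unit(v):
    return v / np.linalg.norm(v)

bias_dir = unit(emb["he"] - emb["she"])            # crude "gender direction"

def debias(v, direction):
    return v - (v @ direction) * direction         # remove the bias component

emb["nurse"] = debias(emb["nurse"], bias_dir)
emb["engineer"] = debias(emb["engineer"], bias_dir)
```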
In the GPT series of models, auto-regression and conditional sampling are core mechanisms. I wonder, do the models of various sizes that have been recently released also employ these same principles and methods?
What we know about a language model (hopefully) are the architecture/method and the source of text used. Today, we have many different open-source models and API LLM services to accomplish the tasks discussed in the textbook. How do we compare their differences (given we know little about their architectures and training resources)? Also, since we can create embeddings of our own text corpus using those existing services, is there a meaningful interpretation in comparing the embeddings created across models? Are there methods other than comparing cosine similarity to evaluate semantic differences?
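One possible sketch of a comparison that goes beyond raw cosine similarities: since embeddings from different models live in different, unaligned spaces, you can instead compare their relative geometry, e.g. the rank correlation of pairwise similarities over the same word list. The two embedding matrices below are random stand-ins for the outputs of two hypothetical embedding services.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
words = ["cat", "dog", "car", "truck", "happy"]
emb_a = rng.normal(size=(len(words), 64))
emb_b = rng.normal(size=(len(words), 128))     # different dimensionality is fine

def pairwise_cos(E):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    return sims[np.triu_indices(len(E), k=1)]  # upper triangle, no diagonal

rho, _ = spearmanr(pairwise_cos(emb_a), pairwise_cos(emb_b))
print(rho)  # high rho => the two models agree on the relative similarity structure
```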
In Chapters 9 and 10, we have learned many text analysis tools. I am curious about the robustness of text analysis results following translation. For instance, suppose we use the summarization pipeline in transformers to summarize an English news article, and then summarize it again after it has been translated into Chinese: how consistent are the two summaries? Are there translation tools that better maintain the robustness of text analysis results post-translation?
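A rough sketch of that consistency check, assuming the transformers library and the named checkpoints can be downloaded ("Helsinki-NLP/opus-mt-en-zh" is one English-to-Chinese translation model). The default summarization checkpoint is English-only, so the translated article would still need a multilingual summarizer before the two summaries could be compared.

```python
from transformers import pipeline

summarizer = pipeline("summarization")                                   # default English model
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")

article = (
    "The city council voted on Tuesday to expand the bike lane network. "
    "Supporters argued it would reduce traffic, while critics cited the cost."
)
summary_en = summarizer(article, max_length=40, min_length=5)[0]["summary_text"]
article_zh = translator(article)[0]["translation_text"]
print(summary_en)
print(article_zh)  # next step: summarize this with a multilingual model and compare
```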
In RNNs, how do we tell we're running into vanishing gradient and exploding gradient problems? Are there other problems we need to pay attention to when dealing with text data using DL models?
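A minimal PyTorch sketch of one practical diagnostic: log per-layer gradient norms during training. Norms shrinking toward zero in earlier layers suggest vanishing gradients; norms blowing up (or NaN losses) suggest exploding gradients. The tiny RNN and random batch below are toy stand-ins.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
head = nn.Linear(16, 2)
x = torch.randn(4, 50, 8)              # 4 sequences of length 50
y = torch.randint(0, 2, (4,))

out, _ = rnn(x)
loss = nn.functional.cross_entropy(head(out[:, -1]), y)
loss.backward()

for name, p in rnn.named_parameters():
    print(name, p.grad.norm().item())  # watch these across layers and over time

# A common mitigation for exploding gradients, applied before optimizer.step():
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)
```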
What are the main types of models used for text learning with sequences? How do Recurrent Neural Networks (RNN) compare to Hidden Markov Models (HMM) in text generation?
Chapter 9 introduces several techniques to mitigate the effect of biased training data, like retrofitting and debiasing. I'd like to know more about the potential problems these techniques might cause for the accuracy of RNN models.
Is there a way to retrieve the individual dimensions of the representations of text data within, or resulting from, a DL word embedding model? For instance, Figure 9-6 in Chapter 9 projects the data into 2D space after dimension reduction with t-SNE, and visually we have an x-dimension and a y-dimension. Is there any way to understand the makeup of these dimensions? (I'm aware this may not be the best example because this is a word2vec embedding.) Intuitively, these dimensions must be capturing some kind of pattern that is more present in some parts of the data than others; these patterns are likely not captured by independent variables like “gender” or “income” but by a combination of such seemingly independent variables.
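One small sketch of a related probe: t-SNE axes are generally not interpretable, but a linear reduction such as PCA gives axes you can inspect by ranking words along each component. The embeddings below are random stand-ins for a trained word2vec model, so the "axes" here are meaningless; with real vectors the extremes of each axis often hint at what varies along it.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "paris", "france", "cat", "dog"]
emb = rng.normal(size=(len(words), 100))   # stand-in word vectors

coords = PCA(n_components=2).fit_transform(emb)
for axis in range(2):
    order = np.argsort(coords[:, axis])
    print(f"axis {axis}: low = {words[order[0]]}, high = {words[order[-1]]}")
```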
I was wondering what some practical uses of the RNN and LSTM architectures are nowadays. It seems to me that transformers have largely superseded RNNs and LSTMs, and most modern language models (BERT, GPT) are transformer-based. Furthermore, transformers are especially beneficial for long-range dependencies in the input.
What are the primary applications and benefits of using RNN-based models in text sequence modeling, particularly in sequence-to-sequence (seq2seq) architectures for tasks like translation and summarization? Additionally, how does the introduction of attention mechanisms enhance prediction by weighting nonsequential elements differently, and how do transformer-based models like BERT and GPT-4 leverage these mechanisms to excel in language understanding tasks?
Chapter 9: Could you explain the relative advantages and limitations of LSTM and GRU cells in handling long-range dependencies in sequential data? Given their structural differences, how do they perform in terms of training efficiency, memory usage, and overall effectiveness in capturing temporal dependencies?
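A quick sketch of one concrete structural difference: a GRU cell has three gate/candidate blocks where an LSTM has four, so for the same hidden size the GRU carries roughly 3/4 of the parameters, which is one reason it is often cheaper to train.

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=1)
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=1)
print(n_params(lstm), n_params(gru))   # LSTM is roughly 4/3 the size of the GRU
```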
Chapter 10: What are the primary factors that influence the effectiveness of different attention mechanisms (such as dot-product, bilinear, and multi-layer perceptron) in transformer models? How do these mechanisms impact the computational efficiency and accuracy of the models across various NLP tasks?
How does the integration of attention mechanisms in sequence-to-sequence models improve the performance and interpretability of text transformation tasks such as translation, summarization, and question-answering compared to traditional RNN-based models?
Can we retrieve individual dimensions of text representations from DL word embedding models? For instance, how can we understand the dimensions’ makeup in Figure 9-6, which uses t-SNE for 2D projection?
Post your questions here about: “Text Learning with Sequences”, “Text Learning with Attention”, Thinking with Deep Learning, chapters 9 & 10