From reading "Text Learning with Attention" - is perplexity used during the training process or in feed forward?
I am wondering about current seq2seq applications. Chapter 10 outlines some limitations with translating in sequence. For example, translating the French phrase “le chat noir” to “black cat” presents challenges because the order of words is not consistent, so a literal/sequential translation would output “cat black” instead of “black cat”. To my understanding, the alternative method for translating short sentences is outlined through LSTM methods. Is this what current, successful translation services like Google Translate use? Or do LSTM implementations only work on shorter texts?
A question on next token generation: why does randomly sampling from a probability distribution proportional to the scores (vs. just choosing the token with the highest score) improve model performance?
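A toy sketch contrasting the two decoding strategies in the question, with made-up logits over a tiny vocabulary: greedy decoding always picks the same token and tends to produce repetitive, low-diversity text, whereas sampling in proportion to the scores introduces variety.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["black", "cat", "the", "noir"]          # toy vocabulary
logits = np.array([2.0, 1.5, 0.3, 0.1])          # made-up next-token scores

probs = np.exp(logits) / np.exp(logits).sum()    # softmax over the scores

greedy_token = vocab[int(np.argmax(probs))]               # always the same output
sampled_token = vocab[rng.choice(len(vocab), p=probs)]    # varies run to run
print(greedy_token, sampled_token)
```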
What are the comparative effects of fine-tuning different transformer-based models (such as BERT) on domain-specific tasks like legal document analysis or crime classification, in terms of both accuracy and computational efficiency? In a previous project, I noticed that GPT-4 already achieves high accuracy with self-ask prompt engineering, with an inter-model reliability (treating human coders as one of the models) of 0.916.
I am curious about how different word embedding models (word2vec-CBOW, word2vec-skipgram, FastText, GloVe) perform relative to each other. As we saw in the HW for last week, when trained on a small corpus, word embedding models may reach very different estimates of word similarities. Is the same true for these different models? If you trained them all on the same (large-sized) corpus, how much variation in the embeddings would you expect?
In Chapter 10, the attention mechanism is described as a “retrieval process that calculates dynamic alignment weights,” which incorporates information on relevant parts of your input sequence and can be sent to the decoder for correct output. In this case, can you use attention directly on the encoder inputs without an RNN, since you do not have to backprop the sequence into a hidden, dynamic state?
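A toy numpy sketch of scaled dot-product self-attention applied directly to (random stand-in) input embeddings, with no recurrent state at all; this is essentially what a transformer encoder layer does.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 3, 4
X = rng.normal(size=(seq_len, d))     # stand-in for input token embeddings

Q, K, V = X, X, X                     # self-attention: queries = keys = values
scores = Q @ K.T / np.sqrt(d)         # dynamic alignment weights (pre-softmax)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ V                 # each row is a weighted mix of all inputs
print(weights.round(2))
```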
The article on "Text Learning with Attention" from Chapter 10 discusses advanced text processing techniques using Recurrent Neural Networks (RNNs), attention mechanisms, and transformers in natural language processing tasks like translation, summarization, and question-answering. When it comes to attention mechanism specificity: How do different attention mechanisms (like the ones used in the Bahdanau vs. Luong models) impact the performance and outcome of specific language processing tasks? Also, given the computational demands of models like GPT-3 and BERT, what are some effective strategies to scale these models down for use in more constrained environments without significantly compromising their effectiveness?
In Chapter 9, I am unclear about how we can effectively evaluate the performance of RNNs in tasks beyond traditional metrics, especially considering interpretability and the ability to generalize across different types of sequence learning tasks.
Can word embeddings distinguish between words that are closely related but opposite in meaning? For example, 'happy' and 'sad' seem to cluster closely together in semantic space, presumably because they are often used in similar contexts, but their meanings are opposite.
Is there a method for comparing the performance of various word embedding models? If so, is it common practice to include this comparison in your paper?
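A sketch touching both questions above, assuming gensim is installed and the pretrained "glove-wiki-gigaword-50" vectors can be downloaded: antonyms such as "happy"/"sad" often score as highly similar because they occur in similar contexts, and a common way to compare embedding models is to score each one on the same intrinsic benchmark (here the WordSim-353 human similarity judgments bundled with gensim's test data).

```python
import gensim.downloader as api
from gensim.test.utils import datapath

glove = api.load("glove-wiki-gigaword-50")       # pretrained vectors (download)

print(glove.similarity("happy", "sad"))          # often surprisingly high
print(glove.similarity("happy", "table"))        # unrelated word, for contrast

# The same call would be repeated for each model you want to compare.
pearson, spearman, oov_pct = glove.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(spearman, oov_pct)   # rank correlation with human ratings, % out-of-vocab pairs
```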
From Chapter 9, word embeddings like word2vec capture semantic similarities between words based on their distribution in language. However, they may inherit biases present in the training data, such as gender or cultural biases (certain professions might be associated with specific genders in the embeddings). The reading shows that techniques like debiasing or retrofitting can mitigate these biases. Could we elaborate further on the debiasing techniques mentioned in the text? How effective are these techniques in mitigating biases, and are there any limitations or drawbacks associated with them?
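A highly simplified sketch of one debiasing idea, in the spirit of hard debiasing (Bolukbasi et al.): estimate a bias direction from seed word pairs and project it out of other word vectors. The embeddings below are random toy stand-ins, not real word2vec vectors, so this only illustrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["he", "she", "nurse", "engineer"]}

def unit(v):
    return v / np.linalg.norm(v)

bias_dir = unit(emb["he"] - emb["she"])            # crude "gender direction"

def debias(v, direction):
    return v - (v @ direction) * direction         # remove the bias component

emb["nurse"] = debias(emb["nurse"], bias_dir)
emb["engineer"] = debias(emb["engineer"], bias_dir)
```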
In the GPT series of models, auto-regression and conditional sampling are core mechanisms. I wonder, do the models of various sizes that have been recently released also employ these same principles and methods?
What we know about a language model (hopefully) are the architecture/method and the source of text used. Today, we have many different open-source models and API LLM services to accomplish the tasks discussed in the textbook. How do we compare their differences (given we know little about their architectures and training resources)? Also, since we can create embeddings of our own text corpus using those existing services, is there a meaningful interpretation in comparing the embeddings created across models? Are there methods other than comparing cosine similarity to evaluate semantic differences?
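One possible sketch of a comparison that goes beyond raw cosine similarities: since embeddings from different models live in different, unaligned spaces, you can instead compare their relative geometry, e.g. the rank correlation of pairwise similarities over the same word list. The two embedding matrices below are random stand-ins for the outputs of two hypothetical embedding services.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
words = ["cat", "dog", "car", "truck", "happy"]
emb_a = rng.normal(size=(len(words), 64))
emb_b = rng.normal(size=(len(words), 128))     # different dimensionality is fine

def pairwise_cos(E):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    return sims[np.triu_indices(len(E), k=1)]  # upper triangle, no diagonal

rho, _ = spearmanr(pairwise_cos(emb_a), pairwise_cos(emb_b))
print(rho)  # high rho => the two models agree on the relative similarity structure
```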
In Chapters 9 and 10, we have learned many text analysis tools. I am curious about the robustness of text analysis results following translation. For instance, suppose we use the summarization pipeline in transformers to summarize an English news article, and then summarize it again after it has been translated into Chinese: how consistent are the two summaries? Are there translation tools that better maintain the robustness of text analysis results post-translation?
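A rough sketch of that consistency check, assuming the transformers library and the named checkpoints can be downloaded ("Helsinki-NLP/opus-mt-en-zh" is one English-to-Chinese translation model). The default summarization checkpoint is English-only, so the translated article would still need a multilingual summarizer before the two summaries could be compared.

```python
from transformers import pipeline

summarizer = pipeline("summarization")                                   # default English model
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")

article = (
    "The city council voted on Tuesday to expand the bike lane network. "
    "Supporters argued it would reduce traffic, while critics cited the cost."
)
summary_en = summarizer(article, max_length=40, min_length=5)[0]["summary_text"]
article_zh = translator(article)[0]["translation_text"]
print(summary_en)
print(article_zh)  # next step: summarize this with a multilingual model and compare
```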
In RNNs, how do we tell we're running into vanishing gradient and exploding gradient problems? Are there other problems we need to pay attention to when dealing with text data using DL models?
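A minimal PyTorch sketch of one practical diagnostic: log per-layer gradient norms during training. Norms shrinking toward zero in earlier layers suggest vanishing gradients; norms blowing up (or NaN losses) suggest exploding gradients. The tiny RNN and random batch below are toy stand-ins.

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
head = nn.Linear(16, 2)
x = torch.randn(4, 50, 8)              # 4 sequences of length 50
y = torch.randint(0, 2, (4,))

out, _ = rnn(x)
loss = nn.functional.cross_entropy(head(out[:, -1]), y)
loss.backward()

for name, p in rnn.named_parameters():
    print(name, p.grad.norm().item())  # watch these across layers and over time

# A common mitigation for exploding gradients, applied before optimizer.step():
torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)
```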
What are the main types of models used for text learning with sequences? How do Recurrent Neural Networks (RNN) compare to Hidden Markov Models (HMM) in text generation?
Chapter 9 introduces several techniques to mitigate the effect of biased training data, like retrofitting and debiasing. I'd like to know more about the potential problems these techniques might cause for the accuracy of RNN models.
Is there a way to retrieve the individual dimensions of the representations of text data within, or resulting from, a DL word embedding model? For instance, Figure 9-6 in Chapter 9 projects the data into 2D space after dimension reduction with t-SNE, and visually we have an x-dimension and a y-dimension. Is there any way to understand the makeup of these dimensions? (I'm aware this may not be the best example because this is a word2vec embedding.) Intuitively, these dimensions must be capturing some kind of pattern that is more present in some parts of the data than others; these patterns are likely not captured by independent variables like “gender” or “income” but by a combination of such seemingly independent variables.
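One small sketch of a related probe: t-SNE axes are generally not interpretable, but a linear reduction such as PCA gives axes you can inspect by ranking words along each component. The embeddings below are random stand-ins for a trained word2vec model, so the "axes" here are meaningless; with real vectors the extremes of each axis often hint at what varies along it.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "paris", "france", "cat", "dog"]
emb = rng.normal(size=(len(words), 100))   # stand-in word vectors

coords = PCA(n_components=2).fit_transform(emb)
for axis in range(2):
    order = np.argsort(coords[:, axis])
    print(f"axis {axis}: low = {words[order[0]]}, high = {words[order[-1]]}")
```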
I was wondering what some practical uses of the RNN and LSTM architectures are nowadays. It seems to me that transformers have largely superseded RNNs and LSTMs, and most modern language models (BERT, GPT) are transformer-based. Furthermore, transformers are especially beneficial for long-range dependencies in the input.
What are the primary applications and benefits of using RNN-based models in text sequence modeling, particularly in sequence-to-sequence (seq2seq) architectures for tasks like translation and summarization? Additionally, how does the introduction of attention mechanisms enhance prediction by weighting nonsequential elements differently, and how do transformer-based models like BERT and GPT-4 leverage these mechanisms to excel in language understanding tasks?
Chapter 9: Could you explain the relative advantages and limitations of LSTM and GRU cells in handling long-range dependencies in sequential data? Given their structural differences, how do they perform in terms of training efficiency, memory usage, and overall effectiveness in capturing temporal dependencies?
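A quick sketch of one concrete structural difference: a GRU cell has three gate/candidate blocks where an LSTM has four, so for the same hidden size the GRU carries roughly 3/4 of the parameters, which is one reason it is often cheaper to train.

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=1)
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=1)
print(n_params(lstm), n_params(gru))   # LSTM is roughly 4/3 the size of the GRU
```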
Chapter 10: What are the primary factors that influence the effectiveness of different attention mechanisms (such as dot-product, bilinear, and multi-layer perceptron) in transformer models? How do these mechanisms impact the computational efficiency and accuracy of the models across various NLP tasks?
How does the integration of attention mechanisms in sequence-to-sequence models improve the performance and interpretability of text transformation tasks such as translation, summarization, and question-answering compared to traditional RNN-based models?
Can we retrieve individual dimensions of text representations from DL word embedding models? For instance, how can we understand the dimensions’ makeup in Figure 9-6, which uses t-SNE for 2D projection?
Post your questions here about: “Text Learning with Sequences”, “Text Learning with Attention”, Thinking with Deep Learning, chapters 9 & 10