Thinking-with-Deep-Learning-Spring-2022 / Readings-Responses

You can post your reading responses in this repository.

Text Learning - Orientation #5

Open lkcao opened 2 years ago

lkcao commented 2 years ago

Post your questions here about “Text Learning with Sequences” OR “Text Learning with Attention” (Thinking with Deep Learning, chapters 9 & 10).

pranathiiyer commented 2 years ago

We've talked about how embeddings tend to capture the bias of the corpus they are trained on. I was wondering what happens when these are used as pretrained weights for other language models. Is the embedding bias then exacerbated by any systemic bias in the downstream model?

borlasekn commented 2 years ago

In Chapter 9, you talk about the cultural biases (such as gender bias) that can be picked up through word embeddings. Is there a best practice for checking the "accuracy" of these types of models? My thought would be that you should measure them against the literature, but what if the literature has missed something and your model picks up on something you don't know how to describe? How would you ensure that your model is accurately displaying something real and not picking up on noise or some other feature of the data that is not representative of society?
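A minimal sketch of a WEAT-style association test one could run as a sanity check; the word lists and the GloVe checkpoint here are illustrative assumptions, not from the chapter:

```python
# WEAT-style bias probe: compare how close two sets of target words sit
# to an attribute set in embedding space (illustrative word lists).
import gensim.downloader as api
import numpy as np

model = api.load("glove-wiki-gigaword-50")  # small pretrained embeddings

male = ["he", "man", "his"]
female = ["she", "woman", "her"]
career = ["office", "salary", "business"]

def mean_assoc(targets, attributes):
    # average cosine similarity between every target/attribute pair
    return np.mean([model.similarity(t, a) for t in targets for a in attributes])

# a positive gap suggests the "career" words sit closer to male terms
print(mean_assoc(male, career) - mean_assoc(female, career))
```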

sabinahartnett commented 2 years ago

> We've talked about how embeddings tend to capture the bias of the corpus they are trained on. I was wondering what happens when these are used as pretrained weights for other language models. Is the embedding bias then exacerbated by any systemic bias in the downstream model?

Interesting question in the context of this week's CSS workshop, where Professor Wachter discussed how many training/ML biases result from systemic bias and are cyclically perpetuated by their appearance in the models. Is there a 'neutral' model or baseline to compare our own trained embeddings to? And 'neutral' might be context-specific: you could try to manipulate training data and weights to produce a model without gendered biases in its word embeddings, but that might make it difficult to also account for age- or race-based biases.

ValAlvernUChic commented 2 years ago

I was curious about using fine-tuned BERT models for temporal analysis. It's common to align multiple word embedding models for temporal analysis and even use dynamic word embeddings, but I'm wondering whether the same can be done with BERT models to facilitate semantic change analysis. Is it worth doing/does it even make sense to do this with BERT models?
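For the static-embedding case this question mentions, alignment is usually done with orthogonal Procrustes; a minimal numpy sketch, assuming `X_old` and `X_new` are (vocab × dim) matrices over a shared vocabulary:

```python
# Orthogonal Procrustes alignment of two embedding spaces trained on
# different time slices, so word vectors become directly comparable.
import numpy as np

def procrustes_align(X_old, X_new):
    # rotation R minimizing ||X_old @ R - X_new|| (Frobenius norm)
    U, _, Vt = np.linalg.svd(X_old.T @ X_new)
    return X_old @ (U @ Vt)  # old vectors expressed in the new space

# after alignment, the cosine distance between a word's two vectors
# approximates its semantic change across the two periods
```

Whether this trick carries over to contextual BERT embeddings is exactly the open question here, since BERT produces a different vector per occurrence rather than one per word type.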

isaduan commented 2 years ago

If we are working with longer texts, when is it good practice to cut them into smaller pieces? It seems our decision should be based on the size of the embedding, our language models, and our expectation of how similar or different the segments of a long text are to one another?
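One common heuristic is a sliding window sized to the model's context limit; a sketch with assumed values (512-token window, 50-token overlap):

```python
# Split a long token sequence into overlapping chunks so that no span
# of text falls only on a hard boundary; sizes here are assumptions.
def chunk_tokens(tokens, max_len=512, stride=50):
    chunks = []
    step = max_len - stride  # consecutive chunks overlap by `stride`
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks
```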

thaophuongtran commented 2 years ago

From the readings, I understand that the choice of methods/models depends heavily on the researcher's purpose: prediction, classification, comparison, summarization, translation, etc. Given limited training data, memory, and computational power, pre-trained models and transfer learning come into the picture. What are some centralized resources for these pre-trained models? I imagine there are many pre-trained models for text data out there.
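The Hugging Face Hub is probably the largest such centralized resource (TensorFlow Hub and PyTorch Hub are alternatives); a minimal loading example:

```python
# Download a pretrained model and its tokenizer from the Hugging Face
# Hub by name; the same pattern works for thousands of checkpoints.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("transfer learning saves compute", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual vectors per token
```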

javad-e commented 2 years ago

I recently encountered a problem when trying to use text learning methods to analyze text in languages with non-Latin alphabets. I was wondering whether any of the mentioned packages and approaches could be used directly, without additional processing, in such cases. I have seen some researchers Google-Translate all the text, and others replace each character with a combination of Latin characters before applying unsupervised methods. What is the optimal processing approach?
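One direct route, rather than transliteration or machine translation, is a multilingual model whose tokenizer handles non-Latin scripts natively; a small sketch with illustrative strings:

```python
# Multilingual BERT tokenizes non-Latin scripts directly, with no
# transliteration step required.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("ニューラルネットワーク"))  # Japanese
print(tokenizer.tokenize("التعلم العميق"))           # Arabic
```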

ShiyangLai commented 2 years ago

I am wondering whether the bias captured by word embedding models can itself be biased. Word embedding models are built on the corpora they are trained on; corpora, however, are always already biased. Neither a Twitter corpus nor a Wikipedia corpus can represent the general linguistic space. In this sense, the "bias" we capture via different word embedding models is not measured against the same coordinates.

Yaweili19 commented 2 years ago

This chapter really offers me approaches for raising my current Reddit analyses to the deep-learning level. My question, though, concerns the efficacy of fastText. In the example provided, the word "which" is vectorized as the average of its 5 subwords. This makes little sense to me considering how English works. Is fastText just designed for niche, subword-rich environments? What are examples of such environments?
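For reference, a sketch of how fastText-style character n-grams produce exactly those five subwords for "which" when n = 3 (fastText itself defaults to n-grams of length 3 through 6, plus the whole word):

```python
# fastText-style character n-grams; "<" and ">" mark word boundaries.
def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("which"))  # ['<wh', 'whi', 'hic', 'ich', 'ch>']
```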

BaotongZh commented 2 years ago

In chapters 9 and 10, the author introduces many models for dealing with text data (e.g., sequence-to-sequence models). I was wondering about the possibility of, and ways of, translating speech in one language into text in another language by chaining several of the models mentioned in these chapters.
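A hedged sketch of one such chain, pairing a pretrained speech recognizer with a pretrained translator; the specific checkpoints and the audio path are assumptions:

```python
# Speech in English -> text in French by chaining two pipelines.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="facebook/wav2vec2-base-960h")
translator = pipeline("translation_en_to_fr", model="t5-small")

text = asr("speech.wav")["text"]  # hypothetical local audio file
print(translator(text)[0]["translation_text"])
```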

zihe-yan commented 2 years ago

Adding to the bias discussion: last week we discussed bias that's not necessarily a bad thing. Is there a way we could form a research question that actually takes advantage of bias? Certainly, the meaning of the word PRC will be learned very differently by an English NYT-based model and a Chinese People's Daily-based model. But to me this is like looking at an event from different theoretical perspectives; it's natural that we reach different conclusions. In this scenario, we are not training the model to reduce bias; instead, we focus on the bias itself. Could this be one of the solutions to the problem of bias?

yujing-syj commented 2 years ago

For the several models we have already learned, many researchers try combining models for better results. What are some tips for combining different models? Also, when the text data spans several languages, what is the best way to handle them? Do we need to translate the foreign-language text into English and then apply the text models, or can we directly use text models built for each specific language?

linhui1020 commented 2 years ago

Building on Isabella's question, I am also interested in why, when building a model, we always select a maximum number of words or sentences per document (for example, the first 80 or the first 100) instead of using the full document. In addition, in what cases should we not ignore document length? Up to now we have been feeding in all documents without distinguishing longer texts from shorter ones. But, for example, a user who writes longer posts may show more interest than one who writes shorter posts. If we ignore document length, will we miss some interesting information?
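For concreteness, a sketch of explicit truncation alongside keeping document length as its own feature, so that information is not silently thrown away (the length budget is an assumption):

```python
# Truncate to a fixed token budget but retain length as a covariate.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
doc = "a long user post " * 200

enc = tokenizer(doc, truncation=True, max_length=128, return_tensors="pt")
doc_length = len(doc.split())  # keep length for downstream analysis
```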

mdvadillo commented 2 years ago

I have a question about testing the accuracy and strengths of these models. Taking word2vec as an example: the skip-gram algorithm represents rare words and phrases better than CBOW does, while CBOW is better at representing frequent words. How do we define rare versus frequent/common words? Is it with respect to how many times they appear in the corpus, or is it a more subjective definition, as when the researchers themselves judge some words to be rare or unusual?
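In practice the cutoff is usually count-based on the corpus itself; a gensim sketch with an assumed threshold and toy corpus:

```python
# "Rare" made operational: corpus frequency counts plus a threshold.
from collections import Counter
from gensim.models import Word2Vec

corpus = [["deep", "learning", "with", "text"], ["text", "models"]]
freqs = Counter(w for sent in corpus for w in sent)
rare = {w for w, c in freqs.items() if c < 2}  # threshold is an assumption

skipgram = Word2Vec(corpus, sg=1, min_count=1)  # sg=1: skip-gram
cbow = Word2Vec(corpus, sg=0, min_count=1)      # sg=0: CBOW
```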

Emily-fyeh commented 2 years ago

My question relates to those of some peers about the differing word counts/lengths of text inputs. I would like to know whether padding the input sentences influences the fine-tuning of a BERT model. In most cases the sentences have different lengths, and for social media data the discrepancies are even larger. Would this padding undermine the effectiveness of the multi-layer model structure?
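Worth noting here: the tokenizer emits an attention mask that zeroes out the pad positions, which is the mechanism meant to keep padding from contaminating the layers; a small sketch:

```python
# Padding positions get attention_mask = 0, so attention ignores them.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["a short post", "a much longer social media post"],
                  padding=True, return_tensors="pt")
print(batch["attention_mask"])  # rows padded to equal length
```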

chentian418 commented 2 years ago

I have a question about fastText and possible technologies for improving it. Is there a newer version of fastText with a substantial reduction in model size? As the authors mention, one possibility could be conditioning the size of the vectors on their frequency, putting less weight on rare labels.

Moreover, instead of pruning features out, has any newer implementation of fastText decomposed them into smaller units? According to the authors, this may help in cases where training and test examples are very short, such as a single word.

yhchou0904 commented 2 years ago

Based on the reading and some previous works, I have a question regarding the length of the documents we use as model input. So far we have mostly seen embeddings of words, sentences, and whole documents. When a document is very long and we also want to examine its paragraphs, is there a way to do paragraph embedding while still keeping track of the fact that the paragraphs come from the same document/article? If so, how should we implement that in a proper structure?
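One simple pattern is to embed paragraphs individually and carry the document identifier along as metadata; a sketch assuming the sentence-transformers package and an illustrative checkpoint:

```python
# Paragraph embeddings that stay linked to their source document.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = {"doc1": ["first paragraph ...", "second paragraph ..."]}

records = []
for doc_id, paragraphs in docs.items():
    for i, vec in enumerate(model.encode(paragraphs)):
        records.append({"doc": doc_id, "para": i, "vec": vec})
# averaging a document's paragraph vectors gives one document vector
```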

y8script commented 2 years ago

I'm interested in the interpretation of 'attention' in attention-based language models. Why do researchers draw this analogy? Is it possible to find similar patterns or characteristics in human language processing that would help justify the model architecture? This 'attention' seems distinct from human attention, so what does it actually implicate?

min-tae1 commented 2 years ago

I am curious whether the text analysis methods we read about could also be applied to other genres. Most of the examples seem to involve non-fiction, where it is relatively easy to pin down the meaning of a word. However, poems, novels, and scripts contain words open to diverse interpretations, and I am curious whether text analysis would still work, or what additions would be required to expand the scope of analysis to genres beyond non-fiction.

Hongkai040 commented 2 years ago

A question about multi-head attention. I can understand the description of it, but take the example sentence "I saw the suspect with a telescope", where even I don't know the correct interpretation. How can the multi-head mechanism successfully determine which interpretation is right? What kinds of criteria does it use?
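For reference, a minimal numpy sketch of the scaled dot-product attention each head computes; with multiple heads, each learns its own Q/K/V projections, so one head may weight "with a telescope" toward "saw" while another weights it toward "suspect". There is no explicit criterion; the weights are whatever the training objective rewards:

```python
# Scaled dot-product attention for a single head (numpy sketch).
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise relevance
    scores -= scores.max(-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(-1, keepdims=True)  # softmax over keys
    return weights @ V                         # weighted mix of values
```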

hsinkengling commented 2 years ago

I'm still having trouble understanding the visualization for attention. It seems that most of the attention is paid to the start and separator tokens in each sentence. How does this help us understand the performance of neural networks?
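For anyone wanting to poke at this directly, the raw attention tensors can be pulled out of a pretrained model; a sketch (heavy mass on the [CLS]/[SEP] tokens is a commonly reported pattern, sometimes read as "no-op" attention):

```python
# Extract per-layer attention weights from BERT for inspection.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_attentions=True)

inputs = tokenizer("the suspect held a telescope", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
print(out.attentions[0].shape)  # (batch, heads, seq_len, seq_len) per layer
```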

sudhamshow commented 2 years ago

A couple of questions regarding the attention mechanism and transformers:

  1. It still seems metaphysical to me how the attention mechanism accesses all words in parallel. How are these states stored and processed?
  2. Why do Transformers perform so much better than (gated, LSTM/GRU) bidirectional RNNs? I understand that in RNNs information is lost to compression, but why does having full random access to words make a model perform so much better on generative and other downstream tasks?
  3. One of the main criticisms of RNNs was that they were slow to train (sequential processing), and a method that accesses words in parallel (multi-headed attention) performed much better in training time and accuracy. In contradiction, however, the time complexity of transformers (dN^2 + Nd^2) is much worse than an RNN's (N*d^2) whenever the sequence length N exceeds the dimension d (which is most of the time), since attention must be applied to every word pair (N×N). Why do Transformers then still achieve shorter training times than RNNs for similar tasks and data? (See the sketch below.)
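A back-of-the-envelope sketch for question 3, with assumed shapes; raw FLOPs can indeed favor the RNN, but wall-clock time on parallel hardware tracks sequential depth, where attention is one parallel step and an RNN is a chain of N dependent steps:

```python
# Rough cost comparison: FLOPs vs. sequential depth (shapes assumed).
N, d = 2048, 512                   # long sequence, moderate width

attention_flops = N * N * d        # every position attends to every other
rnn_flops = N * d * d              # N steps, each a d x d matrix multiply
print(attention_flops, rnn_flops)  # attention does more arithmetic here

depth_attention = 1                # all positions computed at once
depth_rnn = N                      # step t must wait for step t-1
print(depth_attention, depth_rnn)  # this gap dominates training time
```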