UChicago-Computational-Content-Analysis / Readings-Responses-2023


2. Counting Words & Phrases - fundamental #47

JunsolKim opened this issue 2 years ago

JunsolKim commented 2 years ago

Post questions here for this week's fundamental readings: Grimmer, Justin, Molly Roberts, and Brandon Stewart. 2022. Text as Data. Princeton University Press: Chapters 5, 7, 9, 11, 16 —“Bag of Words”, “The Vector Space Model and Similarity Metrics”, “Using Phrases to Improve Visualization”, “Discriminating Words”, “Word Counting”.

pranathiiyer commented 2 years ago

Hi all, Chapter 6 describes regularisation as adding pseudo data to push estimates away from zero. I was curious how one arrives at the value of alpha, or decides what kind of data must be added to a sample to avoid overfitting. It would also be helpful to understand how regularisation relates to a prior distribution, as mentioned in the chapter.
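
As a concrete toy illustration of the pseudo-data idea, adding a pseudo-count alpha to raw word counts pulls zero estimates away from zero; how large alpha should be is exactly the judgment call asked about above (the numbers here are made up):

```python
import numpy as np

counts = np.array([0, 3, 7])   # toy word counts in one document
alpha = 1.0                    # pseudo-count; larger alpha = stronger regularisation

mle = counts / counts.sum()                            # raw estimates: the zero stays zero
smoothed = (counts + alpha) / (counts + alpha).sum()   # pseudo data pulls estimates away from zero

print(mle)       # [0.  0.3 0.7]
print(smoothed)  # roughly [0.077, 0.308, 0.615]
```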

hsinkengling commented 2 years ago

A lot of the methods introduced here require some form of human intervention, such as predefining phrases to code as n-grams, selecting keywords, selecting reference texts, or selecting stopwords. For many of these tasks, the human involved may need to make an arbitrary decision out of a wide range of possible choices. As a deliberately bad example, say "computational", "social", and "science" are candidates for custom stop words at a computational social science conference. In this case, there are 2^3 possible choices a human can make about which of them to treat as stop words.

Since it would be computationally costly to go through each of those choices and run and validate the model 8 times (or more), how should researchers navigate this space of arbitrary decisions? (MTurk, maybe?)

GabeNicholson commented 2 years ago

A lot of the methods introduced here require some form of human intervention, such as predefining phrases to code as n-grams, selecting keywords, selecting reference texts, or selecting stopwords. For many of these tasks, the human involved may need to make an arbitrary decision out of a wide range of possible choices. As a deliberately bad example, say "computational", "social", and "science" are candidates for custom stop words at a computational social science conference. In this case, there are 2^3 possible choices a human can make about which of them to treat as stop words.

Since it would be computationally costly to go through each of those choices and run and validate the model 8 times (or more), how should researchers navigate this space of arbitrary decisions? (MTurk, maybe?)

Great question. In the textbook they mention a clever way to solve this: compare the probability of seeing the words together as a pair with the probability expected if each word occurred independently. So in your example, the probability of seeing "Computational Social Science" together would be much higher than the product of the probabilities of the three words on their own, so we would count computational social science as a single token. For a large corpus these calculations become harder. Also, stop words are typically very high-frequency words that aren't nouns; just looking at the top 30 (or top n) words in a corpus can give you a good idea of which words to leave out.
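
A rough sketch of that comparison using pointwise mutual information (PMI), one common way to score candidate phrases (toy corpus, not necessarily the book's exact statistic):

```python
import math
from collections import Counter

# Toy token stream; in practice this would be a whole corpus.
tokens = ("computational social science is growing and "
          "computational social science is fun but "
          "science is hard and social life is busy").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(pair):
    """Pointwise mutual information: how much more often the pair occurs
    together than expected if the two words appeared independently."""
    w1, w2 = pair
    p_pair = bigrams[pair] / (n - 1)
    return math.log(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

# Bigrams with the highest scores are candidates to be treated as phrases
for pair in sorted(bigrams, key=pmi, reverse=True)[:5]:
    print(pair, round(pmi(pair), 2))
```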

facundosuenzo commented 2 years ago

Chapter 5 discusses different processes for reducing complexity. Is it recommended to first analyze a sample to get an idea of which words, for instance, we should lemmatize (and to understand which model best suits the analysis)? How should we select this subsample? If the order of the steps matters and affects the final document we will work with, would filtering by frequency, to understand which words are prevalent yet valuable, be a good way of starting? Secondly, in terms of validity, what are the consequences of treating each word as generated independently of all other words under the multinomial distribution model? Is this independence within a document or across documents?
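
On the idea of inspecting a subsample first, a minimal sketch of what that might look like with spaCy lemmatization (spaCy and its en_core_web_sm model are just one assumed choice here):

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

sample = ["The organizers were debating the proposals.",
          "She debates better than the other organizers."]

# Eyeballing (token, lemma) pairs on a few documents before committing to a pipeline
for text in sample:
    doc = nlp(text)
    print([(tok.text, tok.lemma_) for tok in doc if tok.is_alpha])
```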

isaduan commented 2 years ago

On the use of "text reuse" or "plagiarism detection": the assumption is that common sequences of text appearing across a pair of documents are sufficiently long that it is improbable they appeared by chance. But this assumption seems to err on the side of false negatives over false positives (e.g., it would miss paraphrase). Is this an accurate interpretation of what the method is doing?
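
A minimal, generic sketch of the shingling idea behind text-reuse detection; note that exact k-gram overlap scores a paraphrase as zero, which is consistent with the false-negative concern above:

```python
def shingles(tokens, k=5):
    """All overlapping k-grams ('shingles') in a token sequence."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def reuse_score(doc_a, doc_b, k=5):
    """Jaccard overlap of the two documents' k-gram sets; long shared
    sequences are unlikely to appear in both documents by chance."""
    a, b = shingles(doc_a.split(), k), shingles(doc_b.split(), k)
    return len(a & b) / len(a | b) if (a | b) else 0.0

original = "the quick brown fox jumps over the lazy dog near the river bank"
copied   = "as noted the quick brown fox jumps over the lazy dog every day"
print(reuse_score(original, copied, k=5))   # > 0: the two share 5-gram sequences
```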

LuZhang0128 commented 2 years ago

Chapter 5 introduces a standard procedure for when we have a collection of documents and want to find a basic general trend. This is the default process I have used in research. However, Chapter 5.6 points out that we need to rethink this process. If lemmatizing or removing stop words does not dramatically improve the performance or efficiency of the analysis, should we simply not do it?

hsinkengling commented 2 years ago

A lot of the methods introduced here require some form of human intervention, such as predefining phrases to code as n-grams, selecting keywords, selecting reference texts, or selecting stopwords. For many of these tasks, the human involved may need to make an arbitrary decision out of a wide range of possible choices. As a deliberately bad example, say "computational", "social", and "science" are candidates for custom stop words at a computational social science conference. In this case, there are 2^3 possible choices a human can make about which of them to treat as stop words. Since it would be computationally costly to go through each of those choices and run and validate the model 8 times (or more), how should researchers navigate this space of arbitrary decisions? (MTurk, maybe?)

Great question. In the textbook they mention a clever way to solve this: compare the probability of seeing the words together as a pair with the probability expected if each word occurred independently. So in your example, the probability of seeing "Computational Social Science" together would be much higher than the product of the probabilities of the three words on their own, so we would count computational social science as a single token. For a large corpus these calculations become harder. Also, stop words are typically very high-frequency words that aren't nouns; just looking at the top 30 (or top n) words in a corpus can give you a good idea of which words to leave out.

Thanks, I missed that part. I meant to ask it in a broader context: how should researchers go about making arbitrary decisions for the parts that require human intervention in general?

ValAlvernUChic commented 2 years ago

Chapter 5.6 mentions that stemming and stop word removal by default might be detrimental to topic model performance. Does anyone happen to have an example of this? Also, the removal of stop words seems most needed for the bag-of-words method (probably why it's under that chapter), but I was wondering whether there are cases where we'd want to remove them if we were interested in n-grams. Intuitively, it seems like removing them might then affect phrase meaning quite drastically.

konratp commented 2 years ago

In Chapter 11, the authors show how, by emphasizing different aspects of an analysis (e.g. prevalence vs. distinctiveness), one can come to very different conclusions about a set of texts. In the example drawn from Nelson's (2020) study of feminist movements in New York and Chicago, nothing fundamentally changed about the results, though the underlying explanation did. What should a social scientist do, however, when they discover contradictions in the data after analyzing a sample of texts with different emphases on prevalence vs. distinctiveness?

mikepackard415 commented 2 years ago

Dependency parsing and the subsequent extraction of names, relations, and events (Chapter 9) sounds really interesting and useful. The authors mention that these methods are "relatively underutilized in the social sciences." Do we have a sense of why that might be? Are these methods fraught with unforeseen challenges, or are they just not as useful as they sound?
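
For anyone curious what such a parse yields in practice, a minimal sketch with spaCy (one common toolkit; assumes the en_core_web_sm model has been downloaded):

```python
import spacy

# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The senator criticized the committee's report on Tuesday in Washington.")

# Dependency parse: each token's grammatical relation and its head word
for tok in doc:
    print(f"{tok.text:12} {tok.dep_:10} -> {tok.head.text}")

# Named entities extracted by the same pipeline
print([(ent.text, ent.label_) for ent in doc.ents])
```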

ZacharyHinds commented 2 years ago

Chapter 9 discusses the use of phrases to improve visualizations. The method of analyzing text reuse was particularly interesting to me, as it seems potentially very powerful for comparative analyses. That said, as the authors mention, it is computationally expensive, though they point to methods and tools that are expanding access. I wonder, then, how this and other methods can be further improved to address the issues raised earlier about the loss of specific context or meaning when we strip away too much for the sake of computational efficiency.

sudhamshow commented 2 years ago

I am a little unclear about the formulation of the tf-idf metric. The authors reiterate in Chapter 6 that it is beneficial to eliminate from the corpus both very frequent words (as they do not provide much distinction between documents) and rare words (results wouldn't be generalisable), keeping the sweet middle portion of the Zipf's-law curve. But since the tf-idf weights are proportional to log(N/n_j), while this gives less prominence to frequent words, does it not amplify rare words?

Also, how are cases where n_j = 0 handled, where the metric becomes undefined?

hshi420 commented 2 years ago

In the chapter on dictionary methods, the authors mention that researchers might need to change the dictionary when applying it to other datasets. How exactly can we modify an existing dictionary so that it can be applied to a new dataset, and will this be more time-efficient than constructing a new dictionary from scratch?
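
A minimal sketch of one way to adapt a dictionary rather than rebuild it: keep the base lexicon and add corpus-specific terms (all words below are made up for illustration):

```python
from collections import Counter

base_lexicon = {"happy", "glad", "pleased", "joy"}   # hypothetical off-the-shelf dictionary
domain_terms = {"stoked", "hyped"}                   # terms surfaced while reading the new corpus
lexicon = base_lexicon | domain_terms

def dictionary_score(text, lexicon):
    """Share of a document's tokens that fall in the dictionary."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return sum(counts[w] for w in lexicon) / max(len(tokens), 1)

print(dictionary_score("so hyped and glad about the results", lexicon))
```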

zixu12 commented 2 years ago

Chapter 5 discusses the unit of analysis, and a general question popped into my mind: nowadays there are such rich text resources that it is impossible for us to collect and analyze all the data, even when they are available. I am wondering, are there any rules for selecting the sample size, or is it generally the larger the better?

NaiyuJ commented 2 years ago

I have a question about the tokenization of words. In English, we usually segment words using whitespace. For other languages like Chinese and Japanese, the authors mention that we need a word segmentation model. In this situation, we can either use a word segmentation model or translate the text into English. I'm wondering which one is better, and in which situations we should choose which.

sizhenf commented 2 years ago

Adding on to @NaiyuJ's comment: First, in my opinion, it seems preferable to use a word segmentation model to tokenize texts in character-based languages such as Chinese and Japanese, since if we translate them into English, we lose a lot of information in the process of translation.

Second, in my understanding, word segmentation models tokenize text with the help of a dictionary, but suppose the text contains words that are not in the dictionary (for instance, trendy internet phrases); it seems to me that word segmentation models may not perform as well in identifying them. My question is, in general, what options do we have for dealing with these words?
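
On the out-of-vocabulary point, segmentation libraries usually let you patch the dictionary by hand; a minimal sketch with jieba (a common Chinese segmenter; the added phrase is just an example):

```python
import jieba  # a commonly used Chinese word segmentation library

text = "计算社会科学很有趣"   # roughly: "computational social science is fun"
print(jieba.lcut(text))       # default segmentation from jieba's built-in dictionary

# New or trendy terms that the dictionary misses can be registered manually
jieba.add_word("计算社会科学")
print(jieba.lcut(text))       # the added phrase may now come out as a single token
```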

Emily-fyeh commented 2 years ago

Among this week's chapters, I was impressed by the exemplary case in Chapter 11, where Nelson (2020) differentiates the wording of Chicago and New York feminist organizations. I think it is an optimal match between research question and methodology. I am just curious how this approach (and the alternative of fictitious prediction problems) would work on a dataset where the categories are unknown or uncertain. Perhaps the interpretation would be tricky, just like trying to interpret each topic when applying topic modeling to blurry data.

YileC928 commented 2 years ago

I have a few small questions regarding Chapters 5 and 7.

  1. The authors say that in some cases splitting text into smaller documents can substantially increase or decrease the computational cost of fitting a model and may make the model more or less statistically efficient. I'm just wondering if there are examples of those scenarios.
  2. How should we deal with words that change meaning when converted to lowercase, e.g., China and china? (See the sketch after this list.)
  3. Among the properties for similarity measurement, the second one suggests that 'two documents that share no words in common should have the minimum similarity'. What if one sentence perfectly paraphrases the other, so that the two use completely different words but carry the same meaning?
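
On point 2, one possible workaround (a sketch, assuming spaCy's tagger identifies the proper nouns correctly) is to lowercase selectively and leave proper nouns untouched:

```python
import spacy

# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def cautious_lowercase(text):
    """Lowercase everything except tokens tagged as proper nouns,
    so 'China' (the country) is not collapsed into 'china' (the dishware)."""
    doc = nlp(text)
    return " ".join(t.text if t.pos_ == "PROPN" else t.text.lower() for t in doc)

print(cautious_lowercase("China exports fine china to Turkey."))
```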

MengChenC commented 2 years ago

For the similarity calculation, we can employ different metrics, and they carry different information. It is hard to say there is one overall metric that suits all research questions or datasets. I am wondering whether we can combine several metrics in the model training stage to get better performance. Think of ensembling: if the individual models carry varied, non-duplicate information, then the ensemble is always better (or at least not worse) than each individual model.

chentian418 commented 2 years ago

I have a question about Chapter 7 regarding the vector space model and similarity metrics. Cosine similarity captures the angle between two document vectors regardless of their magnitudes, but I am still a bit confused about what cosine similarity means as a distance between two words or documents: how does the angle reflect distance in a specific context? Moreover, I am curious how to interpret the angle in a social science context. Thanks!
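
A toy illustration of why the angle rather than the vector length carries the signal: a long and a short document on the same topic point in the same direction, while documents sharing no vocabulary are orthogonal (all counts below are made up):

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Counts over the toy vocabulary ["tax", "budget", "soccer"]
long_policy_doc  = np.array([10, 5, 0])
short_policy_doc = np.array([2, 1, 0])   # same topic, one fifth the length
sports_doc       = np.array([0, 0, 7])

print(cosine(long_policy_doc, short_policy_doc))  # 1.0: same direction despite different lengths
print(cosine(long_policy_doc, sports_doc))        # 0.0: no shared vocabulary
```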

Qiuyu-Li commented 2 years ago

Adding on to the comments of @NaiyuJ and @sizhenf: I mostly agree with Sizhen's idea that direct tokenization would maximize information retention. However, the Orientation paper this week actually shows us an exception, where books in different languages are translated into English and then kept in Google's enormous database. Translation seems to be a more efficient approach when cross-language comparative studies are involved.

And this leads to a new idea: wouldn't it be interesting for linguists to study how different languages are preprocessed in NLP in different ways? I imagine this would shed some light on how languages differ from one another in some sort of "machine structure".

Jiayu-Kang commented 2 years ago

Adding on to the comments of @NaiyuJ and @sizhenf: I mostly agree with Sizhen's idea that direct tokenization would maximize information retention. However, the Orientation paper this week actually shows us an exception, where books in different languages are translated into English and then kept in Google's enormous database. Translation seems to be a more efficient approach when cross-language comparative studies are involved.

And this leads to a new idea: wouldn't it be interesting for linguists to study how different languages are preprocessed in NLP in different ways? I imagine this would shed some light on how languages differ from one another in some sort of "machine structure".

I don't think there is necessarily a tradeoff between tokenization and translation, though. My understanding is that appropriate tokenization may be used to improve the accuracy of translation. But I do like the research question you raised - it would definitely be interesting to look at the linguistics behind different ways of preprocessing languages!

Sirius2713 commented 2 years ago

I am a little unclear about the formulation of the tf-idf metric. The authors reiterate in Chapter 6 that it is beneficial to eliminate from the corpus both very frequent words (as they do not provide much distinction between documents) and rare words (results wouldn't be generalisable), keeping the sweet middle portion of the Zipf's-law curve. But since the tf-idf weights are proportional to log(N/n_j), while this gives less prominence to frequent words, does it not amplify rare words?

Also, how are cases where n_j = 0 handled, where the metric becomes undefined?

I think tf-idf amplifies words that are rare across the whole corpus but common in some specific documents. If a word is too rare, even though the log(N/n_j) part becomes large, the term frequency will be too small to make the word important. And when n_j = 0, I think people usually use n_j + 1 instead of n_j.
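
A minimal sketch of that calculation on a toy count matrix, showing both the n_j + 1 fix mentioned above and the smoothed variant scikit-learn happens to use by default:

```python
import numpy as np

# Toy document-term counts: rows = documents, columns = terms
X = np.array([[3., 0., 1.],
              [2., 0., 0.],
              [1., 5., 1.]])
N = X.shape[0]

tf = X / X.sum(axis=1, keepdims=True)        # term frequency within each document
df = (X > 0).sum(axis=0)                     # n_j: number of documents containing term j

idf_plus_one = np.log(N / (1 + df))          # the n_j + 1 fix suggested above
idf_smooth = np.log((1 + N) / (1 + df)) + 1  # scikit-learn's default smoothed idf

print(tf * idf_smooth)  # the rare-but-locally-frequent term gets the largest weight here
```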

melody1126 commented 2 years ago

For the distance metrics in Chapter 7.2, what inferences can we make from measuring the distance over time between two phrases (e.g., liberal education vs. vocational/employment)? For instance, can I make inferences about how the ideas behind both phrases have developed and moved further apart from each other?

kelseywu99 commented 2 years ago

For the regularization in the fighting words algorithm in Chapter 11.2, the authors mention that this method may encounter challenges when applied to a continuous category, "such as the ideology of a speaker". I was curious why continuous categories may not simply be treated as a set of discrete categories over time?
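
For concreteness, a rough sketch of the regularized log-odds idea (in the spirit of Monroe et al.'s fighting-words statistic, with two discrete groups and made-up counts); the pseudo-count alpha is the regularization in question:

```python
import numpy as np

def fighting_words_z(counts_a, counts_b, alpha=0.01):
    """Regularized log-odds-ratio of word use in group A vs. group B.
    alpha acts as a pseudo-count (a Dirichlet-style prior) that shrinks
    estimates for rare words; large |z| marks a discriminating word."""
    a = np.asarray(counts_a, dtype=float)
    b = np.asarray(counts_b, dtype=float)
    alpha0 = alpha * len(a)
    delta = (np.log((a + alpha) / (a.sum() + alpha0 - a - alpha))
             - np.log((b + alpha) / (b.sum() + alpha0 - b - alpha)))
    var = 1.0 / (a + alpha) + 1.0 / (b + alpha)
    return delta / np.sqrt(var)

# Toy counts over the vocabulary ["economy", "family", "climate"]
print(fighting_words_z([40, 10, 2], [12, 11, 30]))
```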

Hongkai040 commented 2 years ago

Chapter 7, "The Vector Space Model and Similarity Metrics", presents the big idea of representing documents as vectors to measure similarity and distance. It tells us how to assign weights to features, but how do we select the features within documents to construct the vector space?
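
In practice, many of those feature choices are made in the vectorizer itself: the tokenization pattern, n-gram range, stop-word list, and frequency cutoffs that trim very rare or very common terms. A minimal scikit-learn sketch with toy documents and illustrative settings:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the budget bill passed today",
        "the soccer match was played today",
        "the new budget cuts taxes"]

vectorizer = CountVectorizer(
    ngram_range=(1, 2),      # unigrams and bigrams as features
    stop_words="english",    # drop a standard stop-word list
    min_df=1,                # drop terms appearing in fewer than min_df documents
    max_df=0.9,              # drop terms appearing in more than 90% of documents
)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the selected feature space
print(X.toarray())                          # document-term matrix
```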

ttsujikawa commented 2 years ago

Regarding the tf-idf metric, I understand it is a highly efficient way to define a hierarchy of importance for each term. However, even if we remove very frequent and very rare words, there may still be such large gaps among term frequencies that the importance of some terms is over-represented. How could we overcome this issue?

Jasmine97Huang commented 2 years ago

The fundamental reading discusses the advantage of cosine similarity, as it is normalized by the magnitudes of the vectors. The chapter discusses these similarity metrics in the context of document vectors whose values are word counts; I am wondering whether, for word-level embeddings, the magnitude also matters.

yuzhouw313 commented 5 months ago

Chapter 11 discusses the idea of discriminating words in documents, the trade-off between distinctiveness and prevalence, and some statistical examples. However, after reading the chapter I still find it hard to conceptualize "discriminating words." My understanding is that such words are distinctive in that they can be used to predict the documents, or categories of documents, they come from. Given that the chapter touches on the trade-off between distinctiveness and prevalence and emphasizes the challenge of finding words that are both unique to a category and common enough to provide meaningful insight, how can we measure and validate the distinctiveness of discriminating words? Specifically, how can we incorporate what we read about word counting, bag of words, word embeddings, etc. into this task?

XiaotongCui commented 5 months ago

Chapter 5 mentions converting text to lowercase. However, I believe this may cause problems, since some words have different meanings when capitalized versus uncapitalized, for example Polish and polish, or Turkey and turkey.

joylin0209 commented 5 months ago

After reading Chapter 9: if some words involve figures of speech or imply a meaning different from their literal meaning, how can we detect that usage in the text? In addition, I am very interested in the "word cloud" mentioned in the text. In an undergrad class, we used a word cloud to present the tendencies of specific topics on student forums. A problem arose while crawling because in Chinese the word for "no" contains the word for "yes" (similar to the relationship between "do not want" and "want", except that there are no spaces in Chinese to separate words). We therefore tried to train the language model to differentiate between the two and avoid counting "no" as an instance of "yes." I'm curious whether there are similar situations in English that need to be avoided, or how to accurately train language models for the Chinese language system.

volt-1 commented 5 months ago

Chapter 11 of Text as Data discusses "discriminating words" in a political context. Given the dynamic evolution of political language, how can researchers tell apart words that reflect real ideological differences from those that are merely stylistic or rhetorical? Also, how do they handle changes in the meaning of words over time? For example, the word "patriotism" might have been used differently during times of war compared to times of peace.

sborislo commented 5 months ago

I find the vector space model to be an intuitive and effective way of mapping word similarities, but how does vectorization work with homonyms? For other text analysis tasks, surrounding context can be used to resolve such issues, but the mapping of individual words ostensibly makes this difficult. Do vectorization methods have ways to functionally map different senses of the same word (e.g., so that bear refers to the animal and bear* refers to the act of carrying something)?