UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter


2. Counting Words & Phrases to Trace the Distribution of Meaning-fundamental #50

Open lkcao opened 7 months ago

lkcao commented 7 months ago

Post questions here for this week's fundamental readings:

Grimmer, Justin, Molly Roberts, and Brandon Stewart. 2022. Text as Data. Princeton University Press: Chapters 5, 7, 9, 11, 16 — "Bag of Words", "The Vector Space Model and Similarity Metrics", "Using Phrases to Improve Visualization", "Discriminating Words", "Word Counting".

cty20010831 commented 7 months ago

As I went through Chapter 11, "Discriminating Words," I initially thought that discriminating words based on distinctiveness and prevalence in order to distinguish two (or more) groups sounds very reasonable, and there are a number of related statistical tests to rigorously test the hypotheses. However, in the "Fictitious Prediction Problems" section, the authors introduce a framework of prediction problems in which we have no direct interest in prediction itself, compare it with supervised classification, and argue that we may prefer models that are less predictive but better at identifying words indicative of particular categories. Hence, my question is: what would be the real academic use of discriminating words?

Twilight233333 commented 7 months ago

In the Discriminating Words chapter, the author introduces a method that, if I understand correctly, determines whether a word is distinctive by comparing its frequency in a certain group with its normal (baseline) frequency. Is this normal frequency itself also changing?

bucketteOfIvy commented 7 months ago

In Chapter 9, Grimmer, Roberts, and Stewart (2022) discuss parts-of-speech (POS) tagging as an option for extracting useful information from text datasets (p. 92). How well do POS tagging models tend to perform on corpora with large amounts of unique jargon on which the model was not trained? For example, on a corpus with a lot of slang in it?
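For concreteness, here is a minimal sketch (not from the reading) of running an off-the-shelf POS tagger on slang-heavy text, assuming spaCy and its small pretrained English model; the example sentence is invented:

```python
# Minimal sketch: POS-tagging informal text with a pretrained spaCy model
# to see how it handles slang it was not trained on.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

text = "That fit is lowkey fire, no cap."  # invented slang-heavy sentence
doc = nlp(text)

for token in doc:
    # token.pos_ = coarse universal POS tag; token.tag_ = fine-grained tag
    print(f"{token.text:10} {token.pos_:6} {token.tag_}")
```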

Dededon commented 7 months ago

I like Chapter 9's practices to seperate sets of words with informational theory and multinomial regression methods. However, I wonder what are the theoretical (or qualitative) concerns to find out which one is the better pair of discrimination. Is it depending on the theoretic construction of the RQ?

anzhichen1999 commented 7 months ago

I’m lost while trying to understand the statistics for 11.3.2 and 11.3.3. How do you use them with the context given? Perhaps the example in the textbook is not clear enough.

ddlxdd commented 7 months ago

In "Bag of Words," the author mentions preprocessing to reduce complexity, such as lowercasing words, removing punctuation, and removing stop words. I am curious about the process of removing punctuation. As mentioned in the chapter, some analyses, such as sentiment analysis, may want to keep certain punctuation marks and emojis. I am wondering: does that mean we do not remove punctuation at all in sentiment analysis, or do we choose certain punctuation marks to remove? And how do we decide which to keep?
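Preprocessing decisions like this are usually implemented as explicit toggles rather than an all-or-nothing step. A minimal sketch (my own illustration, not the book's pipeline) that lowercases, drops stop words, and strips punctuation except for a whitelist of sentiment-bearing tokens:

```python
# Illustrative preprocessing sketch: lowercase, strip punctuation, drop stop
# words, but optionally keep a whitelist of sentiment-bearing tokens.
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}  # toy list
SENTIMENT_TOKENS = {"!", "!!", "?", ":)", ":("}                  # tokens we choose to keep

def preprocess(text, keep_sentiment_tokens=True):
    # crude tokenizer: emoticons, word characters, or runs of ! / ?
    tokens = re.findall(r"[:;][\)\(]|\w+|[!?]+", text.lower())
    cleaned = []
    for tok in tokens:
        if tok in SENTIMENT_TOKENS:
            if keep_sentiment_tokens:
                cleaned.append(tok)
        elif tok.isalpha() and tok not in STOP_WORDS:
            cleaned.append(tok)
    return cleaned

print(preprocess("The movie was great !!"))  # ['movie', 'was', 'great', '!!']
```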

muhua-h commented 7 months ago

Reading through "Using Phrases to Improve Visualization," I have some doubts about the authors' claim that phrases can improve the quality of a word cloud visualization. As shown in Figure 9.2, when phrases are used, one semantic meaning can be expressed in multiple different ways (e.g., "foreign nation" vs. "foreign nations"), which dilutes the visual importance/weight of that meaning. Overall, the word cloud representation is biased toward meanings with few surface forms. I am curious about Dr. Evans's opinion on using word clouds in our research.
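On the mechanics, normalization and phrase detection are usually separate steps: lemmatizing first would collapse "foreign nation" and "foreign nations" into one form before collocations are counted. A rough sketch of collocation detection with gensim's Phrases model (corpus, counts, and threshold are all invented, and the exact cutoff depends on gensim's scorer):

```python
# Sketch: detecting two-word phrases in a toy corpus with gensim.
# Lemmatizing tokens beforehand would merge "nation"/"nations".
from gensim.models.phrases import Phrases

sentences = [
    ["foreign", "nation", "policy", "debate"],
    ["foreign", "nation", "relations"],
    ["health", "care", "reform"],
    ["health", "care", "spending"],
] * 10  # repeat so the toy corpus clears min_count

phrases = Phrases(sentences, min_count=5, threshold=0.5)  # illustrative cutoffs
print(phrases[["foreign", "nation", "policy"]])  # likely ['foreign_nation', 'policy']
```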

sborislo commented 7 months ago

I find the vector space model to be an intuitive and effective way of mapping word similarities, but how does vectorization work with homonyms? For other text analysis tasks, surrounding context can be used to resolve such issues, but the mapping of individual words ostensibly makes this difficult. Do vectorization methods have ways to functionally map different versions of the same word (e.g., bear references the animal and bear* references the carrying of something)?

XiaotongCui commented 7 months ago

Chapter 5 mentions converting words to lowercase. However, I believe this may cause some problems, since some words have different meanings when capitalized and uncapitalized; for example, Polish and polish, or Turkey and turkey.

yuzhouw313 commented 7 months ago

Chapter 11 discussed the idea of discriminating words in documents, the trade-off between distinctiveness and prevalence, and some statistical examples. However, after reading the chapter I still find it difficult to conceptualize "discriminating words." Based on my understanding, such words are distinctive in that they can be used to predict the documents, or the categories of the documents, from which they are extracted. While this chapter touched on the trade-off between distinctiveness and prevalence and emphasized the challenge of finding words that are both unique to a category and common enough to provide meaningful insights, how can we measure and validate the distinctiveness of discriminating words? Specifically, how can we incorporate what we read about word counting, bag of words, word embeddings, etc. into this task?

joylin0209 commented 7 months ago

After reading Chapter 9: if some words involve figures of speech or imply a meaning different from their literal meaning, how can we detect that usage in the text? In addition, I am very interested in the "word cloud" mentioned in the text. In an undergrad class, we used a word cloud to present the tendencies of specific topics on student forums. During the crawling process we ran into a problem: in Chinese, the negated form of a word contains the affirmative form (similar to the relationship between "do not want" and "want"), and there are no spaces to mark word boundaries. Therefore, we tried to train the language model to accurately differentiate between the two and avoid counting the affirmative word whenever it appeared inside the negated one. I'm curious whether there are similar situations in English that need to be avoided, and how to accurately train language models for the Chinese language system.

GuangjieXu commented 7 months ago

In Chapter 9, regarding dependency parsing, I understand that this analysis provides information on the structural relationships between words, which is highly beneficial for various language processing tasks. For instance, when conducting sentiment analysis, grasping the dependency relationships between words can lead to a more accurate assessment of whether the sentiment expressed in a text is positive, negative, or neutral. However, I am not entirely clear on how dependency parsing aids in revealing the interrelations between different entities and concepts when identifying semantic relationships between actors (such as people, organizations) and issues within a text.
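One concrete way dependency parsing feeds into actor-issue extraction is by reading subject-verb-object paths off the parse tree. A minimal sketch with spaCy (the sentence and model name are my own illustration, not the book's example):

```python
# Sketch: pulling (subject, verb, object) triples from a dependency parse,
# a crude stand-in for extracting who-did-what-to-which-issue relations.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The senator criticized the healthcare bill in a speech.")

for token in doc:
    if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
        verb = token.head
        objects = [c for c in verb.children if c.dep_ in ("dobj", "obj")]
        for obj in objects:
            print(token.text, verb.lemma_, obj.text)  # e.g. senator criticize bill
```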

naivetoad commented 7 months ago

In Chapter 11, the authors mention the fighting words method. They frame it as a probabilistic model of word use that can account for the different levels of uncertainty that come with words of different frequencies. I wonder what exactly these levels of uncertainty are that the authors are referring to.
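My reading is that the uncertainty is the sampling variance of the estimated log-odds difference: rare words are observed few times, so their estimated difference between groups is noisy even when the ratio looks dramatic. A rough sketch of that calculation in the style of the fighting-words model (my own rendering of the formula, with made-up counts):

```python
# Sketch (illustrative, not the book's code): log-odds difference for one word
# with Dirichlet smoothing, plus its approximate variance. Rare words get
# large variances, so their z-scores shrink toward zero.
import math

def fighting_words_z(y1, n1, y2, n2, alpha=0.01, alpha0=1.0):
    """y1, y2: counts of the word in groups 1 and 2; n1, n2: total tokens per group."""
    delta = (math.log((y1 + alpha) / (n1 + alpha0 - y1 - alpha))
             - math.log((y2 + alpha) / (n2 + alpha0 - y2 - alpha)))
    variance = 1.0 / (y1 + alpha) + 1.0 / (y2 + alpha)  # approximate sampling variance
    return delta / math.sqrt(variance)                   # z-score

# A frequent word with a modest difference vs. a rare word with a huge ratio:
print(fighting_words_z(y1=300, n1=10_000, y2=200, n2=10_000))  # stable, z around 4.6
print(fighting_words_z(y1=3,   n1=10_000, y2=0,   n2=10_000))  # noisy, z below 1
```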

chanteriam commented 7 months ago

In "The Vector Space Model and Similarity Metrics," the author discusses the use of different measures of distance, including the Euclidean and Manhattan distances. Though I understand the use cases for the Euclidean distance, I am unsure what the authors mean by the Manhattan distance being "more robust to large differences between counts of individual words." What would be a mathematical example of this type of computation, and when would we want to use the Manhattan over the Euclidean distances?
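A small numeric example (mine, not the chapter's) of what "more robust" means here: because Euclidean distance squares each difference, a single word with a very large count gap dominates the total, while Manhattan distance treats every difference linearly:

```python
# Toy count vectors (illustrative): the documents differ slightly on most
# words, but doc_b uses one word far more often.
import numpy as np

doc_a = np.array([10, 8, 3, 2])
doc_b = np.array([9, 7, 50, 3])

euclidean = np.sqrt(np.sum((doc_a - doc_b) ** 2))  # about 47.0
manhattan = np.sum(np.abs(doc_a - doc_b))          # exactly 50

# The outlier word accounts for 47**2 / 2212, roughly 99.9% of the squared
# Euclidean distance, but only 47 / 50 = 94% of the Manhattan distance.
print(euclidean, manhattan)
```

So Manhattan distance is often the safer choice when a few high-frequency words should not swamp the comparison.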

ana-yurt commented 7 months ago

I have a question about Chapter 9 (POS). I wonder if Dependency Parsing in 9.4 is enough for the capturing of features such as literary style and tone. What other text-processing techniques might be warranted for these tasks?

erikaz1 commented 7 months ago

I'm still reviewing matrix algebra and the matrix-ification(?) of stat concepts. What does the following conclusion on page 73 (Vector Space model chapter) tell us, with regard to document similarity: "While the inner product between documents 1 and 2 and documents 2 and 3 are different, the cosine similarity is the same." What makes the vector space model superior to the bag of words model when analyzing the Federalist papers?
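A toy version of that page-73 point (my own vectors, not the book's): if one document is just another document written at twice the length with the same word proportions, the inner products differ but the cosine similarity does not, which is why cosine is usually preferred for documents of different lengths.

```python
# Illustrative: scaling a document changes the inner product, not the cosine.
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

doc1 = np.array([2, 1, 0])
doc2 = np.array([1, 2, 1])
doc3 = 2 * doc2  # same word proportions, double the counts

print(np.dot(doc1, doc2), np.dot(doc1, doc3))  # inner products: 4 vs 8
print(cosine(doc1, doc2), cosine(doc1, doc3))  # cosine similarity: identical
```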

Vindmn1234 commented 7 months ago

I'm curious: for what kinds of tasks is it better to remove stop words, and for what tasks do we need to keep the original corpus intact before feeding it to the model?

donatellafelice commented 7 months ago

In Chapter 7, on page 73, when discussing the cosine similarities of Hamilton and Madison to the disputed paper, the authors say the cosine similarity for Madison is 0.995 and for Hamilton is 0.918. They then say that Madison is "much more similar." Given that the scale of cosine similarity runs from 0 to 1, it seems surprising that this would be considered much more similar. How do you determine what counts as a small vs. a big difference?

runlinw0525 commented 7 months ago

From Chapter 11 of the book, I feel that identifying discriminating words can be useful for data cleaning. I also wonder, when using discriminating-word methods in text analysis, especially in areas like politics or gender studies, how important it is to strike the right balance between how unique a word is and how often it is used within a group.

ethanjkoz commented 7 months ago

Perhaps I missed it somewhere, but a lingering question I have is how ought we go about discerning homonyms (words that are spelled the same but have different meanings)? Though I do not see this as a large issue, there could be some cases where discerning between like words and arriving at specific contexts could be important.

Caojie2001 commented 7 months ago

Parts-of-speech (POS) tagging is introduced in Chapter 9 as an assistive technology for content analysis, and the chapter describes its role in sentiment analysis and meaning extraction. My question is: how do we make sure that identical words with different parts of speech are correctly assigned? If dependency parsing is used to aid this process, how can it be done for languages with more confusing ordering rules, such as Chinese?

beilrz commented 7 months ago

A question I had after reading the "Bag of Words" and "Vector Space Model" chapters is how you determine which words to include in the model and the length of the word units: similar to a question above, what preprocessing of the words should you do for a given research question, and what are some factors to consider? Furthermore, when are methods that only consider individual words or short phrases (bag of words, n-grams, ...) appropriate to use? Are they only for simple research questions (for example, topic frequency), or can they be used to address complex semantic questions?
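In practice most of these vocabulary decisions show up as parameters of the document-term-matrix builder. A sketch with scikit-learn's CountVectorizer (the parameter values are illustrative, not recommendations):

```python
# Sketch: vocabulary choices as CountVectorizer parameters (values illustrative).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the senate passed the bill", "the house debated the health care bill"]

vectorizer = CountVectorizer(
    lowercase=True,        # fold case
    stop_words="english",  # drop a standard stop-word list
    ngram_range=(1, 2),    # keep unigrams and bigrams such as "health care"
    min_df=1,              # drop terms appearing in fewer than min_df documents
)
dtm = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```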

alejandrosarria0296 commented 7 months ago

Taking into consideration the method-agnostic approach, I assume that there is not one unique method for grouping similar words under a stem. How would different theoretical approaches change this process?
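Off-the-shelf options do differ in how aggressively they conflate forms, which is one place theory can enter. A quick comparison with NLTK (the word list is illustrative; the lemmatizer needs the WordNet data downloaded):

```python
# Sketch: two stemmers and a lemmatizer applied to the same words.
# Requires: nltk.download("wordnet") for WordNetLemmatizer.
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

words = ["organizations", "organizing", "organized", "universities"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

for w in words:
    print(w, porter.stem(w), snowball.stem(w), lemmatizer.lemmatize(w))
```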

yunfeiavawang commented 7 months ago

Most of the techniques introduced in these chapters are widely used for English and other languages written in the Latin alphabet, in which words are naturally separated by spaces. For other languages such as Chinese, Japanese, etc., tools like word segmenters may be necessary before analyzing language chunks. Could we cover some resources for these tools in class? Do these tools perform well in NLP tasks?
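For Chinese specifically, word segmentation usually precedes any bag-of-words step. A minimal sketch with the jieba library (the example sentence is my own):

```python
# Sketch: Chinese word segmentation with jieba before counting words.
# pip install jieba
import jieba

sentence = "我不要去动物园"  # "I do not want to go to the zoo"
tokens = jieba.lcut(sentence)
print(tokens)               # e.g. ['我', '不要', '去', '动物园']
```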

Marugannwg commented 7 months ago

I'm very interested in Chapter 11's various ways of discriminating words, and in the fictitious prediction problems used to pinpoint the features that characterize a particular group. I'm pondering this topic with the following example (and I'm a little confused):

Assume I have two sources of texts: one consists of advertisements for zoos, and the other of advertisements for amusement parks (the key is that they differ in subject), and I want to learn about differences in strategy or tone. By discriminating words, what exactly can I achieve here?

Quote from p. 117: "Fictitious prediction problems also highlight a potential risk when identifying discriminating words. We set up the problem to identify words that are associated with a particular category, but there might be other characteristics of the document correlated with the category."

YucanLei commented 7 months ago

I don't understand how to preprocess text, particularly when it is in a language that is not Latin-based or Germanic.

LyuZejian commented 7 months ago

In Chapter 7, the author promotes the strength of the proposed model in that it does not rely on outside knowledge. I am interested in how we could strengthen the model by adding knowledge, for example by using genre-specific stop words or placing weights on word tokens to vary their emphasis. Adding text corpora to vectorize might also be a promising method, and introducing experts to annotate or prioritize words, tokens, or documents is also worth exploring.
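Two of those ideas, genre-specific stop words and re-weighting tokens, are easy to prototype; a sketch using scikit-learn's TF-IDF vectorizer with an extended stop-word list (the added stop words are invented for illustration):

```python
# Sketch: injecting domain knowledge via a custom stop-word list plus TF-IDF
# weighting (the genre-specific stop words are illustrative).
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

domain_stop_words = list(ENGLISH_STOP_WORDS) + ["plaintiff", "defendant"]

vectorizer = TfidfVectorizer(stop_words=domain_stop_words)
tfidf = vectorizer.fit_transform([
    "the plaintiff filed a motion",
    "the defendant denied the claim",
])
print(vectorizer.get_feature_names_out())
```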

chenyt16 commented 7 months ago

In Chapter 7, the author introduces how to use the vector space model to measure similarities across documents. The methodology is based on the number of words the documents share. I question whether it works in the following two conditions, and if it doesn't, how we can improve our analysis: (1) the number of shared words is relatively small, but the meanings of the differing words are similar; for example, a popular-science article targeting children and one targeting adults may use different wording or terminology but deliver similar information. (2) The methodology may falter on texts that share similar content but express opposing attitudes, for example negative versus positive film reviews.

HamsterradYC commented 7 months ago

For chapter 11, the concept of 'Discriminating Words' is used to identify words characteristic of certain document categories, like an author's gender or political affiliation. It's mentioned that there's a tension between words distinctive between groups and words prevalent within a group. Given this, how do the proposed methods balance the trade-off between a word's distinctiveness and its prevalence?

XiaodiYang2001 commented 7 months ago

I think this is a very good book that has helped me learn text analysis methods, such as how to judge the linguistic characteristics of a group through word frequency. I love its example of using discriminating-word algorithms to discover the cognitive frameworks of feminist movements in the United States, which is highly related to my research project. But the book involves a lot of mathematical knowledge; to be honest, because of these mathematical formulas, there are many places where I don't quite understand the specific data analysis steps. My question is: for different languages, how do we address semantic change and loss during the translation process?

michplunkett commented 7 months ago

I thoroughly enjoyed the nuance presented by the stop-word sections in the reading. When I look at my own writing, I see a handful of stop-words getting used at a higher frequency than others, and I am aware that those would be markers for what writing is mine and is not. Something I am curious about, and this may apply to all academic research: given the many ways to do text analysis and the various angles one could take, how do you decide when enough analysis is enough? It feels like it'd be easy and valid to get wrapped up in the notion that you need to apply EVERY means of analysis before coming to an answer for your hypothesis, but that wouldn't be sustainable for one's career to progress meaningfully.

Carolineyx commented 7 months ago

I really enjoyed learning about the detailed and different ways that we can analyze words. At the same time, I'm curious: what if we want to study the meaning of words in context, where the relevant context may lie in the sentences that follow the word (or phrase)? We're not sure how many sentences there are, or even whether they're connected together. How can we apply these methods to prepare and analyze that kind of data?

yueqil2 commented 7 months ago

I have to say that Chapter 9 is fascinating for a political scientist, whether for its conceptual tools or its examples of application. The series of related tasks in this chapter can extract facts and even events involving various entities, which really matters in our discipline. But can they handle information extraction about political intention? If they can, how is it done, and what is the difference between extracting facts and extracting intentions?

QIXIN-LIN commented 7 months ago

After reading a part of Chapter 9, which focuses on Named Entity Recognition (NER), I'm wondering whether NER can be effectively applied to tag entities beyond the standard categories of people, organizations, and locations? Specifically, I'm interested in its application to unstructured job-posting analysis. Additionally, what is the efficacy and availability of NER tools in a multilingual context?
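One lightweight way to go beyond the stock categories is spaCy's rule-based EntityRuler layered on top of the statistical NER. A sketch for the job-posting case (the custom label and patterns are my own invention):

```python
# Sketch: adding a custom entity type ("SKILL") on top of spaCy's pretrained NER.
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "SKILL", "pattern": "machine learning"},
    {"label": "SKILL", "pattern": "SQL"},
])

doc = nlp("Acme Corp seeks an analyst in Chicago with SQL and machine learning.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. ORG, GPE, plus the custom SKILL entities
```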

Brian-W00 commented 7 months ago

How do these techniques handle the complexity and subtleties of language, especially in texts with rich cultural nuances or technical jargon? Furthermore, what are the implications of these methods for identifying and interpreting underlying themes or sentiments in large datasets, and how do they address the challenges of linguistic diversity and evolving language use in digital communication?

floriatea commented 6 months ago

Given that vector space models are fundamentally linear, how might they be adapted or extended to better capture non-linear linguistic phenomena, such as humor, metaphor, or sarcasm?

JessicaCaishanghai commented 5 months ago

In the context of language models and text analysis, how can we effectively differentiate between words that contain figures of speech or imply a meaning different from the original meaning, and those that do not, especially in languages like Chinese where words may be combined without spaces?