jamesallenevans opened 4 years ago
It is mentioned on page 184 that one method for examining whether a combination is a collocation is to translate the combination into another language and see whether the word-for-word translation persists. An example is translating "make a decision" into French, where the word-for-word translation does not make sense. However, there are so many languages in the world that I believe there is always some language in which the translation of "make a decision" makes sense (for example, Chinese), and likewise there are always languages in which it does not. So could you please explain this point a little more?
I'm interested in the method that uses the mean and variance of the distance between words to identify collocations. On page 159, the authors state that a low deviation in the distance between a pair of words is an indicator of a collocation. I'd like to better understand how a researcher can determine what counts as a "low" versus a "high" deviation. Are there experiments or simulations that can be run to determine an optimal threshold?
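For concreteness, the mean/variance idea can be sketched in a few lines of Python. This is my own minimal illustration, not the chapter's code: it assumes a plain list of tokens, and the function name and default window size are my choices. A low standard deviation of the offsets suggests the pair tends to appear at a fixed relative position, which is the signal the chapter describes.

```python
from statistics import mean, pstdev

def offset_stats(tokens, w1, w2, window=5):
    """Collect signed offsets of w2 relative to each occurrence of w1
    within a +/- window, then return (count, mean offset, std. deviation)."""
    offsets = []
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == w2:
                offsets.append(j - i)
    if len(offsets) < 2:
        return len(offsets), None, None
    return len(offsets), mean(offsets), pstdev(offsets)
```

On a toy corpus where "strong" is always immediately followed by "tea", the mean offset is 1 and the deviation is 0; a rigid threshold separating "low" from "high" deviation is exactly the open question above.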
Based on the article's use of non-compositionality, non-substitutability, and non-modifiability as criteria for a collocation, it seems that the frequency, mean-variance, and mutual-information methods increase the likelihood of finding a collocation but could ultimately surface words that are merely often used together. Is this a challenge in practice? If so, are there practices to address it?
I have a question about how mutual information is defined using the probability of collocation occurrence, and about a few other definitions in the preceding sections (e.g., on p. 173). Why do the probabilities here take the form of a logarithm? Is the log taken to capture the rate of increase in probability, or is it aimed at capturing a certain type of distribution?
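On the log question: pointwise mutual information is the log of the ratio between the observed joint probability of the pair and the probability expected under independence, so the log makes zero the "independence" baseline (positive means the words co-occur more than chance) and turns products of probabilities into sums. A minimal sketch estimating it from bigram counts; the helper name is my own and the maximum-likelihood estimates are a simplification:

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent bigram (w1, w2):
    log2 of P(w1, w2) / (P(w1) * P(w2)), estimated from raw counts."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_joint = bigrams[(w1, w2)] / (n - 1)
    p1, p2 = unigrams[w1] / n, unigrams[w2] / n
    return math.log2(p_joint / (p1 * p2))
```

Note the chapter's caveat that sparse counts make this estimate unstable for rare words.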
Though this chapter has made it clear that collocations are important and qualitatively different from other 2-grams, it seems that in most computational content analysis applications in the social sciences, researchers do not distinguish collocations from other phrases. Would it be a serious problem to base analyses on n-grams that do not account for the fact that some may not be compositional?
The chapter motivates the analysis of collocations by citing Firth's Contextual Theory of Meaning, noting the benefit of viewing words in context rather than in isolation. This seems to mirror literary theory: the difference between New Criticism's close reading and structuralism feels akin to the difference between analyzing individual words and analyzing collocations. Are there other applications in which computational linguistics reflects common literary-critical theories well? Are applications of post-structuralism and reader-response theory with computational linguistics more common in sociological contexts?
I think this chapter is very informative in explaining the concept of collocations, but I'm also wondering which statistical inference test is most useful in which cases. How should we evaluate empirically which collocation cases require which hypothesis test?
A collocation is not compositional: the meaning of the expression cannot be predicted from the meanings of its parts. Collocations are detected through word counting; if two words occur together a lot, they may have a special function. I think the game-changer in this research method is the simple heuristic proposed by Justeson and Katz. I would like to know more about this part-of-speech filter, which only lets through those tag patterns that are likely to be phrases.
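As I understand it, the Justeson and Katz filter simply keeps candidate n-grams whose part-of-speech tag sequence matches a short list of patterns (adjective/noun combinations, plus noun-preposition-noun). A rough sketch, with tags already collapsed to single letters; the pattern list follows the chapter's table, but the function and data layout are my own illustration:

```python
import re

# Justeson & Katz tag patterns for likely terminological phrases:
# A = adjective, N = noun, P = preposition.
# Allowed sequences: A N, N N, A A N, A N N, N A N, N N N, N P N.
JK_PATTERN = re.compile(r"^(AN|NN|AAN|ANN|NAN|NNN|NPN)$")

def jk_filter(tagged_ngrams):
    """Keep only n-grams whose POS-tag sequence matches a J&K pattern.
    `tagged_ngrams` is a list of [(word, tag), ...] sequences with
    tags already collapsed to 'A', 'N', or 'P'."""
    kept = []
    for ngram in tagged_ngrams:
        tags = "".join(tag for _, tag in ngram)
        if JK_PATTERN.match(tags):
            kept.append(" ".join(word for word, _ in ngram))
    return kept
```

So "strong tea" (A N) and "degrees of freedom" (N P N) pass, while determiner-initial bigrams like "the tea" are filtered out, which is what makes the heuristic so effective despite its simplicity.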
This chapter is very illuminating because it provides a lot of useful information and background for understanding what Gentzkow and Shapiro were doing in their (wonderful) paper, especially the text-processing part.
I was trying to figure out how the authors came up with the top 1,000 phrases before calculating the χ² statistic. Did they use the frequency-based method (p. 153) or the hypothesis-testing method (the χ² test, p. 169) mentioned in the chapter? I think what they were doing is essentially determining which collocations used by Republicans/Democrats are most likely to indicate partisanship. Unfortunately, the paper and its appendix do not mention how they generated the candidate phrases before calculating the χ² statistic that was used to select the top 1,000. Did they just generate bigrams and trigrams and count how many times each occurs? That seems likely to me. Then they used these counts for the χ² test. So, if this is true, they used the frequency-based method and, on top of that, used χ² to winnow the phrases! One way to find out is to look at their code, but it is in .do files (requiring Stata, for which I have no local license), and I'm too lazy to use Stata via UChicago. If other folks try it out, I'd be curious to know.
I don't quite understand when a noun phrase (like the "strong tea" example) should actually be considered a collocation rather than simply the meanings of its individual words brought together in a phrase. The t-test example with "new companies" is illustrative here. Why do we consider "strong tea" a collocation, while "new companies" is understood as completely compositional?
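For what it's worth, the chapter's t-test treats each bigram position as a Bernoulli trial: the sample mean is the observed bigram probability c(w1 w2)/N, the null mean is P(w1)P(w2) under independence, and the variance is approximated by the sample mean. A minimal sketch (function name mine), with the counts from the chapter's "new companies" example as I recall them:

```python
import math

def t_score(c1, c2, c12, n):
    """t statistic for bigram (w1, w2) under the chapter's setup:
    sample mean x_bar = c12/n, null mean mu = (c1/n)*(c2/n),
    and variance approximated by x_bar (Bernoulli, small p)."""
    x_bar = c12 / n
    mu = (c1 / n) * (c2 / n)
    return (x_bar - mu) / math.sqrt(x_bar / n)

# Chapter's example counts: c(new)=15828, c(companies)=4675,
# c(new companies)=8, N=14307668 -> t is about 1, far below the
# 2.576 critical value, so independence cannot be rejected.
t_new_companies = t_score(15828, 4675, 8, 14307668)
```

That is exactly why "new companies" comes out compositional under the test, whereas a pair like "strong tea" co-occurs far more than its unigram frequencies predict.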
I am wondering whether it is possible to construct a word network based on this t-statistic collocation relationship, and to what extent network-analysis techniques make sense on such a network. I am also curious whether one can derive more comprehensive knowledge about the topic of a target article as a whole based on bigrams or n-grams.
This chapter introduces several methods of hypothesis testing for collocation analysis. I have the same question as @sanittawan. As far as I can tell, Gentzkow and Shapiro (2010) use the frequency-based method, and they use a $\chi^2$-test for the null hypothesis that "the propensity to use phrase p of length l is equal for Democrats and Republicans". Recalling that the $\chi^2$-test measures the difference between observed and expected frequencies, the expression for $\chi^2$ given by Gentzkow and Shapiro (2010) looks much like equation (5.6) in this chapter. However, when I substituted $O$ and $E$ in (5.6) with the corresponding phrase frequencies from Gentzkow and Shapiro (2010), I did not recover their equation; the denominator is slightly different. I look forward to seeing the answer.
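For reference, for a 2×2 contingency table the Pearson $\chi^2$ statistic has a closed form in the four cell counts, which is the shortcut the chapter derives from the observed-vs-expected sum; Gentzkow and Shapiro's per-phrase statistic looks like this form applied to phrase counts by party, and a denominator mismatch could come from whether the leading N and the marginals are written in raw counts or in shares (just a guess on my part). A minimal sketch of the closed form:

```python
def chi2_2x2(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 table via the closed form:
    N * (O11*O22 - O12*O21)^2 divided by the product of the
    four marginal totals."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# The chapter's "new companies" table (cells as I recall them):
# O11 = 8 (new companies), O12 = 15820 (new, not companies),
# O21 = 4667 (companies, not new), O22 = 14287173 (neither).
chi2_new_companies = chi2_2x2(8, 15820, 4667, 14287173)
```

With those cells the statistic comes out around 1.55, again failing to reject independence for "new companies".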
This article demonstrates several methods for finding collocations in a text corpus, such as frequency-based discovery, variance-based discovery, and hypothesis testing with multiple statistics. I have a question about whether all of these methods yield consistent conclusions. If there is any inconsistency, how should we evaluate their results?
This article is very informative about how to find collocations in content analysis. Given that it presents many methods for discovering collocations, the question I am most concerned with is how to evaluate the results of these methods and combine what they yield to obtain the best result.
Chapter 5 gives an overview of a number of ways to look for collocations in texts. I'm wondering how the distinction between collocations and compositional expressions can be applied to n-grams in content analysis. That is, how can we use n-grams to identify a collocation whose meaning is not the sum of its parts?
This reading explains how to account for collocations. I was interested in the fact that collocations usually cannot be replaced by synonyms. The reading says that people describe drugs as "powerful" but not tea or coffee. While this makes sense, I was wondering whether these firm collocational bonds would change with exposure to ESL speakers. English is spoken not only by Americans but also by people all over the world. As people are more exposed to English spoken by ESL speakers, the rules for collocations may weaken.
I am generally interested in actual practice or research that uses collocations to infer useful social information; I still feel uncertain about applying them to social science research.
So I'm new to collocation, but considering our Digitized Books orienting reading, would this perhaps be an appropriate method for looking at innovation in language? Perhaps it would be appropriate to construct a longitudinal dataset and see what n-grams moved from compositional to collocational?
The several ways of discovering collocations are very interesting. As @rkcatipon noted above, I wonder whether we can apply these methods to the Digitized Books dataset to find when and how a collocation emerged (whether from one origin or from several origins simultaneously, etc.).
I think it is interesting to use methods like mean-variance detection to find collocations in text. I wonder whether this method could be applied to discovering new collocations that develop in people's conversations; I think this would be really useful for the analysis of internet language. Every year, people coin new collocations to express their meanings vividly (quite common in China). One concern is that users often write very informally online (abbreviations, emoticons, unusual formatting), so how to adapt the general method to an online-language setting would need to be considered first. But I still think this would be an interesting topic to explore in online communities.
Post questions here for:
Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press: selections from Chapter 5 (“Collocations”): 151-163, 172-176, 183-186.