Computational-Content-Analysis-2018 / 19-Jan-Flat-Clustering

Manning, Christopher, Prabhakar Raghavan and Hinrich Schütze. 2008. “Flat Clustering” and “Hierarchical Clustering.” Chapters 16 and 17 from Introduction to Information Retrieval.
https://github.com/Computational-Content-Analysis-2018

The limits of meaningful similarity and dissimilarity #1

Open TimothyElder opened 6 years ago

TimothyElder commented 6 years ago

In one of this week's readings, the authors mention that a text typically contains about 3,500 unique words/tokens after stemming and dropping punctuation and extremely frequent or infrequent words. Given that, and given the writing conventions people absorb through general education and cultural influence, I am curious how meaningful measured similarity and dissimilarity between texts can be. Certainly we can compare word distributions and use t-tests or chi-squared tests to see whether they are significantly different, but what do unsupervised clustering methods treat as the relevant characteristics of texts that differentiate or unify them? There doesn't seem to be much discussion of this in the Flat Clustering and Hierarchical Clustering readings.
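To make the question concrete, here is a minimal sketch of what I take the usual setup to be. I'm assuming documents are represented as tf-idf weighted term vectors and compared with cosine similarity; the toy documents, the whitespace tokenizer, and the function names `tfidf_vectors` and `cosine` are my own for illustration, not from the readings.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    # Assumption: lowercase + whitespace split stands in for real tokenizing/stemming.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: two documents on one topic, one on another.
docs = [
    "the court ruled on the appeal",
    "the judge ruled on the appeal",
    "the team won the match",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # shares "ruled", "on", "appeal"
sim_02 = cosine(vecs[0], vecs[2])  # shares only "the"
print(sim_01, sim_02)
```

If this is roughly right, then a word like "the" that appears in every document gets zero weight, and the only thing the clustering "sees" is which less common terms two documents share. That is what prompts my question: with a vocabulary this small and shared conventions this strong, is overlap in term weights enough to call two texts meaningfully similar?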