The limits of meaningful similarity and dissimilarity

In one of the readings for this week, the authors mention that the typical number of unique words/tokens in a text (after stemming and dropping punctuation and extremely frequent or infrequent words) is about 3500. Given this, and given the writing conventions that people absorb through general education and cultural influence, I am curious how meaningful similarity and dissimilarity between texts can be. We can certainly compare word distributions and use t-tests or chi-squared tests to see whether they differ significantly. But which characteristics of a text do unsupervised clustering methods treat as relevant when they separate or group documents? There doesn't seem to be much discussion of this in the Flat Clustering and Hierarchical Clustering readings.
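To make the question concrete, here is a minimal sketch of the two kinds of comparison I have in mind. It is my own illustration and not taken from the readings: the function names (`word_counts`, `chi_squared`, `cosine_similarity`) are hypothetical, tokenization is just lowercasing and splitting on whitespace (no stemming or frequency filtering), and I am assuming that a clustering method would see each text only as a vector of word counts.

```python
from collections import Counter
import math


def word_counts(text):
    """Bag-of-words counts; assumes simple lowercase whitespace tokenization."""
    return Counter(text.lower().split())


def chi_squared(counts_a, counts_b):
    """Pearson chi-squared statistic for a 2 x V table of word counts.

    Rows are the two texts, columns are the words in the joint vocabulary.
    Returns (statistic, degrees_of_freedom); no p-value is computed here.
    """
    vocab = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand_total = total_a + total_b
    stat = 0.0
    for word in vocab:
        column_total = counts_a[word] + counts_b[word]
        for observed, row_total in ((counts_a[word], total_a),
                                    (counts_b[word], total_b)):
            expected = row_total * column_total / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat, len(vocab) - 1


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between the two word-count vectors."""
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b)


# Example usage with two toy "documents" (made up for illustration).
doc_a = word_counts("the model clusters the documents")
doc_b = word_counts("the documents form clusters")
stat, dof = chi_squared(doc_a, doc_b)
similarity = cosine_similarity(doc_a, doc_b)
```

Under that assumption, word order, syntax, and style would only matter to the extent that they change the counts, which is part of why I wonder how much of the measured similarity is meaningful.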
