Method:
1) Do tf-idf on the boilerplate (or just frequencies):
tfidf.fit(....)
tfidf.transform(....)
2) Run k-means clustering (e.g., k=10) on the feature matrix produced by tfidf.transform(....)
3) For each cluster, list the 10 most common words and try to label the cluster
*The tf-idf and word-frequency code for the models is in evergreen_without_raw_data.py; it's very easy to use. I don't know how to do k-means.
Try to come up with our own categories.