Method:
1) Do tf-idf on the boilerplate (or just frequencies):
tfidf.fit(....)
tfidf.transform(....)
2) Run k-means clustering (e.g., k=10) on the feature matrix produced by tfidf.transform(....)
3) For each cluster, list the 10 most common words and try to label the cluster
*The tf-idf and word-frequency code for the models is in evergreen_without_raw_data.py; it's very easy to use. I don't know how to do k-means.
Try to come up with our own categories.