ouverz opened this issue 8 years ago
Thanks @ouverz! In your example `cluster 0: word1, word2, word1, word3, word4`, the two `word1` instances are probably actually different forms of the word, but when you look them up in the dataframe it returns the first match on the index. Think of the dataframe as a dictionary akin to:
```python
{ 'run': 'running',
  'run': 'runner',
  'runs': 'runs' }
```
If you use this dictionary to look up the stem `run`, it could actually have been any of the three. To check whether this is the case, try changing the tokenizer from `tokenize_and_stem` to `tokenizer_only`. You'll also need to change the print statement so that you don't look up stems in the `vocab_frame` (e.g. `vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0]` should become something like just `terms[ind]`).
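To make the first-match behavior concrete, here is a minimal sketch of the lookup described above. The words and stems are made up for illustration, and it uses `.loc` (the modern equivalent of the tutorial's deprecated `.ix`):

```python
import pandas as pd

# vocab_frame maps stems back to the original words they came from.
# Duplicate index labels are exactly the problem: a lookup matches every
# row with that label, and taking [0][0] silently picks the first form.
vocab_frame = pd.DataFrame({"words": ["running", "runner", "runs"]},
                           index=["run", "run", "runs"])

# Looking up the stem 'run' matches two rows...
print(vocab_frame.loc["run"])

# ...but the tutorial's access pattern keeps only the first match:
first = vocab_frame.loc["run"].values.tolist()[0][0]
print(first)  # 'running' -- could just as well have been 'runner'
```

So two visually identical stems printed for a cluster can correspond to two different surface words.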
For 2. and 3. I'm not 100% sure I follow, but I'd be happy to take a look at the code if you can post a gist along with a sample of what the data looks like. Let me know if this helps!
Thanks for the info @brandomr!

Here are the clustered words, both stemmed & tokenized and tokenized only:

Stemmed:
- Cluster 0 words: departure, during, during, renting, machine
- Cluster 1 words: apartment, renting, fee, room, flat
- Cluster 2 words: room, deposit, security, security, located

Non-stemmed:
- Cluster 0 words: [u'room'], [u'deposit'], [u'security'], [u'security', u'deposit'], [u'located']
- Cluster 1 words: [u'departure'], [u'dryer', u'gas'], [u'machine', u'dryer'], [u'gas', u'cooker'], [u'oven', u'washing']
- Cluster 2 words: [u'apartment'], [u'rent'], [u'fee'], [u'flat'], [u'caution']
Please find the code and sample data below.

Code gist: https://gist.github.com/ouverz/bdf6b0f537726ae03161

Sample data:

> Hi The room is 18m2 each, very clean with a fully equipped kitchen,living room,private bathroom and toilet for each. There is washing machine, dish washer, including internet. The room is furnished with a double size mattress, reading table and closet . It is located in central neighborhood. The room is 425 Euros plus a security deposit of 225 Euros While the entire flat cost 910 Euros plus a security deposit of 450 Euros The security deposit is refundable at the termination of the lease. Hope to hear from you Etan
@ouverz sorry for the delay. Looking at your sample and the resultant clusters, it looks like you have pretty homogenous documents, which will have significant overlap. Your clusters are going to be impacted by the parameters you provide to the `TfidfVectorizer` here.
You might want to tune the `min_df` and `max_df` parameters. For example, I might try decreasing both so that the words you use as features are more "unique" to the clusters. If you set `min_df` to a whole number (e.g. 2), it means the word (feature) must exist in at least 2 documents in the corpus; if set to 0.1, it means the feature must occur in at least 10% of the documents. With the parameters as you've set them, you require the features to exist in no less than 20% of documents but no more than 80%. Given your corpus is pretty homogenous, you're going to see pretty similar clusters.
Thanks for your response. I apologise for getting back to this only now. What is happening with my model and data is quite odd. While it seems that the data is homogenous somehow unsupervised methods produce a near perfect separation between two classes which is mind boggling at this point. One point that stands out is the sheer drop in features from over 1000 to under 50 when I optimise between min_df 0.1 to 0.2 respectively. Then the accuracy goes from under 70% to over 90%. I had two questions - do you know what the difference is between using the 'tokeniser' parameter in the TfidfVectorizer and using 'analyser = word' there? It seems as if both are creating vectors or perhaps I do not understand what the analyser does. Additionally, do you know of any ability to extract variable/feature importance from these classifiers? So if i am running an SVM or NB? Is there some feature extraction module to understand the important features from these models?
Thanks a lot!
That does sound pretty intriguing. As for the number of features dropping when you increase `min_df`: that suggests to me that you have a significant number of features that occur in between 10% and 20% of the documents. You might look at word frequencies to verify this.
As far as the `analyzer` parameter: I actually haven't touched that. Looking at the docs, it appears that the `tokenizer` param only comes into play if `analyzer == 'word'`. Since `analyzer` has a default value of `'word'`, it will end up using whatever tokenizer you pass it. If the analyzer is set to `'char'`, you can see in the sklearn source that it generates n-grams from characters, not words. I've never taken this approach and am not sure when it would be useful. Maybe if your textlookedsomethinglikethis?
As far as feature importance: with k-means, the top terms for a cluster are actually the terms nearest the centroid, so they are the "most important," or at least the most associated with the cluster. Outside this, I'll have to think about how you might get feature importance from unsupervised methods. If you're using a supervised method like SVM or NB, you can get feature importance from the classifier coefficients or weights.
For an SVM with sklearn it's something like:

```python
from sklearn import svm

clf = svm.SVC(kernel='linear')  # a linear kernel exposes coef_
clf.fit(features, labels)
print(clf.coef_)
```
If you look around, you can find explanations elsewhere of how to interpret the output, or you could check the math behind sklearn (if you're feeling bold).
Hi, I just went over your document clustering tutorial and it is really amazing! Great work! I am trying to cluster e-mails, so I have been altering the code a bit to fit my purpose. When I print the words in each cluster, I get the same word reiterated within a cluster (cluster 0: word1, word2, word1, word3, word4, etc.), or the same word appears in two or more clusters.