ouverz opened this issue 8 years ago
Thanks @ouverz! In your example `cluster 0: word1, word2, word1, word3, word4`, the two `word1` instances are probably actually different forms of the word, but when you look them up in the dataframe it returns the first match on the index. Think of the dataframe as a dictionary akin to:
```python
{ 'run': 'running',
  'run': 'runner',
  'runs': 'runs' }
```
If you use this dictionary to look up the stem `run`, it could actually have been any of the three. To check whether this is the case, try changing the tokenizer from `tokenize_and_stem` to `tokenizer_only`. You'll also need to change the print statement so that you don't look up stems in the `vocab_frame` (e.g. `vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0]` should become something like just `terms[ind]`).
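To make the first-match behavior concrete, here is a minimal sketch of the lookup described above. The words and stems are made up for illustration, and it uses `.loc` (the modern equivalent of the tutorial's deprecated `.ix`):

```python
import pandas as pd

# vocab_frame maps stems back to the original words they came from.
# Duplicate index labels are exactly the problem: a lookup matches every
# row with that label, and taking [0][0] silently picks the first form.
vocab_frame = pd.DataFrame({"words": ["running", "runner", "runs"]},
                           index=["run", "run", "runs"])

# Looking up the stem 'run' matches two rows...
print(vocab_frame.loc["run"])

# ...but the tutorial's access pattern keeps only the first match:
first = vocab_frame.loc["run"].values.tolist()[0][0]
print(first)  # 'running' -- could just as well have been 'runner'
```

So two visually identical stems printed for a cluster can correspond to two different surface words.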
For 2. and 3. I'm not 100% sure I follow, but I'd be happy to take a look at the code if you can post a gist along with a sample of what the data looks like. Let me know if this helps!
Thanks for the info @brandomr!

Here are the clustered words, both stemmed & tokenized and tokenized only:

Stemmed:
- Cluster 0 words: departure, during, during, renting, machine
- Cluster 1 words: apartment, renting, fee, room, flat
- Cluster 2 words: room, deposit, security, security, located

Non-stemmed:
- Cluster 0 words: [u'room'], [u'deposit'], [u'security'], [u'security', u'deposit'], [u'located']
- Cluster 1 words: [u'departure'], [u'dryer', u'gas'], [u'machine', u'dryer'], [u'gas', u'cooker'], [u'oven', u'washing']
- Cluster 2 words: [u'apartment'], [u'rent'], [u'fee'], [u'flat'], [u'caution']
Please find the code and sample data below.

Code gist: https://gist.github.com/ouverz/bdf6b0f537726ae03161

Sample data:

> Hi The room is 18m2 each, very clean with a fully equipped kitchen,living room,private bathroom and toilet for each. There is washing machine, dish washer, including internet. The room is furnished with a double size mattress, reading table and closet . It is located in central neighborhood. The room is 425 Euros plus a security deposit of 225 Euros While the entire flat cost 910 Euros plus a security deposit of 450 Euros The security deposit is refundable at the termination of the lease. Hope to hear from you Etan
@ouverz sorry for the delay. Looking at your sample and the resultant clusters, it looks like you have pretty homogenous documents, which will have significant overlap. Your clusters are going to be impacted by the parameters you provide to the `TfidfVectorizer` here.
You might want to tune the `min_df` and `max_df` parameters. For example, I might try decreasing both so that the words you use as features are more "unique" to the clusters. If you set `min_df` to a whole number (e.g. 2), it means the word (feature) must exist in at least 2 documents in the corpus; if set to 0.1, it means the feature must occur in at least 10% of the documents. With the parameters as you've set them, you require the features to exist in no less than 20% of documents but no more than 80%. Given your corpus is pretty homogenous, you're going to see pretty similar clusters.
Thanks for your response. I apologise for getting back to this only now. What is happening with my model and data is quite odd. While it seems that the data is homogenous somehow unsupervised methods produce a near perfect separation between two classes which is mind boggling at this point. One point that stands out is the sheer drop in features from over 1000 to under 50 when I optimise between min_df 0.1 to 0.2 respectively. Then the accuracy goes from under 70% to over 90%. I had two questions - do you know what the difference is between using the 'tokeniser' parameter in the TfidfVectorizer and using 'analyser = word' there? It seems as if both are creating vectors or perhaps I do not understand what the analyser does. Additionally, do you know of any ability to extract variable/feature importance from these classifiers? So if i am running an SVM or NB? Is there some feature extraction module to understand the important features from these models?
Thanks a lot!
That does sound pretty intriguing. As for the number of features dropping when you increase `min_df`: that suggests to me that you have a significant number of features that occur in between 10% and 20% of the documents. You might look at word frequencies to verify this.
As far as the `analyzer` parameter: I actually haven't touched that. Looking at the docs, it appears that the `tokenizer` param only comes into play if `analyzer == 'word'`. Since `analyzer` has a default value of `'word'`, it will end up using whatever tokenizer you pass it. If the analyzer is set to `'char'`, you can see in the sklearn source that it generates n-grams from characters, not words. I've never taken this approach and am not sure when it would be useful. Maybe if your textlookedsomethinglikethis?
As far as feature importance: with k-means, the top terms for a cluster are actually the terms nearest the centroid, so they are the "most important," or at least the most associated with the cluster. Outside this, I'll have to think about how you might get feature importance from unsupervised methods. If you're using a supervised method like SVM or NB, you can get feature importance from the classifier coefficients or weights.
For an SVM with sklearn it's something like:

```python
from sklearn import svm

clf = svm.SVC(kernel='linear')  # a linear kernel exposes coef_
clf.fit(features, labels)
print(clf.coef_)
```
If you look around, you can find explanations elsewhere of how to interpret the output, or you could check the math behind sklearn (if you're feeling bold).
Hi, I just went over your document clustering tutorial and it is really amazing! Great work! I am trying to cluster e-mails, so I have been altering the code a bit to fit my purpose. When I print the words in each cluster, I get the same word reiterated within a cluster (cluster 0: word1, word2, word1, word3, word4, etc.), or the same word appears in two or more clusters.