brandomr / document_cluster

A guide to document clustering in Python
http://brandonrose.org/clustering
cosine_similarity(x,y) #11

Open bkieler opened 7 years ago

bkieler commented 7 years ago

I attempted to apply the method to clustering tweets. I may be misunderstanding how this works, but running cosine_similarity(matrix) only worked when my data was very small (500 tweets). Once I went to 150,000 tweets, I received memory errors. Following the documentation at http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html, I changed the call to cosine_similarity(matrix[len - 1], matrix), which I found in another example that I can no longer locate.

Is there a reason your code runs it without passing the x and y separately?

brandomr commented 7 years ago

@bkieler as for the specific question, you can test this with a basic example:

from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity expects 2-D input: one row per vector
ar1 = [[0,3,4,1,3,5]]
ar2 = [[1,2,4,3,1,3]]

print(cosine_similarity(ar1, ar2))

returns [[ 0.87773382]] and if you add

# the same two vectors stacked as rows of one array
ar3 = [[0,3,4,1,3,5],[1,2,4,3,1,3]]
print(cosine_similarity(ar3))

you are returned a new array:

array([[ 1.        ,  0.87773382],
       [ 0.87773382,  1.        ]])

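The only difference is the shape of the result. With a single argument, cosine_similarity(X) compares every row of X with every other row, which is the same as calling cosine_similarity(X, X), so you get a square array whose off-diagonal entries match the value from the two-argument call. Functionally the calculation is the same.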
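A minimal sketch to check this equivalence, assuming a recent scikit-learn that requires 2-D input. The variable names (X, full, explicit, pair) are made up for illustration; the vectors are the two from the example above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# The two example vectors, stacked as rows of one 2-D array.
X = np.array([[0, 3, 4, 1, 3, 5],
              [1, 2, 4, 3, 1, 3]])

full = cosine_similarity(X)         # Y omitted: rows of X compared with each other
explicit = cosine_similarity(X, X)  # Y passed explicitly

# One row against one row gives a (1, 1) array.
pair = cosine_similarity(X[0:1], X[1:2])

print(full.shape, pair.shape)
print(np.allclose(full, explicit))          # same matrix either way
print(np.isclose(full[0, 1], pair[0, 0]))   # off-diagonal entry equals the pair value
```

So omitting the second argument does not change the calculation, only how many pairs are computed in one call.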

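The OOM errors you are seeing are most likely down to the size of the problem. The tfidf matrix itself is large, and cosine_similarity over all of it returns a dense n x n array: for 150,000 tweets that is 2.25e10 values, roughly 180 GB as float64. You might need to reduce the max_features allowed in the TfidfVectorizer parameters, although that only shrinks the input and not the n x n result. scikit-learn is trying to run this calculation in memory and you're just running out of it. If you want to operate on a large dataset you might need to use a computing cluster and something like Spark MLlib.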
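For a sense of scale, and as one possible way to stay in memory on a single machine, here is a rough sketch that computes the similarities in row batches and keeps only the nearest neighbours of each document. This is not from the guide: dense_similarity_gib and top_k_similar are hypothetical helpers, the toy docs are made up, and whether top-k neighbours are enough depends on what you do next.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def dense_similarity_gib(n_docs, bytes_per_value=8):
    """Approximate size in GiB of a dense n_docs x n_docs float64 matrix."""
    return n_docs * n_docs * bytes_per_value / 2**30


def top_k_similar(tfidf, k=2, batch_size=2):
    """Hypothetical helper: for each row, indices of its k most similar other rows.

    Only a (batch_size x n_docs) block of similarities is dense at any time.
    """
    n_docs = tfidf.shape[0]
    neighbors = np.empty((n_docs, k), dtype=int)
    for start in range(0, n_docs, batch_size):
        stop = min(start + batch_size, n_docs)
        block = cosine_similarity(tfidf[start:stop], tfidf)  # dense block
        # Exclude each document's similarity with itself.
        block[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Sort each row by descending similarity and keep the first k columns.
        neighbors[start:stop] = np.argsort(-block, axis=1)[:, :k]
    return neighbors


# Made-up toy corpus standing in for the tweets.
docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
    "markets fell as stock prices dropped",
]
tfidf = TfidfVectorizer(max_features=1000).fit_transform(docs)  # sparse matrix

print(dense_similarity_gib(150_000))  # size of the full matrix for 150,000 docs
print(top_k_similar(tfidf, k=1))
```

In practice batch_size would be in the hundreds or thousands; it is tiny here only so the toy corpus spans more than one batch.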

GlorianY commented 7 years ago

Hi!

I also have a similar problem. My TF-IDF matrix is huge, so I tried the workaround suggested by @bkieler, that is, calling cosine_similarity(matrix[len - 1], matrix).

However, this leads to a problem when visualizing the data, specifically in the line "pos = mds.fit_transform(dist)". The problem is that "dist" has to be an array. Because of the workaround that I mentioned above, "dist" returns a value instead of an array.

The question is, how should I modify the code (i.e. dist) so that it works with the workaround?
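A small sketch of what the workaround does to the shape of dist, assuming dist is computed as 1 - cosine_similarity(...) and that the MDS step uses precomputed distances, which needs a square n x n array. The random matrix is a made-up stand-in for a TF-IDF matrix, and taking a random subsample is just one possible option, not something proposed in this thread.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
matrix = rng.random((1000, 20))  # stand-in for a TF-IDF matrix (made-up data)

# The workaround: last row against every row gives one row of distances.
row = 1 - cosine_similarity(matrix[-1:], matrix)
print(row.shape)   # (1, 1000)

# Precomputed-distance MDS expects one distance per pair of points,
# i.e. a square array. One option: build it for a random subsample only.
sample_idx = rng.choice(matrix.shape[0], size=100, replace=False)
dist = 1 - cosine_similarity(matrix[sample_idx])
print(dist.shape)  # (100, 100)
```

The (1, 1000) result only holds the distances from the last document to every other one, which is why it cannot be passed on as a full distance matrix.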