chrismattmann / tika-similarity

Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
Apache License 2.0

Clustering of files in Tika-Similarity not happening in an expected way. #66

Closed RashmiNalwad closed 8 years ago

RashmiNalwad commented 8 years ago

edit-value-similarity.py works fine and generates output.csv with correct similarity scores, but edit-cosine-circle-packing.py clusters those scores in an unexpected way.

Please find attached the generated output.csv file and circlepacking.html.

Kindly look into this.

Attachments: capture, output.txt

chrismattmann commented 8 years ago

please define "unexpected way"

RashmiNalwad commented 8 years ago

edit-value-similarity.py produces the scores [0.75, 1.0, 1.0, 0.75, 1.0, 0.875]. My understanding is that edit-cosine-circle-packing.py should cluster these by score value, creating 3 clusters: cluster 0 = [1.0, 1.0, 1.0], cluster 1 = [0.75, 0.75], and cluster 2 = [0.875].

But it's producing 3 clusters with these values: cluster 0 = [0.75, 1.0], cluster 1 = [0.75, 1.0, 1.0], and cluster 2 = [0.875].

When I checked the source code, I found that no actual clustering based on score values is happening.
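To make the expected grouping concrete, here is a minimal sketch of clustering by score value. It is only an illustration of the behaviour described above, not the code in edit-cosine-circle-packing.py; the function name `cluster_by_score`, the rounding precision, and the ordering rule (largest cluster first) are assumptions chosen so the result matches the cluster numbering given above.

```python
from collections import defaultdict

def cluster_by_score(scores, ndigits=3):
    """Group identical similarity scores together (hypothetical helper).

    Scores are rounded to `ndigits` places so that tiny floating-point
    differences do not split what should be one cluster.
    """
    groups = defaultdict(list)
    for score in scores:
        groups[round(score, ndigits)].append(score)
    # Assumed ordering: largest cluster first, ties broken by higher score.
    return sorted(groups.values(), key=lambda g: (-len(g), -g[0]))

scores = [0.75, 1.0, 1.0, 0.75, 1.0, 0.875]
clusters = cluster_by_score(scores)
for i, members in enumerate(clusters):
    print("cluster", i, "=", members)
```

With the six scores from output.csv this yields the three clusters listed as the expected result.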

chrismattmann commented 8 years ago

Got it. @RashmiNalwad can you suggest a PR? cc @harsham05

RashmiNalwad commented 8 years ago

Thanks @chrismattmann. I am working on this issue at https://github.com/RashmiNalwad/tika-similarity/pull/1. Requesting suggestions from @harsham05.

chrismattmann commented 8 years ago

Great, thanks @RashmiNalwad. @harsham05, please review.

RashmiNalwad commented 8 years ago

Fixed the clustering issue in edit_cosine_circle_packing.py: data is now clustered based on the similarity scores. The same issue is fixed in edit_cosine_cluster.py. See https://github.com/RashmiNalwad/tika-similarity

@harsham05, please review the same.
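For anyone following along, a rough sketch of what "clustered based on the similarity scores" could look like for the circle-packing view is below. This is not the code from the fix. It assumes each result is a (file_x, file_y, score) row and that the visualization reads a nested name/children hierarchy; the function `build_hierarchy`, the file names, and the pairing of files with scores are all hypothetical.

```python
import json
from collections import defaultdict

def build_hierarchy(rows, ndigits=3):
    """Nest (file_x, file_y, score) rows under one node per distinct score.

    The row shape and the output keys are assumptions for illustration.
    """
    groups = defaultdict(list)
    for file_x, file_y, score in rows:
        groups[round(score, ndigits)].append((file_x, file_y, score))
    # Assumed ordering: largest cluster first, ties broken by higher score.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), -kv[0]))
    return {
        "name": "clusters",
        "children": [
            {
                "name": "cluster" + str(i),
                "children": [
                    {"name": x + " vs " + y, "score": s} for x, y, s in members
                ],
            }
            for i, (_, members) in enumerate(ordered)
        ],
    }

# Hypothetical file pairs, using the six scores reported in this issue.
rows = [
    ("a.txt", "b.txt", 0.75),
    ("a.txt", "c.txt", 1.0),
    ("a.txt", "d.txt", 1.0),
    ("b.txt", "c.txt", 0.75),
    ("b.txt", "d.txt", 1.0),
    ("c.txt", "d.txt", 0.875),
]
hierarchy = build_hierarchy(rows)
print(json.dumps(hierarchy, indent=2))
```

Each top-level child then holds only pairs that share a score, so the circles group by similarity rather than by input order.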

chrismattmann commented 8 years ago

please submit a pull request to this repo @RashmiNalwad