chrismattmann / tika-similarity

Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
Apache License 2.0

Created affinity_propogation.py for clustering #91

Closed kavyabvishwanath closed 6 years ago

kavyabvishwanath commented 6 years ago

The script takes a distance metric as input (editdistance, jaccards, or cosine) and computes a distance matrix, which is then passed to Affinity Propagation for clustering. The script is generic enough that new distance metrics can be added and passed to Affinity Propagation. If a directory is provided as input, it uses metadata from tika-parser to cluster the files within that directory.
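As a rough illustration of the pipeline described above (metric, then distance matrix, then Affinity Propagation), here is a minimal sketch. It is not the code from this PR. The names `METRICS`, `distance_matrix`, and `cluster` are hypothetical, the Jaccard and cosine metrics are assumed to work on whitespace-separated tokens, and negating distances to get the similarities that scikit-learn's `AffinityPropagation(affinity="precomputed")` expects is also an assumption about how the matrix might be handed over.

```python
from collections import Counter
from math import sqrt


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def jaccard_distance(a, b):
    """1 - (intersection size / union size) over whitespace tokens (assumed tokenisation)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return 0.0 if not union else 1.0 - len(sa & sb) / len(union)


def cosine_distance(a, b):
    """1 - cosine similarity of token-count vectors (assumed tokenisation)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Hypothetical registry: a new metric is added by registering one more function.
METRICS = {
    "editdistance": edit_distance,
    "jaccards": jaccard_distance,
    "cosine": cosine_distance,
}


def distance_matrix(items, metric):
    """Symmetric pairwise distance matrix for a list of strings."""
    dist = METRICS[metric]
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = float(dist(items[i], items[j]))
    return matrix


def cluster(matrix):
    """Cluster from a distance matrix; needs scikit-learn, imported lazily."""
    from sklearn.cluster import AffinityPropagation

    # Affinity Propagation works on similarities, so distances are negated (assumption).
    similarity = [[-d for d in row] for row in matrix]
    model = AffinityPropagation(affinity="precomputed", random_state=0)
    return model.fit(similarity).labels_.tolist()
```

With this layout, `cluster(distance_matrix(values, "editdistance"))` would return one cluster label per input string, and supporting another metric only requires a new entry in `METRICS`.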

Example usage :

  1. To cluster files in the input directory:
     python affinity_propagation.py --inputDir test --distance editdistance --uniqueId resourceName

Example D3 visualisation using edit distance:

[Image: affinity_propogation_clustering_files]

  2. To cluster JSON objects in the JSON input file:
     python affinity_propagation.py --inputFile [input json file] --distance [editdistance,jaccards,cosine] --config [config file with attribute:datatype] --jsonKey [key of json to read data from] --uniqueId [unique id in the dataset]

Example D3 visualisation using edit distance:

[Image: affinity_propogation_clustering_json_objects]

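For readers who want to see how the flags in the two usage examples could fit together, here is a hedged `argparse` sketch. Only the flag names and the three metric names come from the examples above. Treating `--inputDir` and `--inputFile` as mutually exclusive, making `--distance` required, and the help strings are assumptions, and `build_parser` is a hypothetical helper rather than code from the script.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Cluster files or JSON objects with Affinity Propagation."
    )
    # Assumption: exactly one input source is given per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--inputDir", help="directory of files to cluster")
    source.add_argument("--inputFile", help="JSON file of objects to cluster")
    parser.add_argument(
        "--distance",
        required=True,  # assumption: no default metric is stated in the PR
        choices=["editdistance", "jaccards", "cosine"],
        help="distance metric used to build the distance matrix",
    )
    parser.add_argument("--uniqueId", help="attribute that identifies each item")
    parser.add_argument("--config", help="config file with attribute:datatype entries")
    parser.add_argument("--jsonKey", help="key of the JSON to read data from")
    return parser


if __name__ == "__main__":
    # Parse the first usage example from an explicit list instead of sys.argv.
    example = ["--inputDir", "test", "--distance", "editdistance",
               "--uniqueId", "resourceName"]
    print(build_parser().parse_args(example))
```

Parsing the directory example fills `inputDir`, `distance`, and `uniqueId` and leaves the JSON-only options as `None`.
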
chrismattmann commented 6 years ago

thank you - awesome!!