antoniocavalcante/hdbscan-java

HDBSCAN* main module. Compiled using JRE 8. The source code is provided as an Eclipse project.

Implemented by Zachary Jullion (zjullion@ualberta.ca)

Original paper: CAMPELLO, R. J. G. B.; MOULAVI, D.; SANDER, J.; ZIMEK, A.; Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Trans. Knowl. Discov. Data.

Included in this distribution is an example data set (example_data_set.csv) which consists of 500 objects, each with 2 attributes. Also included is an example constraints file (example_constraints.csv) which has 10 constraints for the example data set given above. This constraints file is an optional input for the algorithm.

DISCLAIMER: For any type of performance evaluation, the user must set the "compact" flag to true.

Program help:

Executes the HDBSCAN* algorithm, which produces a hierarchy, cluster tree, flat partitioning, and outlier scores for an input data set.

Usage: java -jar HDBSCANStar.jar file= minPts= minClSize= [constraints=] [compact={true,false}] [dist_function=]

By default the hierarchy produced is non-compact (full), and euclidean distance is used.

Example usage: "java -jar HDBSCANStar.jar file=input.csv minPts=4 minClSize=4"
Example usage: "java -jar HDBSCANStar.jar file=collection.csv minPts=6 minClSize=1 constraints=collection_constraints.csv dist_function=manhattan"
Example usage: "java -jar HDBSCANStar.jar file=data_set.csv minPts=8 minClSize=8 compact=true"

In cases where the source is compiled, use the following: "java HDBSCANStarRunner file=data_set.csv minPts=8 minClSize=8 compact=true"
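The two distance functions named in the help text (euclidean, the default, and manhattan) can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from the HDBSCAN* source; the sketch only shows the standard definitions of the two metrics on attribute vectors.

```java
// Hypothetical illustration; not the project's own distance classes.
final class DistanceSketch {

    /** Euclidean (L2) distance between two equal-length attribute vectors. */
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Manhattan (L1) distance between two equal-length attribute vectors. */
    public static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p = {0.0, 0.0};
        double[] q = {3.0, 4.0};
        System.out.println(euclidean(p, q)); // prints 5.0
        System.out.println(manhattan(p, q)); // prints 7.0
    }
}
```

For the same pair of objects the Manhattan distance is never smaller than the Euclidean one, so the choice of dist_function can change which objects count as dense neighbours.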

The input data set file must be a comma-separated value (CSV) file, where each line represents an object, with attributes separated by commas. The algorithm will produce five files: the hierarchy, cluster tree, final flat partitioning, outlier scores, and an auxiliary file for visualization.
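As a rough sketch of the input format described above, the following reads CSV lines into one attribute vector per object. The class name and the decision to skip blank lines are assumptions made for illustration; this is not the reader used by HDBSCAN* itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the input format; not the project's own reader.
final class CsvDataSetSketch {

    /** Converts CSV lines into one double[] of attributes per object. */
    public static double[][] parse(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no object
            }
            String[] fields = line.split(",");
            double[] row = new double[fields.length];
            for (int i = 0; i < fields.length; i++) {
                row[i] = Double.parseDouble(fields[i].trim());
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("1.0,2.0", "3.5,4.5");
        double[][] data = parse(sample);
        // prints "2 objects, 2 attributes"
        System.out.println(data.length + " objects, " + data[0].length + " attributes");
    }
}
```

To try it on a real file, the lines could be obtained with `java.nio.file.Files.readAllLines(java.nio.file.Paths.get("example_data_set.csv"))`.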

The hierarchy file will be named _hierarchy.csv for a non-compact (full) hierarchy, and _compact_hierarchy.csv for a compact hierarchy.
The hierarchy file will have the following format on each line:
<hierarchy scale (epsilon radius)>,<label for object 1>,<label for object 2>,...,
Noise objects are labelled zero.
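One line of the hierarchy file could be read along the following lines. This is a sketch under two assumptions that the text above does not state explicitly: that cluster labels are integers, and that a trailing comma (if present) can be ignored. The class name is hypothetical.

```java
// Hypothetical illustration of the hierarchy line format; not part of HDBSCAN*.
final class HierarchyLineSketch {
    public final double scale;  // hierarchy scale (epsilon radius)
    public final int[] labels;  // one label per object; 0 means noise

    public HierarchyLineSketch(double scale, int[] labels) {
        this.scale = scale;
        this.labels = labels;
    }

    /** Parses "<scale>,<label 1>,<label 2>,..." with an optional trailing comma. */
    public static HierarchyLineSketch parse(String line) {
        // String.split drops trailing empty strings, so a final comma is harmless.
        String[] fields = line.trim().split(",");
        double scale = Double.parseDouble(fields[0].trim());
        int[] labels = new int[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            labels[i - 1] = Integer.parseInt(fields[i].trim());
        }
        return new HierarchyLineSketch(scale, labels);
    }

    /** Counts the objects labelled zero (noise) at this level. */
    public int noiseCount() {
        int count = 0;
        for (int label : labels) {
            if (label == 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        HierarchyLineSketch level = parse("0.75,1,1,2,0,");
        // prints "scale=0.75 objects=4 noise=1"
        System.out.println("scale=" + level.scale
                + " objects=" + level.labels.length
                + " noise=" + level.noiseCount());
    }
}
```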

The cluster tree file will be named _tree.csv
The cluster tree file will have the following format on each line:
,,,,,,,
One of these fields is the character offset of the line in the hierarchy file at which the cluster first appears.

The final flat partitioning file will be named _partition.csv
The final flat partitioning file will have the following format on a single line:
,,...,

The outlier scores file will be named _outlier_scores.csv
The outlier scores file will be sorted from 'most inlier' to 'most outlier', and will have the following format on each line:
,
