HDBSCAN* main module.
Compiled using JRE 8.
The source code is provided as an Eclipse project.
Implemented by Zachary Jullion (zjullion@ualberta.ca)
Original paper:
CAMPELLO, R. J. G. B.; MOULAVI, D.; SANDER, J.; ZIMEK, A.; Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Trans. Knowl. Discov. Data.
Included in this distribution is an example data set (example_data_set.csv) which consists of 500 objects, each with 2 attributes.
Also included is an example constraints file (example_constraints.csv) which has 10 constraints for the example data set given above.
This constraints file is an optional input for the algorithm.
DISCLAIMER: For any type of performance evaluation, the user must set the "compact" flag to true.
Program help:
Executes the HDBSCAN* algorithm, which produces a hierarchy, cluster tree, flat partitioning, and outlier scores for an input data set.
Usage: java -jar HDBSCANStar.jar file=<input file> minPts=<minPts value> minClSize=<minClSize value> [constraints=<constraints file>] [compact={true,false}] [dist_function=<distance function>]
By default the hierarchy produced is non-compact (full), and Euclidean distance is used.
Example usage: "java -jar HDBSCANStar.jar file=input.csv minPts=4 minClSize=4"
Example usage: "java -jar HDBSCANStar.jar file=collection.csv minPts=6 minClSize=1 constraints=collection_constraints.csv dist_function=manhattan"
Example usage: "java -jar HDBSCANStar.jar file=data_set.csv minPts=8 minClSize=8 compact=true"
If you compiled the source yourself instead of using the jar, run the main class directly: "java HDBSCANStarRunner file=data_set.csv minPts=8 minClSize=8 compact=true"
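For readers unfamiliar with the two distance names that appear above (the Euclidean default and dist_function=manhattan), the following is a minimal sketch of how each is commonly computed for two objects with numeric attributes. The class name DistanceSketch and its method names are hypothetical and are not taken from the HDBSCAN* source; the distribution may implement these functions differently.

```java
// Illustrative sketch only: DistanceSketch is a hypothetical class,
// not part of the HDBSCAN* distribution.
public final class DistanceSketch {

    // Euclidean distance: square root of the sum of squared attribute differences.
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length && i < b.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Manhattan distance: sum of absolute attribute differences.
    public static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length && i < b.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] p = {0.0, 0.0};
        double[] q = {3.0, 4.0};
        System.out.println(euclidean(p, q)); // prints 5.0
        System.out.println(manhattan(p, q)); // prints 7.0
    }
}
```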
The input data set file must be a comma-separated values (CSV) file, where each line represents one object and that object's attributes are separated by commas.
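As an illustration of the input format described above, the sketch below reads such a file into one array of attribute values per object. It is not the reader used by HDBSCAN*; the class name CsvDataSetSketch is hypothetical, and the sketch assumes that every attribute is numeric, that there is no header row, and that blank lines can be skipped.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: CsvDataSetSketch is a hypothetical class,
// not part of the HDBSCAN* distribution.
public final class CsvDataSetSketch {

    // Converts CSV lines into rows of numeric attributes, one row per object.
    // Assumption: all attributes are numeric and there is no header row.
    public static double[][] parse(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split(",");
            double[] row = new double[fields.length];
            for (int i = 0; i < fields.length; i++) {
                row[i] = Double.parseDouble(fields[i].trim());
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    // Reads the whole file into memory, then parses it.
    public static double[][] read(String fileName) throws IOException {
        return parse(Files.readAllLines(Paths.get(fileName)));
    }

    public static void main(String[] args) throws IOException {
        double[][] data = read(args[0]);
        int attributes = data.length > 0 ? data[0].length : 0;
        System.out.println(data.length + " objects, " + attributes + " attributes");
    }
}
```

Running the sketch's main method on a file shaped like the included example data set would be a quick way to check the object and attribute counts before starting a clustering run.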
The algorithm will produce five files: the hierarchy, cluster tree, final flat partitioning, outlier scores, and an auxiliary file for visualization.
The hierarchy file will be named _hierarchy.csv for a non-compact (full) hierarchy, and _compact_hierarchy.csv for a compact hierarchy.
The hierarchy file will have the following format on each line:
<hierarchy scale (epsilon radius)>,<label for object 1>,<label for object 2>,...,
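A line in this format can be split into its scale value and its per-object labels, as in the sketch below. The class name HierarchyLineSketch is hypothetical, and the sketch assumes that labels are written as integers and that a line may end with a trailing comma; neither assumption is confirmed by this README.

```java
// Illustrative sketch only: HierarchyLineSketch is a hypothetical class,
// not part of the HDBSCAN* distribution.
public final class HierarchyLineSketch {

    public final double scale; // hierarchy scale (epsilon radius) for this line
    public final int[] labels; // labels[i] is the label of object i + 1

    private HierarchyLineSketch(double scale, int[] labels) {
        this.scale = scale;
        this.labels = labels;
    }

    // Parses "<scale>,<label 1>,<label 2>,..." into a scale and a label array.
    // Assumption: labels are integers.
    public static HierarchyLineSketch parse(String line) {
        // String.split drops trailing empty strings, so a trailing comma is harmless.
        String[] fields = line.trim().split(",");
        double scale = Double.parseDouble(fields[0].trim());
        int[] labels = new int[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            labels[i - 1] = Integer.parseInt(fields[i].trim());
        }
        return new HierarchyLineSketch(scale, labels);
    }

    public static void main(String[] args) {
        HierarchyLineSketch level = parse("0.75,1,1,2,0,");
        System.out.println(level.scale);         // prints 0.75
        System.out.println(level.labels.length); // prints 4
    }
}
```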
The cluster tree file will be named _tree.csv
The cluster tree file will have the following format on each line:
,,,,,,, is the character offset of the line in the hierarchy file at which the cluster first appears.
The final flat partitioning file will be named _partition.csv
The final flat partitioning file will have the following format on a single line: