dhoule / CS533-Big-Data-Data-Mining

A loosely coupled parallel DBSCAN algorithm using the disjoint-set data structure, from Patwary, M. M. A., et al. The MAKEFILE has been fixed, the compile-time errors have been fixed, and the run-time error that numbered clusters incorrectly has been fixed.

Link to paper: http://users.ece.northwestern.edu/~choudhar/Publications/ANewScalableParallelDBSCANAlgorithmUsingDisjointSetDataStructure.pdf

To download the original, buggy code: http://cucis.ece.northwestern.edu/projects/Clustering/download_code_dbscan.html
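The core merging step in this approach relies on a union-find (disjoint-set) structure: every point starts in its own set, and sets are united as density-connected points are discovered. Below is a minimal, generic union-find with path compression and union by rank, shown for illustration only; it is not the implementation in clusters.cpp or dbscan.cpp, and the class and member names are made up here.

    #include <cstddef>
    #include <numeric>
    #include <utility>
    #include <vector>

    // Minimal union-find (disjoint-set) with path halving and union by rank.
    // Illustration only -- not the structure implemented in clusters.cpp/dbscan.cpp.
    class DisjointSet {
    public:
        explicit DisjointSet(std::size_t n) : parent(n), rank_(n, 0) {
            std::iota(parent.begin(), parent.end(), std::size_t(0));  // each point is its own set
        }

        // Return the representative of x's set, compressing the path along the way.
        std::size_t find(std::size_t x) {
            while (parent[x] != x) {
                parent[x] = parent[parent[x]];  // path halving
                x = parent[x];
            }
            return x;
        }

        // Merge the sets containing a and b (union by rank).
        void unite(std::size_t a, std::size_t b) {
            a = find(a);
            b = find(b);
            if (a == b) return;
            if (rank_[a] < rank_[b]) std::swap(a, b);
            parent[b] = a;
            if (rank_[a] == rank_[b]) ++rank_[a];
        }

    private:
        std::vector<std::size_t> parent;
        std::vector<std::size_t> rank_;
    };

Roughly speaking, the algorithm in the paper unites each core point with the points in its epsilon-neighborhood (locally on each process first, then across processes), so a point's final cluster label is the representative returned by find.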

    // Files:       mpi_main.cpp clusters.cpp clusters.h utils.h utils.cpp
    //              dbscan.cpp dbscan.h kdtree2.cpp kdtree2.hpp
    //              geometric_partitioning.h geometric_partitioning.cpp
    //
    // Description: an mpi implementation of dbscan clustering algorithm
    //              using the disjoint set data structure
    //
    // Author:      Md. Mostofa Ali Patwary
    //              EECS Department, Northwestern University
    //              email: mpatwary@eecs.northwestern.edu
    //
    // Copyright, 2012, Northwestern University
    // See COPYRIGHT notice in top-level directory.
    //
    // Please cite the following publication if you use this package:
    //
    // Md. Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao,
    // Fredrik Manne, and Alok Choudhary, "A New Scalable Parallel DBSCAN
    // Algorithm Using the Disjoint Set Data Structure", Proceedings of the
    // International Conference on High Performance Computing, Networking,
    // Storage and Analysis (Supercomputing, SC'12), pp. 62:1-62:11, 2012.

A Disjoint-Set Data Structure based Parallel DBSCAN clustering implementation (MPI version)

How to run the tool:

  1. Compile the source files using the following command

    make

  2. Run using the following command

    mpiexec -n number_of_processes ./mpi_dbscan -i filename -b -m minpts -e epsilon -o output[optional]

    Example:

    mpiexec -n 8 ./mpi_dbscan -i clus50k.bin -b -m 5 -e 25 -o clus50k_clusters.nc

    Run the following to get a detailed description of the program arguments:

    ./mpi_dbscan ?

  3. Input file format:

    binary file: the number of points N and the number of dimensions D (each a 4-byte integer), followed by the point coordinates (N x D floating-point numbers). A sketch that writes a file in this layout appears after this list.

  4. Output file format (optional; statistics about the clustering solution can be obtained without writing the clusters to a file):

    netCDF file: the coordinates are stored as columns (position_col_X1, position_col_X2, ...), plus one additional column named cluster_id holding the cluster id each point belongs to. A sketch that reads this column back appears after this list.
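For illustration, the sketch below writes a tiny input file in the binary layout described in step 3. The file name and point values are made up here, and it assumes N and D are native-endian 4-byte signed integers; compare against the provided sample data (e.g. clus50k.bin) before relying on it.

    #include <cstdint>
    #include <fstream>
    #include <vector>

    // Writes N, D (4 bytes each) followed by N x D float coordinates.
    // Assumes native byte order and 4-byte signed integers for N and D.
    int main() {
        const std::int32_t n = 4;  // number of points (toy example)
        const std::int32_t d = 2;  // number of dimensions
        const std::vector<float> coords = {
            0.0f,  0.0f,
            1.0f,  0.5f,
            10.0f, 10.0f,
            10.5f, 9.5f
        };

        std::ofstream out("toy_input.bin", std::ios::binary);
        out.write(reinterpret_cast<const char*>(&n), sizeof(n));
        out.write(reinterpret_cast<const char*>(&d), sizeof(d));
        out.write(reinterpret_cast<const char*>(coords.data()),
                  static_cast<std::streamsize>(coords.size() * sizeof(float)));
        return out.good() ? 0 : 1;
    }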
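Similarly, here is a minimal sketch of reading the cluster_id column back with the netCDF C API. It assumes cluster_id is a one-dimensional integer variable of length N; that assumption has not been checked against the writer in this repository, so adjust the type if it differs. Link against netCDF (e.g. -lnetcdf).

    #include <cstdio>
    #include <vector>
    #include <netcdf.h>

    // Reads the cluster_id variable from the netCDF output file.
    // Assumes cluster_id is a 1-D integer variable of length N (the point count).
    int main() {
        int ncid, varid, dimid;
        size_t n = 0;

        if (nc_open("clus50k_clusters.nc", NC_NOWRITE, &ncid) != NC_NOERR) return 1;
        if (nc_inq_varid(ncid, "cluster_id", &varid) != NC_NOERR) return 1;
        if (nc_inq_vardimid(ncid, varid, &dimid) != NC_NOERR) return 1;  // first (only) dimension
        if (nc_inq_dimlen(ncid, dimid, &n) != NC_NOERR || n == 0) return 1;

        std::vector<int> cluster_id(n);
        if (nc_get_var_int(ncid, varid, cluster_id.data()) != NC_NOERR) return 1;
        nc_close(ncid);

        std::printf("read %zu points; first point is in cluster %d\n", n, cluster_id[0]);
        return 0;
    }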