genbattle / dkm

A generic C++11 k-means clustering implementation
MIT License
209 stars 47 forks source link

Can you expand in the readme to give more instructions? #3

Closed kevinGodell closed 6 years ago

kevinGodell commented 6 years ago

I just found your repo because I am trying to make clusters from x, y coordinates that I extract from frames from ffmpeg and a remote camera. I ran some live data through the example given in the readme and I am not sure what the returned data is exactly. I have virtually no C++ experience, so please excuse me. When accessing std::cout<<std::get<0>(means)<<std::endl; I get an array of 2 arrays that are structured like [[482, 212], [191, 393]]. Are these the centroids of a cluster? Can your code be used to create a bounding box around clusters of coordinates and if yes, can you please show me? Thanks for the code!

genbattle commented 6 years ago

Hi Kevin,

The comment in the header file for kmeans_lloyd gives some information about what is returned by that method:

Returns a std::tuple containing:
  0: A vector holding the means for each cluster from 0 to k-1.
  1: A vector containing the cluster number (0 to k-1) for each corresponding element of the input
     data vector.

Additionally, the unit tests and benchmarks are intended to also partially serve the role of examples. Adding more documentation about use to the README seems like a pretty easy way of increasing the usability of this library. In the meantime I'll try and answer your questions directly as well as I can.

The kmeans_lloyd method returns a tuple containing two vectors. The first vector is, as you guessed, a list of coordinates for the centre of each cluster (the centroids). The second vector is a list indicating which of the clusters in the first list the points passed into the function belong to. This information was intended to be used for visualising cluster membership, or for further filtering/processing of points based on cluster membership.

The clusters don't have bounding boxes per se. Data points are assigned to clusters based on the closest centroid, so the points in your 2D space which are equidistant between two or more centroids are "boundaries". These boundaries can be plotted by generating a Voroni Diagram using the centroids. For C++ I recommend using something like the boost polygon Voroni extensions to generate the Voroni map. It will give you a collection of segments that you can render to your output to show the boundaries between the clusters. You will need to convert the points that are returned by std::get<0>(means) from type std::array<int, 2> to a boost::polygon::point_data<int>. Read up on the boost::polygon documentation for more info.

Does that cover everything you wanted to know?

kevinGodell commented 6 years ago

Thanks for the additional info. You have cleared up many things for me, but of course that leads me to more questions, if you don't mind.

I have been counting the points in and the points out, and it seems they they are all included with a number referencing which center they are grouped with. Is there any way to filter out any points from being included if they are too far from any calculated center by setting some type of number representing max range from center?

As for the visualization of the data, I am not doing anything too advanced. I just simply have to grab the top left and bottom right point which will represent some rectangle. Those points will be sent to the user's browser and used in some script to plot a rectangle overlaying the video. Luckily, I don't have to make any polygons or any other complicated shapes.

genbattle commented 6 years ago

What you want to do isn't k-means. If you take points out of the centroid calculation, then you're changing the centroid, which may cause other points to then fall outside your max range (or back into it). I think it would be better to leave the calculation of centroids as it is, but modify the code to calculate a variance metric in both dimensions. E.g. you might like to calculate the standard deviation in each dimension for each cluster. This would give you a width and height for your bounding box, which you can draw centred on the corresponding centroid. You could also just take the min and max x and y values in each cluster to generate bounding boxes that encompass all of the values in each cluster. You could calculate either of these using the means and cluster labels generated by kmeans_lloyd, or you could modify the code in kmeans_lloyd to also calculate and return these values as part of its tuple return value.

kevinGodell commented 6 years ago

Thanks for the tips.