accord-net / framework

Machine learning, computer vision, statistics and general scientific computing for .NET
http://accord-framework.net
GNU Lesser General Public License v2.1

BalancedKMeans.cs: It does not find a solution for this case #451


jagbarcelo commented 7 years ago

I have prepared this little method with all the data you need to try/debug/fix the issue:

public static int[] Test()
{
    int numClusters = 6;
    double[][] observations = {
        new double[] { 10.8, 18.706148721743876 },
        new double[] { -10.8, 18.706148721743876 },
        new double[] { -21.6, 0.0 },
        new double[] { -10.8, -18.706148721743876 },
        new double[] { 10.8, -18.706148721743876 },
        new double[] { 21.6, 0.0 },
        new double[] { 32.400000000000006, 18.706148721743876 },
        new double[] { 21.600000000000005, 37.412297443487752 },
        new double[] { 3.5527136788005009E-15, 37.412297443487752 },
        new double[] { -21.599999999999998, 37.412297443487752 },
        new double[] { -32.4, 18.706148721743876 },
        new double[] { -43.2, 0.0 },
        new double[] { -32.400000000000006, -18.706148721743876 },
        new double[] { -21.600000000000005, -37.412297443487752 },
        new double[] { -3.5527136788005009E-15, -37.412297443487752 },
        new double[] { 21.599999999999998, -37.412297443487752 },
        new double[] { 32.4, -18.706148721743876 },
        new double[] { 43.2, 0.0 } };

    Accord.Math.Random.Generator.Seed = 0;
    Accord.MachineLearning.BalancedKMeans kmeans = new Accord.MachineLearning.BalancedKMeans(numClusters);

    // If a limit is not set, the following Learn call does not return....
    kmeans.MaxIterations = 1000;
    Accord.MachineLearning.KMeansClusterCollection clusters = kmeans.Learn(observations);

    int[] labels = clusters.Decide(observations);
    return labels;
}

The following image shows the distribution of the points (they are the centers of each numbered circle):

[image: balancedkmeans-18points]

It might seem a rather borderline example (all points are evenly spaced on a triangular grid), but there are only 18 points to be divided into 6 clusters (3 items each). A valid solution can easily be found visually, and even a brute-force backtracking algorithm solves it for such a small number of points (although that approach does not scale at all as the numbers grow).

For example, a valid solution would be: (0,6,7), (1,8,9), (2,10,11), (3,12,13), (4,14,15), (5,16,17)

However, the code above returns this solution: int[] labels = { 5, 4, 2, 0, 1, 3, 3, 3, 4, 4, 4, 2, 0, 0, 0, 0, 3, 3 }; As you can see, the clusters are not balanced: cluster 0 has 5 elements, cluster 1 has only 1 element, and so forth.
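
A quick way to verify the imbalance is to count the labels. A minimal sketch, assuming the numClusters and labels variables from the test method above:

// Count how many observations land in each cluster.
// With 18 points and 6 clusters, every count should be exactly 3.
int[] counts = new int[numClusters];
foreach (int label in labels)
    counts[label]++;

for (int i = 0; i < counts.Length; i++)
    Console.WriteLine($"Cluster {i}: {counts[i]} elements");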

If you comment out the line setting MaxIterations, the Learn call does not return for quite a long time (more than 15 minutes). If you leave it in, effectively short-circuiting the computation, the Proportions are calculated incorrectly. See my previous comment: https://github.com/accord-net/framework/issues/450

cesarsouza commented 7 years ago

Hi @jagbarcelo,

Thanks for the very descriptive report! Please take a look at my answer to #450. The .Labels property of the kmeans object will contain the answer you expect. On my machine, I get:

1, 5, 5, 5, 4, 3, 0, 1, 4, 4, 3, 2, 3, 0, 0, 2, 2, 1
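
In code, that means reading the assignment from the learner itself instead of calling Decide. A minimal sketch based on the test method above:

// The balanced assignment is stored on the kmeans object itself;
// clusters.Decide() performs a plain nearest-centroid assignment,
// which is not guaranteed to be balanced.
var kmeans = new Accord.MachineLearning.BalancedKMeans(numClusters)
{
    MaxIterations = 1000
};
kmeans.Learn(observations);
int[] balancedLabels = kmeans.Labels;   // one label per observation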

Hope it helps and sorry for the lack of documentation!

Cesar

jagbarcelo commented 7 years ago

I cannot manage to reproduce your result. It seems that every execution returns a different set of Labels. I thought that with a fixed Seed the results would be reproducible.

Anyway, if I force your solution 1, 5, 5, 5, 4, 3, 0, 1, 4, 4, 3, 2, 3, 0, 0, 2, 2, 1 into my program, this is the coloured image it produces:

[image: balancedkmeans-18points-wrong-solution]

As you can see, the coloured groups (clusters) are not really grouped; some of them are not even contiguous. For example, IDs 6, 13 and 14 belong to cluster 0 (light green), IDs 4, 8 and 9 to cluster 4 (light blue), and IDs 5, 10 and 12 to cluster 3.

Ideally, it should find a solution such as:

[image: balancedkmeans-18points-good-solution]

However, no matter how high I set MaxIterations, or which random Seeds I use, I cannot get BalancedKMeans to find anything close to that solution.

Maybe (I'm guessing) this algorithm is not the right one for my problem? I thought it was. I need to make groups (clusters) that minimise the average distance from each element to the centre of its group. The groups need to be packed (contiguous), and I think that is implied by the 'minimise average distance' premise.

What do you think? Do I need to apply another algorithm to my problem, or is it this algorithm that needs improvement?

cesarsouza commented 7 years ago

Hi @jagbarcelo,

Sorry for the delay in answering, and possibly sorry for the not-very-helpful answer that follows: to be honest, I am not sure whether the algorithm is completely suitable for the problem you are facing. It is true, however, that there must be an issue with the implementation, as the results were supposed to be reproducible once the random seed has been set. I am a bit short on time right now to give this issue proper attention (upcoming conference deadline), but I will most surely have to revisit it at some point.

If you want, you can try to set up a test case using another implementation of balanced k-means in either R or MATLAB (which you can run from Octave if you do not have a MATLAB license). At least we would then know whether balanced k-means through the Munkres (Hungarian) matching algorithm works for your case.

Best regards, Cesar

jagbarcelo commented 7 years ago

Hi @cesarsouza,

I finally installed Octave and I've been doing some tests with it.

I tried first with Octave's implementation of kmeans, just to get a working example with graphical output of the results, so that it would be easier to see and understand each solution. I prepared two different sets of testing points: one with the 18 points already shown earlier in this thread, and another with 120 points.

Since Octave's implementation does not do any balancing of the clusters, I had to do that part myself. I do not understand the details of the Hungarian/Munkres method, so I decided to take a simple k-means implementation and add some balancing restrictions to it. It would not be as optimised as the Hungarian method, but it works well enough for testing the results.
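
To illustrate the kind of balancing restriction I mean, here is a minimal sketch of the general idea in C# (not Aksehirli's actual Octave code): each cluster gets a fixed capacity of n / k points, and each point is assigned greedily to the nearest centroid that still has room.

// Greedy capacity-constrained assignment: visit each point and give it
// to the nearest centroid that still has room (capacity = n / k).
// This replaces the unconstrained nearest-centroid step of plain k-means.
static int[] AssignBalanced(double[][] points, double[][] centroids)
{
    int n = points.Length, k = centroids.Length;
    int capacity = n / k;            // assumes n is divisible by k
    int[] labels = new int[n];
    int[] counts = new int[k];

    for (int i = 0; i < n; i++)
    {
        int best = -1;
        double bestDist = double.MaxValue;
        for (int c = 0; c < k; c++)
        {
            if (counts[c] >= capacity)
                continue;            // this cluster is already full
            double d = 0;
            for (int j = 0; j < points[i].Length; j++)
            {
                double diff = points[i][j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        labels[i] = best;
        counts[best]++;
    }
    return labels;
}

The greedy visit order makes the result order-dependent, which is one reason such a scheme is less optimal than a matching-based approach like Munkres.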

I took Emin Aksehirli's k-means implementation and modified it a bit, following this note from the MATLAB documentation for kmeans:

Like many other types of numerical minimizations, the solution that kmeans reaches often depends on the starting points. It is possible for kmeans to reach a local minimum, where reassigning any one point to a new cluster would increase the total sum of point-to-centroid distances, but where a better solution does exist. However, you can use the 'Replicates' name-value pair argument to overcome that problem.

I liked that approach, so I modified Aksehirli's implementation to add this 'replicates' option.
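
The replicates idea itself is simple: run the whole clustering several times from different random starts and keep the best-scoring run. A minimal sketch in C#, where RunOnce and Score are hypothetical stand-ins for one complete k-means run and for the chosen objective:

// Run the clustering several times from different random starts and
// keep the best result, to reduce the chance of a bad local minimum.
static int[] BestOfReplicates(double[][] points, int k, int replicates)
{
    var rng = new Random(0);         // fixed seed for reproducibility
    int[] bestLabels = null;
    double bestScore = double.MaxValue;

    for (int r = 0; r < replicates; r++)
    {
        // RunOnce is a hypothetical stand-in for a single complete
        // k-means run: pick random initial centroids from rng, then
        // alternate assignment and centroid update until convergence
        // or the iteration limit.
        int[] labels = RunOnce(points, k, rng);

        // Score is also a stand-in; the TotalDistance/Variance sketch
        // further below shows one concrete choice of objective.
        double score = Score(points, labels, k);
        if (score < bestScore)
        {
            bestScore = score;
            bestLabels = labels;
        }
    }
    return bestLabels;
}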

Since in this case we are dealing with geometrically uniformly distributed points, we want the results to be as symmetrical as possible. Ideally, in the best solution, the variance should be 0, meaning that all clusters have the same sum of distances from their elements to their centroid.

I don't know if you could (or would even want to) add such behaviour to the framework: I mean adding an option so that the objective function to minimise changes from the default distance-only criterion to a more complex distance-and-variance minimisation.
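
To make that criterion concrete, here is my reading of the 'total sum of distances' and 'variance' reported in the output below, sketched in C# (the exact formulas in the modified script may differ, e.g. in the variance normalisation):

// Per-cluster sums of point-to-centroid distances: the objective is the
// total of these sums, and the balance measure is their variance across
// clusters (0 when every cluster is equally "tight").
static double TotalDistance(double[][] points, int[] labels, int k)
{
    double total = 0;
    foreach (double s in PerClusterSums(points, labels, k))
        total += s;
    return total;
}

static double Variance(double[][] points, int[] labels, int k)
{
    double[] sums = PerClusterSums(points, labels, k);
    double mean = 0;
    foreach (double s in sums) mean += s;
    mean /= k;
    double var = 0;
    foreach (double s in sums) var += (s - mean) * (s - mean);
    return var / k;
}

static double[] PerClusterSums(double[][] points, int[] labels, int k)
{
    // Compute centroids from the labelling (assumes every cluster is
    // non-empty), then sum Euclidean distances of each point to its
    // own centroid.
    int dim = points[0].Length;
    double[][] centroids = new double[k][];
    int[] counts = new int[k];
    for (int c = 0; c < k; c++) centroids[c] = new double[dim];
    for (int i = 0; i < points.Length; i++)
    {
        counts[labels[i]]++;
        for (int j = 0; j < dim; j++)
            centroids[labels[i]][j] += points[i][j];
    }
    for (int c = 0; c < k; c++)
        for (int j = 0; j < dim; j++)
            centroids[c][j] /= counts[c];

    double[] sums = new double[k];
    for (int i = 0; i < points.Length; i++)
    {
        double d = 0;
        for (int j = 0; j < dim; j++)
        {
            double diff = points[i][j] - centroids[labels[i]][j];
            d += diff * diff;
        }
        sums[labels[i]] += Math.Sqrt(d);
    }
    return sums;
}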

This is the output of a call with 120 points, to be clustered into 6 partitions, with at most 25 iterations, replicated 15 times:

[idx, centers] = balanced_aksehirli_kmeans (data, 6, 0, 25, 15);

Replicate 1, 6 iterations, total sum of distances = 121.589967, variance = 946.835953
Replicate 2, 14 iterations, total sum of distances = 45.858687, variance = 9.430490
Replicate 3, 12 iterations, total sum of distances = 61.597883, variance = 102.573953
Replicate 4, 9 iterations, total sum of distances = 63.829736, variance = 109.040053
Replicate 5, 13 iterations, total sum of distances = 61.597883, variance = 130.455627
Replicate 6, 6 iterations, total sum of distances = 110.556346, variance = 650.493658
Replicate 7, 6 iterations, total sum of distances = 41.025787, variance = 18.144761
Replicate 8, 21 iterations, total sum of distances = 63.829736, variance = 139.241999
Replicate 9, 8 iterations, total sum of distances = 65.078402, variance = 49.922938
Replicate 10, 25 iterations, total sum of distances = 45.858687, variance = 9.430490
Replicate 11, 24 iterations, total sum of distances = 55.533428, variance = 0.258848
Replicate 12, 7 iterations, total sum of distances = 40.308748, variance = 39.112548
Replicate 13, 17 iterations, total sum of distances = 63.829736, variance = 139.241999
Replicate 14, 14 iterations, total sum of distances = 55.533428, variance = 0.258848
Replicate 15, 9 iterations, total sum of distances = 40.308748, variance = 39.112548
Best total sum of distances = 55.533428, variance = 0.258848

[image: octave 120points 6clusters]

The method converges quite quickly, so I set the limit to only 25 iterations; but since there can be a lot of local minima, I set a rather high replicate count so that several random initial distributions would get a chance to find a better overall solution. After those 15 replicates, none of them reached the desired optimum with a variance of 0, but we came close.

If we keep the original distance-only minimisation (instead of distance-and-variance), we can find solutions such as this one:

Replicate 1, 24 iterations, total sum of distances = 55.533428, variance = 0.258848
Replicate 2, 25 iterations, total sum of distances = 34.610588, variance = 108.850479
Replicate 3, 24 iterations, total sum of distances = 63.829736, variance = 139.241999
Replicate 4, 13 iterations, total sum of distances = 40.308748, variance = 39.112548
Replicate 5, 9 iterations, total sum of distances = 108.544030, variance = 569.605472
Replicate 6, 11 iterations, total sum of distances = 37.614401, variance = 30.467535
Replicate 7, 8 iterations, total sum of distances = 55.533428, variance = 0.258848
Replicate 8, 10 iterations, total sum of distances = 41.040000, variance = 35.009908
Replicate 9, 25 iterations, total sum of distances = 34.610588, variance = 423.001851
Replicate 10, 10 iterations, total sum of distances = 61.597883, variance = 102.573953
Replicate 11, 21 iterations, total sum of distances = 37.614401, variance = 30.467535
Replicate 12, 24 iterations, total sum of distances = 45.858687, variance = 9.430490
Replicate 13, 9 iterations, total sum of distances = 61.597883, variance = 70.356508
Replicate 14, 6 iterations, total sum of distances = 41.040000, variance = 35.009908
Replicate 15, 8 iterations, total sum of distances = 61.597883, variance = 130.455627
Best total sum of distances = 34.610588, variance = 108.850479

[image: octave 120points 6clusters bad]

Please note that it has a better total distance (34.6, as opposed to 55.5 in the former solution) but a rather high variance (108.8) instead of being near zero (0.259 in the former example). The reason for that high variance is the few elements that are disconnected from their clusters.

I hope you can find time to have a look at this and maybe implement some part of it in the framework.

Here are the sample files: balanced_aksehirli_kmeans.zip

Thanks.

jagbarcelo commented 7 years ago

I found a little flaw in my interpretation of some of Aksehirli's original code. There was a variable called minDist that I used globally, when in fact it is only updated locally within the innermost 'for' loop.

Thus, it was incorrect to use it at the end of the process to print the 'total sum of distances' in the results.

Please use this updated balanced_aksehirli_kmeans-2017-03-03.zip instead. The graphical results are still the same, and the distant points still appear when optimising distance only (rather than distance and variance). The only thing fixed is the textual summary printed for each replicate.