Return error from clustering

wraseman commented 5 years ago

@joshhjacobson, as you know, the k-means clustering algorithm converges to a local minimum that depends on its randomly generated starting points. Since it does not guarantee a global minimum, the user has to compare multiple clustering runs to select the best clustering configuration. Because of this, we need access to the error that kmeans() reports for each centroid, so we can tell the user the error associated with each clustering run. These errors would have to be aggregated across centroids to give the user a single metric.
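As a rough sketch of what comparing runs could look like in code, the helper below keeps whichever run has the lowest aggregate error. Both `bestOfRuns` and the `runOnce` callback are hypothetical names used only for illustration, and the `totalError` field is an assumed shape, not something the clustering library returns.

```javascript
// Hypothetical helper: runOnce(i) performs one clustering run and returns
// an object with a numeric `totalError` field (assumed shape).
function bestOfRuns(runOnce, nRuns) {
  let best = null;
  for (let i = 0; i < nRuns; i++) {
    const result = runOnce(i);
    // Keep the run with the smallest aggregate error seen so far.
    if (best === null || result.totalError < best.totalError) {
      best = result;
    }
  }
  return best;
}

// Usage with stubbed runs (made-up error values, no real clustering).
const stubErrors = [3.2, 0.625, 1.1];
const best = bestOfRuns((i) => ({ run: i, totalError: stubErrors[i] }), 3);
console.log(best.run, best.totalError);
```

In the web application the same comparison would be done by the user, who reruns the clustering and compares the displayed value.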

This video does a nice job expressing what we need to do. Instead of having multiple panels, however, the user would just have to remember what the lowest value was.

I'm not sure what the best way to return this information would be. Do you have any ideas on that?

Specifically, I would like to display the aggregate error across clusters as text on the web application like this (the XXX represents where the error would be displayed):

wraseman commented 5 years ago

Correction: instead of the error, can we return the within-cluster sum of squares, aggregated across the clusters?

joshhjacobson commented 5 years ago

Here's what a basic call to the clustering library looks like (note that we do not specify an initialization, but we do provide the user the ability to do so):

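// Sample inputs (illustrative values, chosen to be consistent with the output below).
let data = [[1, 1, 1], [1, 2, 1], [-1, -1, -1], [-1, -1, -1.5]];
let centers = [[1, 2, 1], [-1, -1, -1]];

// kmeans is the clustering library's function, assumed to be imported already.
let result = kmeans(data, 2, { initialization: centers });
console.log(result);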
/*
KMeansResult {
  clusters: [ 0, 0, 1, 1 ],
  centroids: 
   [ { centroid: [ 1, 1.5, 1 ], error: 0.25, size: 2 },
     { centroid: [ -1, -1, -1.25 ], error: 0.0625, size: 2 } ],
  converged: true,
  iterations: 1
}
*/

As you can see, we have the error term for each centroid. I'm not sure I understand exactly what you want, but can we use this information to compute it?
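One possible way to get a single number from this output is sketched below. It assumes that each centroid's `error` is the mean squared distance of its members, so that `error * size` gives that cluster's sum of squares. That interpretation is an assumption and should be checked against the library. `totalWithinClusterSS` is a hypothetical helper name.

```javascript
// Assumption: `error` is the mean squared distance within a cluster,
// so error * size is that cluster's sum of squared distances.
function totalWithinClusterSS(centroids) {
  return centroids.reduce((sum, c) => sum + c.error * c.size, 0);
}

// Centroids copied from the output shown above.
const centroids = [
  { centroid: [1, 1.5, 1], error: 0.25, size: 2 },
  { centroid: [-1, -1, -1.25], error: 0.0625, size: 2 },
];

const aggregate = totalWithinClusterSS(centroids);
console.log(aggregate); // 0.625
```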

wraseman commented 5 years ago

Yes, I think we can. To calculate the within-cluster sum of squares, we need to do the following (source: stats.stackexchange.com):
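The within-cluster sum of squares is commonly written as follows. The notation is chosen here for illustration: $K$ is the number of clusters, $C_k$ is the set of observations assigned to cluster $k$, and $\mu_k$ is that cluster's centroid.

```latex
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```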

We also need to have the observations and to which cluster they belong so we can do the calculation. Does that seem feasible?
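A minimal sketch of that calculation, working directly from the observations, their cluster assignments, and the centroid coordinates, might look like this. `withinClusterSS` is a hypothetical helper, and the sample data is made up to match the shape of the clustering output.

```javascript
// Sum of squared Euclidean distances from each observation to the
// centroid of the cluster it is assigned to.
function withinClusterSS(data, clusters, centroids) {
  const perCluster = new Array(centroids.length).fill(0);
  data.forEach((point, i) => {
    const k = clusters[i];
    const center = centroids[k];
    let squaredDistance = 0;
    for (let j = 0; j < point.length; j++) {
      const diff = point[j] - center[j];
      squaredDistance += diff * diff;
    }
    perCluster[k] += squaredDistance;
  });
  const total = perCluster.reduce((a, b) => a + b, 0);
  return { perCluster, total };
}

// Illustrative inputs (hypothetical values).
const observations = [[1, 1, 1], [1, 2, 1], [-1, -1, -1], [-1, -1, -1.5]];
const assignments = [0, 0, 1, 1];
const centroidCoords = [[1, 1.5, 1], [-1, -1, -1.25]];

const wss = withinClusterSS(observations, assignments, centroidCoords);
console.log(wss.perCluster, wss.total);
```

Returning both the per-cluster values and the total lets the application display a single metric while keeping the breakdown available.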

joshhjacobson commented 5 years ago

@wraseman Based on our last meeting, this is no longer something we are planning to implement. Is that right?

wraseman commented 5 years ago

True, I'll close the issue.

ParasolJS / parasol-es

Return error from clustering #18