Closed gcalongi closed 3 years ago
Hey @gcalongi. Great catch, there was something very silly going on here, but I think the solution was much simpler than you thought.
We are using "knnsearch" to produce cluster assignment instead of the direct output of "kmeans". This is essential because we want to be able to save a k-means model (the centroids and cluster names) and apply them to new data, post-hoc and eventually in real-time.
Unfortunately we were inadvertently using different distance metrics for each method (that is why you were getting different assignments)
[clustAssign,D] = knnsearch(C,data,'Distance','seuclidean'); is not actually using Squared Euclidean Distance, but Standardized Euclidean Distance which is calculated slightly differently. I believe all you had to do change that one line to: [clustAssign,D] = knnsearch(C,data,'Distance','euclidean'); and you would have gotten the same results, since Euclidean and Squared Euclidean produce the same output, but squared euclidean in computationally faster.
I did some testing with your version and my version (with the two different calculations) the pics are attached. File names describe the method, blue lines are nearest centroids, red lines are the calls (first 36 in the example mouse file; third centroid is different with the current standardized euclidean method).
Let me know if this makes sense, or if you think something else is going on here, because otherwise I am going to fix the 1 line and move on to the session save issue.
-Kevin
Yes that seems to fix the issue on my end as well and yes of course I see why you want to keep knnsearch rather than depending on the kmeans output. So sounds good to me!
Great, and thanks again for finding this! I would have never noticed that "seuclidean" and "sqeuclidean" were different :)
Yeah I truly feel like someone at Matlab is messing with us lol
Fixed bug that was assigning to calls to the wrong cluster in clustAssign. The centroid that is closest using squared Euclidean distance is not necessarily (and sometimes is frequently not) the cluster to which a call was originally assigned. This was also occasionally causing a crash in Update Clusters (when the number of unique clusters =/= the actual number of clusters).