DrCoffey / DeepSqueak

DeepSqueak v3: Using Machine Vision to Accelerate Bioacoustics Research
BSD 3-Clause "New" or "Revised" License
373 stars 89 forks source link

FixClustAssignBug #139

Closed gcalongi closed 3 years ago

gcalongi commented 3 years ago

Fixed bug that was assigning to calls to the wrong cluster in clustAssign. The centroid that is closest using squared Euclidean distance is not necessarily (and sometimes is frequently not) the cluster to which a call was originally assigned. This was also occasionally causing a crash in Update Clusters (when the number of unique clusters =/= the actual number of clusters).

DrCoffey commented 3 years ago

Hey @gcalongi. Great catch, there was something very silly going on here, but I think the solution was much simpler than you thought.

We are using "knnsearch" to produce cluster assignment instead of the direct output of "kmeans". This is essential because we want to be able to save a k-means model (the centroids and cluster names) and apply them to new data, post-hoc and eventually in real-time.

Unfortunately we were inadvertently using different distance metrics for each method (that is why you were getting different assignments)

[clustAssign,D] = knnsearch(C,data,'Distance','seuclidean'); is not actually using Squared Euclidean Distance, but Standardized Euclidean Distance which is calculated slightly differently. I believe all you had to do change that one line to: [clustAssign,D] = knnsearch(C,data,'Distance','euclidean'); and you would have gotten the same results, since Euclidean and Squared Euclidean produce the same output, but squared euclidean in computationally faster.

I did some testing with your version and my version (with the two different calculations) the pics are attached. File names describe the method, blue lines are nearest centroids, red lines are the calls (first 36 in the example mouse file; third centroid is different with the current standardized euclidean method).

Let me know if this makes sense, or if you think something else is going on here, because otherwise I am going to fix the 1 line and move on to the session save issue.

-Kevin Kmeans-SquaredEclideanDistance-GABI Knnsearch-EclideanDistance-master Knnsearch-StandardizedEclideanDistance-master

gcalongi commented 3 years ago

Yes that seems to fix the issue on my end as well and yes of course I see why you want to keep knnsearch rather than depending on the kmeans output. So sounds good to me!

DrCoffey commented 3 years ago

Great, and thanks again for finding this! I would have never noticed that "seuclidean" and "sqeuclidean" were different :)

gcalongi commented 3 years ago

Yeah I truly feel like someone at Matlab is messing with us lol