astrolabsoftware / spark3D

Spark extension for processing large-scale 3D data sets: Astrophysics, High Energy Physics, Meteorology, …
https://astrolabsoftware.github.io/spark3D/
Apache License 2.0

Failing KNN test #79

Closed JulienPeloton closed 6 years ago

JulienPeloton commented 6 years ago

Looking at Travis, there is something weird. Starting from commit cfc7a5f, the build sometimes fails and sometimes succeeds. Since cfc7a5f I have only made documentation commits (no code change), so I am wondering where this behaviour comes from. It is the same on my laptop: sometimes it fails, sometimes it succeeds.

Looking at the failing test (SpatialQueryTest.scala:Can you find the K nearest neighbours correctly?), it seems there is a problem with the way unique elements are returned...

Return unique elements

scala> // Run many times the same query
scala> for (i <- 0 to 10) {
     |   val knn = SpatialQuery.KNN(queryObject, sphereRDD_part, 3, true)
     |   println(knn.map(x => x.center.getCoordinate))
     | }
List(List(2.0, 2.0, 2.0), List(1.0, 1.0, 3.0), List(1.0, 1.0, 1.0))
List(List(2.0, 2.0, 2.0), List(1.0, 1.0, 3.0), List(3.0, 2.0, 1.0))
List(List(2.0, 2.0, 2.0), List(1.0, 1.0, 3.0), List(1.0, 1.0, 1.0))
List(List(2.0, 2.0, 2.0), List(1.0, 1.0, 3.0), List(1.0, 3.0, 0.7))
List(List(2.0, 2.0, 2.0), List(1.0, 1.0, 3.0), List(1.0, 3.0, 0.7))
List(List(2.0, 2.0, 2.0), List(1.0, 3.0, 0.7), List(3.0, 2.0, 1.0))
List(List(2.0, 2.0, 2.0), List(1.0, 1.0, 3.0), List(1.0, 3.0, 0.7))
List(List(2.0, 2.0, 2.0), List(1.0, 1.0, 3.0), List(1.0, 3.0, 0.7))
List(List(2.0, 2.0, 2.0), List(1.0, 1.0, 3.0), List(1.0, 3.0, 0.7))
List(List(2.0, 2.0, 2.0), List(1.0, 3.0, 0.7), List(3.0, 2.0, 1.0))
List(List(2.0, 2.0, 2.0), List(1.0, 1.0, 3.0), List(1.0, 3.0, 0.7))

The 2nd and 3rd elements are not always the same (and it is not just a matter of ordering)! That is why the test sometimes fails and sometimes passes. This looks like a bug to fix... @mayurdb any ideas?
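
To make the symptom concrete, here is a minimal, self-contained sketch (plain Scala with made-up points and helpers, not the actual spark3D code) of one way such run-to-run differences can appear: if the nearest neighbours are collected into a structure keyed by distance, two distinct points at the same distance overwrite each other, and the survivor depends on the order in which candidates arrive, which Spark does not guarantee across partitions. Whether SpatialQuery.KNN does something equivalent still needs to be checked in the code.

import scala.collection.immutable.TreeMap

object KnnTieSketch {
  type Point = (Double, Double, Double)

  def dist(a: Point, b: Point): Double =
    math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2) + math.pow(a._3 - b._3, 2))

  // Keep the k smallest distances; a distance collision silently replaces
  // the point stored earlier under the same key.
  def topKByDistance(query: Point, candidates: Seq[Point], k: Int): Seq[Point] = {
    val byDist = candidates.foldLeft(TreeMap.empty[Double, Point]) { (acc, p) =>
      acc + (dist(query, p) -> p)
    }
    byDist.take(k).values.toSeq
  }

  def main(args: Array[String]): Unit = {
    val query: Point = (0.0, 0.0, 0.0)
    // (0,3,0) and (3,0,0) are distinct but equidistant from the query.
    val candidates = Seq[Point]((1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (3.0, 0.0, 0.0), (0.0, 0.0, 2.0))

    // Same candidates, two arrival orders, two different 3rd neighbours.
    println(topKByDistance(query, candidates, 3))
    println(topKByDistance(query, candidates.reverse, 3))
  }
}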

mayurdb commented 6 years ago

Yeah, the test case does look flaky. I'll take a look!

mayurdb commented 6 years ago

While debugging this, we found another issue caused by the duplicates: they can lead to fewer than k elements being returned, even though the RDD has more than k elements.
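
Here is a small, self-contained illustration of that situation (hypothetical data and helpers in plain Scala, not the actual spark3D code): each partition returns its k local nearest neighbours, but when a partition is dominated by copies of the same object, the union of the local results can contain fewer than k distinct points even though the full data set holds more than k.

object DuplicateKnnSketch {
  type Point = (Double, Double, Double)

  def dist(a: Point, b: Point): Double =
    math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2) + math.pow(a._3 - b._3, 2))

  // k nearest neighbours of `query` inside a single partition
  def localKnn(query: Point, partition: Seq[Point], k: Int): Seq[Point] =
    partition.sortBy(dist(query, _)).take(k)

  def main(args: Array[String]): Unit = {
    val query: Point = (0.0, 0.0, 0.0)
    val k = 3

    // Two partitions; each holds repeated copies of an object close to the
    // query, plus one farther (distinct) object.
    val partitions = Seq(
      Seq[Point]((1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)),
      Seq[Point]((2.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 0.0, 0.0), (6.0, 0.0, 0.0))
    )

    // Local k nearest per partition, then the k nearest distinct points of the union.
    val candidates = partitions.flatMap(localKnn(query, _, k))
    val result = candidates.distinct.sortBy(dist(query, _)).take(k)

    // Only 2 neighbours come back although the data set has 4 distinct
    // points: (5,0,0) and (6,0,0) were crowded out by the duplicates.
    println(result)
  }
}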

Resolving this would require us to maintain a single priority queue across all partitions, which would destroy the parallelism and, at the same time, result in a big list being shuffled across the network.
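
For reference, a rough sketch of that single-queue idea (again hypothetical plain Scala, ignoring the Spark plumbing): every candidate from every partition is funnelled into one bounded max-heap that keeps the k closest distinct points, which handles duplicates correctly but forces the whole candidate stream through a single place.

import scala.collection.mutable

object SingleQueueKnnSketch {
  type Point = (Double, Double, Double)

  def dist(a: Point, b: Point): Double =
    math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2) + math.pow(a._3 - b._3, 2))

  // Keep the k closest distinct points seen so far in one bounded max-heap.
  def knnDistinct(query: Point, candidates: Iterator[Point], k: Int): List[Point] = {
    val farthestFirst = Ordering.by((p: Point) => dist(query, p))
    val heap = mutable.PriorityQueue.empty[Point](farthestFirst)
    val seen = mutable.Set.empty[Point]        // exact duplicates are skipped
    candidates.foreach { p =>
      if (seen.add(p)) {
        heap.enqueue(p)
        if (heap.size > k) heap.dequeue()      // evict the farthest kept point
      }
    }
    heap.toList.sortBy(p => dist(query, p))    // closest first
  }

  def main(args: Array[String]): Unit = {
    val query: Point = (0.0, 0.0, 0.0)
    // Candidates as they would arrive from all partitions, duplicates included.
    val candidates = Iterator[Point](
      (1.0, 1.0, 1.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (1.0, 1.0, 3.0), (3.0, 2.0, 1.0))
    println(knnDistinct(query, candidates, 3))
  }
}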

mayurdb commented 6 years ago

I have created #80 with a fix for the inconsistency in the results.