need a way to measure distance

rupamroy commented 4 years ago

Issue description

Let's say I have added a few sentences and their corresponding encodings to the simpleneighbors index. Let's say the sentences are ['the cat is a nice tamable animal', 'cats are domestic but often can get wild', 'dogs are very friendly', 'dogs are considered man best friend']

Now let's say we encode the question "describe cats" and use the index.nearest function to find the nearest sentences. It returns ['the cat is a nice tamable animal', 'cats are domestic but often can get wild'], which is great and expected.

But if I encode the question "what is an aircraft", the index will still return some of the sentences from the corpus.

Expected

If the query is much farther from every document in the corpus than the documents are from each other, the index should not return nearest neighbors. This behavior could be flagged, or a threshold could be taken as input as the maximum distance allowed.

rupamroy commented 4 years ago

@aparrish any thoughts on this one?

aparrish commented 4 years ago

The API has an existing method for measuring distance between two items in the corpus. You could use this method as a separate second filter on the nearest neighbor results, or use it in the .nearest_matching() or .neighbors_matching() functions, passing in a filtering function that checks the distance between each match and a target item in the corpus, something like:

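from simpleneighbors import SimpleNeighbors

nn = SimpleNeighbors(dims)  # dims: length of each vector in your_data
nn.feed(your_data)          # iterable of (item, vector) pairs
nn.build()
# keep only neighbors of 'item1' that lie within distance 0.5 of it
print(list(nn.neighbors_matching('item1', lambda x: nn.dist('item1', x) < 0.5, n=10)))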

(Haven't tested this code, but hopefully you get the idea. The benefit of the *_matching() methods over simply checking the results in a for loop is that they'll keep on looking through the nearest neighbors until at least n matching items have been found.)

In general, I'm hesitant to add use-case specific checks and parameters to the API—I made the *_matching() methods in order to obviate such changes!

rupamroy commented 4 years ago

@aparrish Thanks, I appreciate the help here. Will try that lambda.
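
For the original case, where the query ("what is an aircraft") is a freshly encoded vector rather than an item already in the corpus, the same thresholding idea can be sketched in plain Python. This is only an illustration under stated assumptions: angular_distance, within_threshold, the toy 2-d vectors, and the 0.5 cutoff are all made up for the example and are not part of simpleneighbors. The distance formula sqrt(2 * (1 - cos)) is the one Annoy documents for its angular metric, and the vectors are assumed to be non-zero.

```python
import math

def angular_distance(u, v):
    """Angular distance as defined by Annoy: sqrt(2 * (1 - cos(u, v)))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)  # assumes neither vector is all zeros
    # Clamp at zero to guard against tiny floating-point overshoot.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))

def within_threshold(query_vec, candidates, vec_of, max_dist):
    """Keep only candidates whose vector lies within max_dist of query_vec."""
    return [c for c in candidates
            if angular_distance(query_vec, vec_of(c)) <= max_dist]

# Toy 2-d "encodings" standing in for real sentence embeddings (made up).
toy_vectors = {
    'the cat is a nice tamable animal': [1.0, 0.1],
    'dogs are very friendly': [0.1, 1.0],
}

on_topic = within_threshold([1.0, 0.0], list(toy_vectors),
                            toy_vectors.get, max_dist=0.5)
off_topic = within_threshold([-1.0, -1.0], list(toy_vectors),
                             toy_vectors.get, max_dist=0.5)
print(on_topic)   # only the cat sentence survives the cutoff
print(off_topic)  # nothing is close enough, so the list is empty
```

With a real index, the candidates would presumably come from index.nearest(query_vec, n=10) and the lookup from index.vec, but that wiring is untested here. A sensible cutoff depends on the encoder and the metric, so it is worth printing the distances for a few known on-topic and off-topic queries before picking one.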

aparrish / simpleneighbors

need a way to measure distance #3
