Closed (rupamroy closed this issue 4 years ago)
@aparrish any thoughts on this one?
The API has an existing method, .dist(), for measuring the distance between two items in the corpus. You could use this method as a separate second filter on the nearest neighbor results, or use it with the .nearest_matching() or .neighbors_matching() functions by passing in a filtering function that checks the distance between each match and a target item in the corpus, something like:
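from simpleneighbors import SimpleNeighbors

nn = SimpleNeighbors(dims)  # dims: dimensionality of your vectors
nn.feed(your_data)  # sequence of (item, vector) pairs
nn.build()
# keep only neighbors of 'item1' that are closer than 0.5
print(list(nn.neighbors_matching('item1', n=10, check=lambda x: nn.dist('item1', x) < 0.5)))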
(Haven't tested this code, but hopefully you get the idea. The benefit of the *_matching() methods over simply checking the results in a for loop is that they'll keep on looking through the nearest neighbors until at least n matching items have been found.)
In general, I'm hesitant to add use-case-specific checks and parameters to the API. I made the *_matching() methods in order to obviate such changes!
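To make the difference between the two approaches concrete, here is a rough, library-free sketch. The helper names filter_top_n and first_n_matching and the toy ranking are made up for illustration; this is not how simpleneighbors is implemented, only the general idea of "filter after the fact" versus "keep looking until n items match".

```python
def filter_top_n(ranked, n, check):
    """Take the n nearest items first, then drop the ones that fail check."""
    return [item for item in ranked[:n] if check(item)]


def first_n_matching(ranked, n, check):
    """Walk the ranking nearest-first and stop once n items pass check."""
    found = []
    for item in ranked:
        if check(item):
            found.append(item)
            if len(found) == n:
                break
    return found


# Hypothetical ranking, already sorted nearest-first.
ranked = ["cat", "cot", "cub", "car", "dog", "cap"]


def has_a(word):
    """Toy filter standing in for a distance check."""
    return "a" in word


post_filtered = filter_top_n(ranked, 3, has_a)
kept_looking = first_n_matching(ranked, 3, has_a)
print(post_filtered)
print(kept_looking)
```

With the post-filter you can end up with fewer than n results even though more matching items exist further down the ranking, which is the case the *_matching() methods are meant to cover.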
@aparrish Thanks, I appreciate the help here. Will try that lambda.
Issue description
Let's say I have added a few sentences and their corresponding encodings to the simpleneighbors index. Let's say the sentences are ['the cat is a nice tamable animal', 'cats are domestic but often can get wild', 'dogs are very friendly', "dogs are considered man's best friend"].
Now let's say we encode the question "describe cats" and use the index.nearest function to find the nearest sentences. It returns ['the cat is a nice tamable animal', 'cats are domestic but often can get wild'], which is great and expected. But if I encode the question "what is an aircraft", the index will still return some of the sentences from the corpus.
Expected
If the distance is really large compared to the documents in the corpus, it should not return nearest neighbors. This behavior could be flagged, or a threshold could be taken as input as the maximum distance allowed.
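A minimal sketch of the requested behavior, written without simpleneighbors so that it runs on its own. Everything here is an assumption for illustration: the helper names angular_distance and nearest_within, the 0.5 threshold, and the 2-D vectors, which are made-up stand-ins for real sentence encodings. The distance is the Euclidean distance between unit-normalized vectors, sqrt(2 - 2*cos), which is one common definition of "angular" distance; check which metric your index actually uses before picking a threshold.

```python
import math


def angular_distance(a, b):
    """Euclidean distance between the unit-normalized vectors: sqrt(2 - 2*cos)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos = dot / (norm_a * norm_b)
    # Clamp to guard against tiny floating-point overshoot.
    cos = max(-1.0, min(1.0, cos))
    return math.sqrt(2.0 - 2.0 * cos)


def nearest_within(corpus, query_vec, n=2, max_dist=0.5):
    """Return up to n (item, distance) pairs no farther than max_dist, closest first."""
    scored = [(item, angular_distance(vec, query_vec)) for item, vec in corpus]
    scored.sort(key=lambda pair: pair[1])
    return [(item, dist) for item, dist in scored[:n] if dist <= max_dist]


# Toy 2-D "encodings", invented for this example only.
corpus = [
    ("the cat is a nice tamable animal", [1.0, 0.1]),
    ("cats are domestic but often can get wild", [0.9, 0.2]),
    ("dogs are very friendly", [0.1, 1.0]),
    ("dogs are considered man's best friend", [0.2, 0.9]),
]

cat_query = [1.0, 0.15]         # stands in for an encoded "describe cats"
aircraft_query = [-1.0, -1.0]   # stands in for an encoded "what is an aircraft"

cat_hits = nearest_within(corpus, cat_query)
aircraft_hits = nearest_within(corpus, aircraft_query)
print([item for item, _ in cat_hits])
print(aircraft_hits)
```

In this toy setup the cat query keeps both cat sentences, while the aircraft query returns an empty list because nothing in the corpus falls within the threshold. A good threshold depends heavily on the encoder and the metric, so it would need tuning on real data.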