JuliaNeighbors / VPTrees.jl

Vantage Point Tree in Julia
Apache License 2.0
13 stars 3 forks source link

Doc's example fails #16

Open vargonis opened 3 years ago

vargonis commented 3 years ago

From the docs, one could do:

using VPTrees
using StringDistances

data = ["bla", "blub", "asdf", ":assd", "ast", "baube"]
vptree = VPTree(data, Levenshtein()))
query = "blau"
n_neighbors = 3
data[find_nearest(vptree, query, n_neighbors)]

However, that fails with MethodError: no method matching iterate(::DataStructures.BinaryMaxHeap{Tuple{Int64, Int64}}).

altre commented 3 years ago

Thanks for the issue! I can't reproduce this. I'm guessing this is a problem with the DataStructures version compatibility restriction. Could you share which version you have installed?

vargonis commented 3 years ago

Sorry about this, I should have saved/posted the environment in order to reproduce later. With my current environment it works. But it does seem like a missing lower/upper bound compat problem with DataStructures.

By the way, I was trying to build some large trees but that took too long. I ended up breaking them into smaller pieces, but in the process I modified the construction algorithm to use threads to parallelize distance computations, as opposed to spawning for children branches. I got a speed-up of 2-3x, so you might consider switching to that approach. I apologize for not offering myself to make a pull request in case you'd be interested; the problem is that I had to do that under time pressure and for the sake of simplicity I just got rid of all the logic that deals with the situation where there are no threads available.

altre commented 3 years ago

I'll keep this issue open for now and try to fix the version problem when I get around to it.

Thanks for the idea for a performance improvement. I'm guessing it depends on the expense of the distance metric used. Perhaps you could share something I could use as a benchmark?

vargonis commented 3 years ago

Let me ask if I can share (it's a client's dataset). The metric is indeed expensive, custom distance for dataframe rows with some text columns, maybe similar to TokenMax in StringDistances.

vargonis commented 3 years ago

Sorry for the delay. So, I asked and for this particular dataset we are under NDA. However, I think it will not make much of a difference to just use a random string dataset and TokenMax.