JuliaNeighbors / Neighborhood.jl

A common API for finding nearest neighbors in Julia
https://julianeighbors.github.io/Neighborhood.jl/dev/
MIT License

Common API design #1

Closed Datseris closed 4 years ago

Datseris commented 4 years ago

@JonasIsensee had a good plan with:

ds = produce_structure(MyAlgType, data; kwargs...) # Initialize structure
do_preparatory_work!(ds; kwargs....) # Build your tree, graph or whatever
search(ds, query; kwargs...) # Do your searches 

# Implement these, if your structure allows that.
insert!(ds, new_point) # is new_point an actual vector or just an index to somewhere?
delete!(ds, point) 

but then there were 2 noteworthy comments.

In my eyes the return should be the same for all packages, and it should be either indices AND distances, or points AND distances. The package must know the distances to find the neighbors, so why not return them anyway.
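
For concreteness, here is a minimal sketch of that return convention (a brute-force stand-in with made-up names, not any existing package's API):

using Distances

function toy_search(data, query, k::Int; metric = Euclidean())
    dists = [evaluate(metric, query, p) for p in data]  # the distances are computed anyway
    idxs = partialsortperm(dists, 1:k)                   # the k nearest points
    return idxs, dists[idxs]                             # indices AND distances
end

idxs, ds = toy_search([rand(3) for _ in 1:100], rand(3), 5)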

zgornel commented 4 years ago

insert! is a bit ambiguous as a signature. Normally, the point indices (integer ids) must be specified, i.e. insert!(ds, point_data, idx) (this is difficult as it implies updating all indices > idx); delete! implies the same form of updating, and these are expensive operations. push! and pop! (or pushfirst!, popfirst!) may be easier to implement in certain cases.
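
As a toy illustration of that re-indexing cost, with a plain Vector standing in for the search structure:

data = [rand(2) for _ in 1:10]
insert!(data, 3, rand(2))   # old indices 3:10 now live at 4:11, so any stored ids go stale
deleteat!(data, 3)          # the same shifting in the other direction
push!(data, rand(2))        # appended at the end; no existing index changes
pop!(data)                  # removed from the end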

The API recommendations should follow (in the order of importance):

Datseris commented 4 years ago

I like very much your suggestion of

For the third argument of search, dispatch can occur on argument type: for ::Int - number of neighbors, for ::AbstractFloat range search.

Very Julian indeed :)

I typically find that the data structure Vector{Tuple} is not so nice to use. More often than not I separate the vector into two vectors. It feels (at least for me) more natural that the return signature would be inds, ds, which are Vector{I}, Vector{<:AbstractFloat}. The I is the index type that the neighbors package implements, which does not necessarily need to be an integer.
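
Purely illustrative, with made-up indices and distances:

# two parallel vectors
idxs  = [4, 17, 23]          # Vector{I}, here with I == Int
dists = [0.12, 0.25, 0.31]   # Vector{Float64}

# versus a single Vector{Tuple}, which usually has to be unzipped before further use
pairs = [(4, 0.12), (17, 0.25), (23, 0.31)]
idxs, dists = first.(pairs), last.(pairs)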

zgornel commented 4 years ago

I agree; however, I am looking forward to https://github.com/JuliaLang/julia/issues/13942 as an elegant solution to this. I am not aware whether it is type stable.

JonasIsensee commented 4 years ago

I like very much your suggestion of

For the third argument of search, dispatch can occur on argument type: for ::Int - number of neighbors, for ::AbstractFloat range search.

Very Julian indeed :)

I have concerns about dispatching on Int vs. AbstractFloat. It would certainly work, but it would be very opaque for new users and not extensible. (Are there more than fixed-number and fixed-range searching schemes?) A slightly more verbose but more extensible approach could be to define

struct FixedSizeNeighborhood{T<:AbstractFloat}
    r::T
end

and dispatch on that.
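
A self-contained sketch of that dispatch pattern, with a brute-force loop standing in for a real search structure (NeighborNumber is a made-up counterpart for the fixed-k case):

using Distances

struct NeighborNumber
    k::Int
end

function search(data, query, spec::FixedSizeNeighborhood; metric = Euclidean())
    d = [evaluate(metric, query, p) for p in data]
    idxs = findall(di -> di <= spec.r, d)
    return idxs, d[idxs]
end

function search(data, query, spec::NeighborNumber; metric = Euclidean())
    d = [evaluate(metric, query, p) for p in data]
    idxs = partialsortperm(d, 1:spec.k)
    return idxs, d[idxs]
end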

zgornel commented 4 years ago

It would certainly work, but it would be very opaque for new users and not extensible.

I disagree, simply because virtually all search packages implement this approach - so it is a popular and maybe more natural choice. Assuming that there is a minimal degree of documentation associated with the search function, I see no point in changing what all developers have already deemed a de-facto default.

altre commented 4 years ago

One problem with dispatching on Int and AbstractFloat: the Levenshtein distance is a metric, but it returns integer values. Querying it with a float does not seem like a natural API and would confuse me as a library user.

I would prefer two different methods for range and nearest neighbor search.

The return values would be fine for the current implementation of VPTrees, however I could imagine there being a datastructure where it is not necessary to compute all distances to the returned values. How is it for example with range search in a KDTree? I suppose it depends on the application whether it can profit from precalculated distances.

zgornel commented 4 years ago

Separating range and nn search does make sense as well.

however I could imagine there being a datastructure where it is not necessary to compute all distances to the returned values

I do not quite understand this. Does it mean returning only indices without the distances (a contrived situation: fetching a full cluster of points based on some centroid distance)? In this case, distance values could be missing...

dillondaudert commented 4 years ago

I'm currently working on reimplementing NearestNeighborDescent.jl, and I thought it might be useful for this discussion to share my design sketch for the API.

There's a usage pattern that follows similarly to what's proposed here:

graph = nndescent(::Type{<:ApproximateKNNGraph}, data, n_neighbors, metric)
indices, distances = search(graph, data, queries, n_neighbors, metric)

However, there's a lot of other functionality that's implemented / planned that goes beyond just index construction and searching (some of that is described in these dev notes).

The subtypes of ApproximateKNNGraph are true AbstractGraphs, so they implement that interface. I also plan to add more ways to construct and refine knn graphs (e.g. the various merging algorithms proposed in this paper). This is all being put in a submodule of NNDescent called KNNGraphs, but I mention it here because there's the potential to pull it out into its own package if others find it useful.
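
For illustration, and assuming the LightGraphs AbstractGraph interface, the current neighbor candidates of a point would then just be its out-neighbors in the graph:

using LightGraphs: nv, outneighbors

npoints = nv(graph)                  # number of data points in the knn graph
candidates = outneighbors(graph, 1)  # current approximate neighbors of point 1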

dillondaudert commented 4 years ago

To add my thoughts on search, my opinion is that inrange should be a separate function. It really depends on the semantics in mind, but my understanding of the two is:

search(index, queries, k, metric) -> find the k (approximate) nearest neighbors to each query in the index. Importantly, this means that each query will always have k neighbors in the result (assuming k is less than the number of points in the original dataset). This can return the index of those neighbors or both the index and the distance, but I agree with what others have said that it should probably return both since the distances are calculated anyway.

inrange(index, queries, radius, metric) -> find all points within radius distance of each query in the index according to metric. This differs from search in that each query has an arbitrary number of neighbors within a given radius.

I think these are semantically different enough to warrant being separate functions. Furthermore, even if they are combined, you can't distinguish them by dispatch based on whether the third argument is an Int or AbstractFloat. The type of radius for range searches should probably be result_type(metric, x, y), which - like @altre pointed out - could very well be an integer type.
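
To make that last point concrete with standard Distances.jl behavior (made-up data):

using Distances

result_type(Euclidean(), rand(3), rand(3))      # a floating-point type
result_type(Cityblock(), [1, 2, 3], [4, 5, 6])  # an integer type, so a "radius" can be an Int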

zgornel commented 4 years ago

The subtypes of ApproximateKNNGraph are true AbstractGraphs, so they implement that interface. I also plan to add more ways to construct and refine knn graphs (e.g. the various merging algorithms proposed in this paper).

This looks interesting. I wonder if a peer-to-peer approach to ann search can become feasible with this. If so, a julia search engine similar to yacy that has both online updates and distributed capabilities could become a killer app.

altre commented 4 years ago

In this case, distance values could be missing...

Good point.

inrange(index, queries, radius, metric)

Isn't metric already part of the index creation? For things like BallTree, VPTrees or BKTrees it would seem that you create the data structure for one specific metric and then would not need to pass it to the function. Is this different for NearestNeighborDescent.jl?

Datseris commented 4 years ago

(side comment: as it is not yet clear for me whether other authors want to have all NN-related packages under the same umbrella org, I have compiled this list: https://github.com/JuliaNeighbors/Neighborhood.jl/issues/2 that will help us have a summarized picture of what is out there)

Datseris commented 4 years ago

Re: the search type specification. Both Jonas and Altre have good points, and simply dispatching on Int vs. Float is probably too naive. But Jonas already pasted the idea we use in DynamicalSystems.jl.

We can dispatch on

struct FixedRange{T}
    ε::T
end

struct FixedMass
    k::Int
end

(as these structs are already part of an NN ecosystem, no need to add the extra Neighborhood I think :) )

Datseris commented 4 years ago

@dillondaudert I must say I strongly disagree that inrange and knn searches should be different functions. Here is why:

At least in my line of work, the search itself is never the final outcome of the work I have to do. After I find the neighbors I have to do stuff with them, for example using them to estimate the Lyapunov exponent. Therefore, I want to be able to change the way I find the neighbors through an argument. If it is a function instead of an argument, then I have to change my code, and not my argument, to see how my result changes with a different distance ε, or with a different number of neighbors k.

Besides, your very own comment already shows why these two things should be one function:

inrange(index, queries, radius, metric)
search(index, queries, k, metric)

is what you paste, and I can clearly see that the only difference here is the third argument, which more often than not also happens to have a different type. Julia's multiple dispatch comes to the rescue and reduces the multiplicity of 2 functions into 1.

(as @altre already said, the metric argument adds no value, as it is already part of the search structure)

Datseris commented 4 years ago

I see now that another wonderful benefit of having this common API is that users can easily benchmark each package for their specific application and thus pick the most suitable one!!! :)

zgornel commented 4 years ago

At least in my line of work, the search itself is never the final outcome of the work I have to do. After I find the neighbors I have to do stuff with them, for example using them to estimate the Lyapunov exponent. Therefore, I want to be able to change the way I find the neighbors through an argument. If it is a function instead of an argument, then I have to change my code,

This is not really true as it is easily solvable through:

search(index, queries, k::FixedMass, metric) = search(index, queries, k.k, metric)
search(index, queries, r::FixedRadius, metric) = inrange(index, queries, r.ε, metric)

which presents the advantage of other packages not having to support/import/care about FixedMass, FixedRadius. Granted, not even all signatures will be the same, yet having two different methods seems a better lower-level API that IMHO suits better the potential diversity of the packages that do neighbor search in general. FixedMass and FixedRadius are objects that ideally should not be implemented nor imported, as they are not needed. Method APIs are somewhat easier to understand and follow than data ones.

Datseris commented 4 years ago

I am confused, or I may not have understood your comment, because I think you are contradicting yourself... :S

You claim that supporting/caring about FixedMass is not a good idea. But then you say

yet having two different methods seems a better lower-level API that IMHO suits better the potential diversity of the packages that do neighbor search in general

In my eyes different methods mean the same function with different argument types, i.e. multiple dispatch, i.e. what I suggested with FixedMass/FixedRange. And I fully agree that having two different methods is much better than having two different functions, which is the option @dillondaudert suggested. I feel like methods and multiple dispatch are not only more Julian, but will also lead to better code quality.

This is not really true as it is easily solvable through:

Yeap, that is true, but it is also on the user side. Why make the user do this, if the original API can be what the user would do, since it has fewer function names and takes advantage of multiple dispatch?

zgornel commented 4 years ago

In my eyes different methods mean the same function with different argument types

My bad, 2 functions ;)

Yeap, that is true, but it is also on the user side.

The user would not do it unless they needed only a single method (bringing the user as an argument is highly debatable here). Just to summarize, there are two options:

  1. two separate functions (e.g. knn and inrange), extended directly by each package;
  2. a single search function with two methods, dispatching on FixedMass and FixedRange.

In my view, the method approach is more problematic because it puts an emphasis on all packages supporting FixedMass and FixedRange, which are not implemented by any of the packages so far.

I am obviously taking the effort and consideration already put into the existing packages as a stronger indicator of where things should go than new choices.

Datseris commented 4 years ago

(bringing the user as an argument is highly debatable here).

? We are discussing a common api so that people can use nn searches in a common manner. The usage of this API should be the primary concern, no?

In my view, the method approach is more problematic because it puts an emphasis on all packages supporting FixedMass and FixedRange, which are not implemented by any of the packages so far.

There is no difference whatsoever in what you say. In 1 you support 2 functions, in 2 you support 2 methods. There is no extra "weight" or "extra work" from the package's side whatsoever. It is literally the same amount of work. I don't get this argument at all.

dillondaudert commented 4 years ago

@Datseris As I previously explained, you cannot distinguish the third argument based on numeric type - radius may in fact be an Int, so the only way to differentiate the two is either by dispatching on other types (like FixedRange as suggested by others) or with different functions.

My original suggestion was for different functions, but I do kind of like the idea of FixedRange. I'm not sure FixedMass is the best name in the context of knn search. Certainly it would confuse me at first if search(..., FixedMass(k)) were the API for finding knns.

dillondaudert commented 4 years ago

Besides, your very own comment already shows why these two things should be one function:

inrange(index, queries, radius, metric)
search(index, queries, k, metric)

I was claiming that there's enough of a semantic difference between the two operations that different functions were warranted. You may or may not agree, but it isn't a counter-argument to say that they must be the same function since their signatures resemble one another - that's like saying min and max should be the same function since their signatures resemble one another.

I see now that another wonderful benefit of having this common API is that users can easily benchmark each package for their specific application and thus pick the most suitable one!!! :)

I agree, this would be one great benefit of a unified API.

dillondaudert commented 4 years ago

In this case, distance values could be missing...

Good point.

inrange(index, queries, radius, metric)

Isn't metric already part of the index creation? For things like BallTree, VPTrees or BKTrees it would seem that you create the data structure for one specific metric and then would not need to pass it to the function. Is this different for NearestNeighborDescent.jl?

I included it just to be explicit. I haven't decided if metric will be included in the graph data structure, although most likely it will.

Datseris commented 4 years ago

My original suggestion was for different functions, but I do kind of like the idea of FixedRange. I'm not sure FixedMass is the best name in the context of knn search. Certainly it would confuse me at first if search(..., FixedMass(k)) were the API for finding knns.

I fully agree, and at the moment we should only consider FixedMass and FixedRange as placeholders. We were able to get away with using these names in DynamicalSystems.jl because that's what the field uses, but in general I also don't think FixedMass is intuitive.

zgornel commented 4 years ago

? We are discussing a common api so that people can use nn searches in a common manner. The usage of this API should be the primary concern, no?

Of course, this is implicit.

However, my remark tried to hint that you substituted your personal choice (of having two methods) for user choice:

Yeap, that is true, but it is also on the user side. Why make the user do this,

At this point, most user usage patterns and needs are unknown and cannot be discussed as no factual elements on which to base decisions exist (hence the above-mentioned substitution). I may be wrong here, so if you have any evidence worth mentioning, please do share it.

Please note that this discussion should concentrate on subjective aspects of the API that we can agree on. This is not one of them. Thanks.

Datseris commented 4 years ago

@zgornel I apologize if I made you think that I consider my suggestion better than yours simply because it is mine. That is not at all the case nor my intention. It is just plain old multiple dispatch. Allow me to clarify:

Here is why I believe the multiple dispatch approach is subjectively better from a user perspective, or at least for the use case I describe. Imagine the scenario:

function lyapunovs(ref, data, structure, method)
    λ = 0.0
    for point in data
        neighbors = search(point, structure, method)
        rates = [expansion_rates(ref, n) for n in neighbors]
        λ += sum(rates)
    end
    λ/length(data)
end

This is a reasonable function that uses the neighbors in further computations once they are found. I truly hope this seems like reasonable usage, and not as personal as you initially perceived.

Now, in the rest of my script I call lyapunovs for many different parameters, so it is nested in 2 loops.

Now, I want to compare my results for different types of finding neighbors.

  1. Using multiple dispatch: change the parameter method of my simulations from FixedMass(n) to FixedRange(ε). Simple change.
  2. Your suggestion (where the searches are different functions and the method argument does not exist in lyapunovs): Go to the source code of lyapunovs, change the line neighbors = search(...) to use inrange instead of search. Keep in mind that since this is not a parameter change, I cannot actually save this difference when I save the data. This is extremely important for scientific work: you want to be able to easily save changes in parameters. Also keep in mind that lyapunovs might come from someone else's code, so going to the source code might not be trivial.

For me it is crystal clear why 1 is better than 2. If you think otherwise, please explain why. The more perspectives we have on this the better.
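
In code, with option 1 and made-up parameter values, the comparison is a pure parameter change:

λ1 = lyapunovs(ref, data, structure, FixedMass(5))     # fixed number of neighbors
λ2 = lyapunovs(ref, data, structure, FixedRange(0.1))  # fixed radius; nothing else changes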


@zgornel since we are discussing an API for many people to use, please, just like you asked me, I now ask you: provide a concrete example of why the 2-functions approach is more suitable. In the example above, the situation is clear to me. But once you provide an example in which 2 functions would be better, I will of course reconsider. I have already seen that you have different use cases, so this will be interesting to see. This also bears on your argument:

At this point, most user usage patterns and needs are unknown and cannot be discussed as no factual elements on which to base decisions exist

I don't think this is entirely true. Every person in this discussion is a potential user. Combining all our use cases already gives a very good estimate I believe, especially since our use cases are so diverse.


I am thinking about the argument "it's more work for the developer" and how it stands. I still don't see how extending 2 functions is less work than extending 2 methods. At the end of the day you just write 2 Julia functions either way.

zgornel commented 4 years ago

@Datseris no apologies needed, no offense was taken; I have already stated my arguments - mostly centered upon forced sharing of types and familiarity/consistency of existing signatures - and cannot contribute anything more substantial to the discussion, at least in this direction.

JonasIsensee commented 4 years ago

Uh, you gave me quite a bit of reading material. Having read all that I would like to raise one more point in favor of 2 functions.

Multiple dispatch is cool but it brings with it the expectation that methods sharing the same function name have essentially the same behaviour.

knn and inrange (or whatever you want to call them) make fundamentally different promises about what they return. Sure, it will likely be of the same type, but using inrange always carries the risk that no neighbors will be found and an empty array is returned.

I dare to assume that quite a few algorithms/applications rely on getting at least some neighbors returned, or at least need to do some special-casing to avoid errors for empty arrays. Since we cannot know what kind of special-casing is needed in potential applications, the user will have to differentiate between FixedMass and FixedSize herself anyway.
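
As a concrete illustration with NearestNeighbors.jl (which already has separate knn/inrange functions):

using NearestNeighbors

tree = KDTree(rand(3, 100))
idxs = inrange(tree, rand(3), 1e-6)  # may well be empty
isempty(idxs) && @warn "no neighbors within the radius; the application must decide what to do"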

If your application happens to be fine with a variable number of neighbors (which can be 0!), you can still easily implement something like

search(::FixedMass, args...) = knn(args...)
search(::FixedSize, args...) = inrange(args...)

(as we have done so far).

Datseris commented 4 years ago

Well, I can safely say that we were being stupid so far, not realizing how trivial it is to accommodate both options. Here is one way that solves literally all problems, and keeps everyone satisfied:

Neighborhood.jl (or however we'll call this at the end) exports these names:

make_datastructure(data, metric)
inrange(query, datastructure, ε; kwargs...)
knn(query, datastructure, n; kwargs...)
search(query, datastructure, method; kwargs...)
InRange
FixedAmount
insert!(ds, new_point) # is new_point an actual vector or just an index to somewhere?
delete!(ds, point) # is point an actual vector or just an index to somewhere?

where internally in Neighborhood.jl we just define

search(a, b, c::InRange; kwargs...) = inrange(a, b, c.ε; kwargs...)
search(a, b, c::FixedAmount; kwargs...) = knn(a, b, c.n; kwargs...)

The packages that accommodate the common API extend knn and inrange by dispatching on the datastructure type (and possibly on the query type; it is for the individual packages to decide). ε always has the result type of the Metric used in the datastructure. If they can, they also extend insert! and delete!, again by dispatching on the datastructure type (and possibly the point type).
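
A hedged skeleton of what an adopting package would then define (MyTree is a placeholder type and the bodies are elided):

struct MyTree
    # package-specific fields: the data, the tree nodes, the metric, ...
end

function knn(query, tree::MyTree, n::Integer; kwargs...)
    # traverse the tree; return idxs, dists of the n nearest neighbors
    error("package-specific knn search goes here")
end

function inrange(query, tree::MyTree, ε; kwargs...)
    # ε has the result type of the metric; return idxs, dists of all neighbors within ε
    error("package-specific range search goes here")
end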

@zgornel @JonasIsensee @altre @dillondaudert Can we get a vote/critique on the suggested API?

zgornel commented 4 years ago

@Datseris LGTM. Regarding insert! and delete!, Base points to:

insert!(ds, index, new_point)  # index is some unique key/index in ds, similar to Base.insert!
delete!(ds, index)  # again index uniquely identifies a point; deleteat! may also be a name option

In both cases a partial update of ds (i.e. neighbor lists) may be needed.

Potentially faster methods are push!, pop!, pushfirst! and popfirst!, which may be easier to implement depending on the structure. These are already supported by some indexing structures, but may be too much in this context.

dillondaudert commented 4 years ago

@Datseris I think that looks like a good first proposal (except I would keep the datastructure as the first argument for inrange/knn/search). Maybe a small nitpick, but it seems more natural for package developers to implement

search(ds, query, FixedRange; ...)
search(ds, query, FixedAmount; ...)

and to have in Neighborhood.jl the convenience functions

inrange(ds, query, r; ...) = search(ds, query, FixedRange(r); ...)
knn(ds, query, k; ...) = search(ds, query, FixedAmount(k); ...)

Also I'm wondering if anyone has thoughts on which functions should be mandatory to implement (if any). For NearestNeighborDescent.jl, the knn graphs aren't mutable so adding/removing points won't be implemented. I also wasn't planning on supporting an inrange operation, but haven't really thought about that much yet.

Datseris commented 4 years ago

Also I'm wondering if anyone has thoughts on which functions should be mandatory to implement (if any)

I would say knn searches are the minimum requirement. But I think it's worth considering an inrange version as well, if it is simple enough to go about. As you noted, mutability is only possible in a specific subset of the packages, so it can't be implemented for all.

Datseris commented 4 years ago

I've updated my proposed API because I realized that the absolute minimum for a common API is two functions:

  1. A unified way of making the search structure, like make_datastructure(data, metric)
  2. A knn search.

Optionally one should extend inrange as well, and I strongly recommend doing so :P (this is a personal opinion by the way)

Datseris commented 4 years ago

The problem is, how do the individual packages extend make_datastructure? Do we add a third argument, package, that is of type Val{Symbol}?

Datseris commented 4 years ago

Okay, I think the best way forward is to see some code. In the near future I'll make a PR here that implements this interface and invite everyone to review, but most importantly to give feedback on the "make the datastructure" part.

altre commented 4 years ago

Can't we just use the constructor like in NearestNeighbors.jl: BallTree(data::AbstractVector{V}, metric::M; datastructurespecifickeywords=....) where {V <: AbstractArray, M <: Metric}, where Metric is one of the types defined in the Distances.jl package? The exact restriction depends on the specifics of the datastructure; for example, in NearestNeighbors.jl KDTree uses M <: MinkowskiMetric.
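
For reference, this is roughly how it already looks in NearestNeighbors.jl, with the metric types coming from Distances.jl:

using NearestNeighbors, Distances

data = rand(3, 1000)                                # 3-dimensional points as matrix columns
btree = BallTree(data, Cityblock(); leafsize = 10)  # any Metric works for BallTree
ktree = KDTree(data, Euclidean(); leafsize = 10)    # KDTree is restricted to MinkowskiMetric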

Datseris commented 4 years ago

Is the Levenshtein distance mentioned here and on discourse also a Metric? @altre

Yes, your suggestion seems totally valid to me. ST(data, metric; kwargs...) seems like a reasonable API, with ST being any search structure.

altre commented 4 years ago

No, in StringDistances.jl it isn't, but I think that is an oversight; it might be just a simple PR away.

Datseris commented 4 years ago

But at the moment I don't see a reason to enforce types, i.e. why limit metric to be a subtype of Metric?

altre commented 4 years ago

Well, it might stop people from shooting themselves in the foot. Would you suggest M <: PreMetric or just pass a function?

Datseris commented 4 years ago

I would suggest to not add any type limitation to allow for extendability. At the moment I'll do it without limitations on the PR, and we can talk there about adding successive limitations.

JonasIsensee commented 4 years ago

Is the Levenshtein distance mentioned here and on discourse also a Metric? @altre

Yes, your suggestion seems totally valid to me. ST(data, metric; kwargs...) seems like a reasonable API, with ST being any search structure.

I'm wondering if this is in conflict with the idea of easily exchanging algorithms.

If you want to compare different algorithms in an application, then you would need to pass the function name as a symbol or so and then eval it?

Datseris commented 4 years ago

I have thought about that, which is why I have implemented

"""
    searchstructure(ST, data, metric; kwargs...) → st
Create a search structure `st` of type `ST` (e.g. `KDTree, BallTree` etc.) based on the
given `data` and `metric`. The data
"""
function searchstructure end
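
A hedged sketch of how an individual package could then hook in, using NearestNeighbors.jl names as an example (whether and how the metric is restricted is left to the package):

import Neighborhood: searchstructure
using NearestNeighbors, Distances

searchstructure(::Type{KDTree}, data, metric; kwargs...) = KDTree(data, metric; kwargs...)

# user code then switches search structures by changing only the type argument:
st = searchstructure(KDTree, rand(3, 100), Euclidean())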

altre commented 4 years ago

You could also just use a list of constructors, similar to: fs = [BallTree, KDTree]; [g(data, metric) for g in fs]

dillondaudert commented 4 years ago

I wonder if it's really that onerous to have metric <: PreMetric? What sort of contract should be expected for this argument, if not the interface defined in Distances.jl?

And btw, Levenshtein is indeed a SemiMetric, as are all the other measures in StringDistances.jl.

Datseris commented 4 years ago

@dillondaudert It's probably better to ask the opposite question: do you gain anything whatsoever by limiting all metrics to a subtype of PreMetric? If not, why do it? Not stating types keeps things general and extendable; see this talk for a good motivation: The Unreasonable Effectiveness of Multiple Dispatch.