JuliaAPlavin / FlexiJoins.jl

MIT License

Join by distance on strings #8

Open aplavin opened 11 months ago

aplavin commented 11 months ago

In GitLab by @Loualiche on Oct 27, 2023, 22:53

I naively thought it would be easy to use the great StringDistances.jl package to do a fuzzy merge based on strings (say, addresses or imperfect country names). There must be something that prevents the composition.

The exercise I was thinking of is something of this sort:

innerjoin(
    (df1, df2),
    by_distance(:country_name, Partial(Levenshtein()), <=(5)),
    multi=(M=closest,)
)

Think of a country name being "United States" in one table and "United States of America" in another. There are other examples where neither name strictly contains the other.


PS: I love this package.

aplavin commented 11 months ago

FlexiJoins uses NearestNeighbors.jl for performant distance joins. I believe the latter package only supports distances between vectors of numbers. Its codebase is relatively old, and is definitely not as generic as it could be. Fundamentally, BallTrees can support arbitrary elements; it's just that NearestNeighbors.jl restricts itself to vectors: https://github.com/KristofferC/NearestNeighbors.jl/blob/master/src/ball_tree.jl#L6-L7.

So, either NearestNeighbors.jl improvements are needed, or another fast distance search package.

A naive O(n^2) join just works though! You have to opt into it explicitly, and it is feasible only if your data is small, of course:

julia> using FlexiJoins, StringDistances

julia> innerjoin((["A"], ["A", "AA", "AB", "BB"]), by_distance(identity, Levenshtein(), <=(1)), mode=FlexiJoins.Mode.NestedLoop())
3-element StructArray(view(::Vector{String}, [1, 1, 1]), view(::Vector{String}, [1, 2, 3])) with eltype Tuple{String, String}:
 ("A", "A")
 ("A", "AA")
 ("A", "AB")
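
For intuition, here is a minimal sketch of what a nested-loop distance join amounts to, in plain Julia with no packages. The names `levenshtein` and `nested_loop_join` are hypothetical helpers written for this illustration; they are not part of FlexiJoins or StringDistances, and this is not a description of how `FlexiJoins.Mode.NestedLoop()` is implemented internally. It only shows why the cost is O(n^2): every left element is compared against every right element.

```julia
# Hypothetical helper: plain Levenshtein edit distance, two-row dynamic programming.
function levenshtein(a::AbstractString, b::AbstractString)
    ca, cb = collect(a), collect(b)
    prev = collect(0:length(cb))   # row for the empty prefix of `a`
    curr = similar(prev)
    for i in eachindex(ca)
        curr[1] = i                # deleting the first i characters of `a`
        for j in eachindex(cb)
            cost = ca[i] == cb[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev, curr = curr, prev    # the finished row becomes the previous row
    end
    return prev[end]
end

# Hypothetical helper: compare every pair and keep those whose distance satisfies `pred`.
function nested_loop_join(left, right, dist, pred)
    matched = Tuple{eltype(left), eltype(right)}[]
    for l in left, r in right
        pred(dist(l, r)) && push!(matched, (l, r))
    end
    return matched
end

matches = nested_loop_join(["A"], ["A", "AA", "AB", "BB"], levenshtein, <=(1))
```

On the same inputs as the FlexiJoins call above, `matches` holds the pairs `("A", "A")`, `("A", "AA")` and `("A", "AB")`; `"BB"` is dropped because it is two edits away from `"A"`.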