JuliaAPlavin / FlexiJoins.jl

MIT License
7 stars 0 forks source link

v0.1.30 broke some joins on DataFrames #5

Closed aplavin closed 4 months ago

aplavin commented 1 year ago

In GitLab by @ericphanson on Jun 5, 2023, 14:17

using th example from https://github.com/beacon-biosignals/DataFrameIntervals.jl/issues/24#issue-1406829150, namely,

using DataFrames, Intervals, StableRNGs
using FlexiJoins

N_INTERVALS = 1000
N_POINTS = 10^5

function make_example(n_intervals, n_points)
    rng = StableRNG(123)
    points = [rand(rng, 1:10_000) for _ in 1:n_points]
    L = DataFrame(:point => points)
    transform!(L, :point => ByRow(p -> Interval{Int,Closed,Closed}(p, p)) => :interval)

    indices = map(1:n_intervals) do _
        i = rand(rng, 1:10_000)
        return i:i+100
    end
    R = DataFrame(:indices => indices)
    transform!(R, :indices => ByRow(I -> Interval{Int,Closed,Closed}(first(I), last(I))) => :interval)
    return L, R
end

L, R = make_example(N_INTERVALS, N_POINTS)
@time FlexiJoins.innerjoin((L, R), by_pred(:point, ∈, :indices))

I get

julia> @time FlexiJoins.innerjoin((L, R), by_pred(:point, ∈, :indices))
ERROR: MethodError: no method matching valtype(::DataFrame)
Closest candidates are:
  valtype(::Type{<:AbstractArray}) at abstractarray.jl:178
  valtype(::Union{FlexiMaps.MappedAny, FlexiMaps.MappedArray}) at ~/.julia/packages/FlexiMaps/xDulV/src/mapview.jl:51
  valtype(::AbstractArray) at abstractarray.jl:160
  ...
Stacktrace:
  [1] supports_mode(#unused#::FlexiJoins.Mode.Hash, cond::FlexiJoins.ByPred{typeof(∋), Accessors.PropertyLens{:indices}, Accessors.PropertyLens{:point}}, datas::Tuple{DataFrame, DataFrame})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/bypredicate.jl:51
  [2] choose_mode(mode::Nothing, cond::FlexiJoins.ByPred{typeof(∋), Accessors.PropertyLens{:indices}, Accessors.PropertyLens{:point}}, datas::Tuple{DataFrame, DataFrame})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/conditions.jl:21
  [3] which_side_first(datas::Tuple{DataFrame, DataFrame}, cond::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}}, multi::Tuple{typeof(identity), typeof(identity)}, nonmatches::Tuple{FlexiJoins.Drop, FlexiJoins.Drop}, groupby::Nothing, cardinality::Tuple{typeof(*), typeof(*)}, mode::Nothing)
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:89
  [4] _joinindices(datas::Tuple{DataFrame, DataFrame}, cond::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}}, multi::Tuple{typeof(identity), typeof(identity)}, nonmatches::Tuple{FlexiJoins.Drop, FlexiJoins.Drop}, groupby::Nothing, cardinality::Tuple{typeof(*), typeof(*)}, mode::Nothing, cache::Nothing, loop_over_side::Nothing)
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:59
  [5] _joinindices(datas::Tuple{DataFrame, DataFrame}, cond::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}}; multi::Nothing, nonmatches::Nothing, groupby::Nothing, cardinality::Nothing, mode::Nothing, cache::Nothing, loop_over_side::Nothing)
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:45
  [6] _joinindices(datas::Tuple{DataFrame, DataFrame}, cond::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:44
  [7] joinindices(datas::Tuple{DataFrame, DataFrame}, cond::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:40
  [8] joinindices(datas::Tuple{DataFrame, DataFrame}, cond::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:39
  [9] _flexijoin(datas::Tuple{DataFrame, DataFrame}, cond::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:30
 [10] _flexijoin(datas::Tuple{DataFrame, DataFrame}, cond::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:29
 [11] flexijoin(datas::Tuple{DataFrame, DataFrame}, args::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:27
 [12] flexijoin(datas::Tuple{DataFrame, DataFrame}, args::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:27
 [13] innerjoin(::Tuple{DataFrame, DataFrame}, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:9
 [14] innerjoin(::Tuple{DataFrame, DataFrame}, ::FlexiJoins.ByPred{typeof(in), Accessors.PropertyLens{:point}, Accessors.PropertyLens{:indices}})
    @ FlexiJoins ~/.julia/packages/FlexiJoins/gX62N/src/joins.jl:9
 [15] top-level scope
    @ ./timing.jl:262 [inlined]
 [16] top-level scope
    @ ./REPL[14]:100:

using FlexiJoins 0.1.30, FlexiMaps 0.1.14. This works on v0.1.29 (same FlexiMaps version).

aplavin commented 1 year ago

This exact code and versions work for me, I suspect you are using an old Julia version.

Dataframes support in FlexiJoins is subject to change, as noted in the docs. They aren't the focus of flexijoins and specifically marked as such for now -- the support is kinda rudimentary as I never use DF.jl myself, if someone is interested they are free to suggest integration improvements. Then it will likely be considered stable API.

For two months already, way before the actual release of 0.1.30, there was even a note in the docs on the upcoming version change: v0.1.30+ only supports dataframes on julia 1.9+.

aplavin commented 1 year ago

In GitLab by @ericphanson on Jun 5, 2023, 18:03

For two months already, way before the actual release of 0.1.30, there was even a note in the docs on the upcoming version change: v0.1.30+ only supports dataframes on julia 1.9+.

The way to indicate that is to bump the compat requirement in the Project.toml. Your Project.toml indicates compatibility with Julia 1.7+: https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/Project.toml#L36. I would recommend bumping the compability in a new patch release and yanking v0.1.30 from General, or retroactively adding the compat bound to the registry.

aplavin commented 1 year ago

Sure, I would've made a breaking version or bumped compat, if that integration was a part of the guaranteed public interface. Meanwhile, it's specifically described as experimental and subject to change. It still works without any changes as long as one uses the latest Julia. Actual stable functionality (everything aside from DF.jl integration) continues to work on Julia 1.8 (tests pass).

aplavin commented 1 year ago

In GitLab by @ericphanson on Jun 6, 2023, 07:23

Ah, I see. It is frustrating to me that FlexiJoins doesn’t support the Tables interface like most other packages.

Is there a simple way to convert a Tables.jl table to a supported type without dependencies? It sounds like a vector of NamedTuples or NamedTuple of vectors won’t work, since they don’t support simultaneous row and column indexing.

... On Tue, 6 Jun 2023 at 13:05, Alexander Plavin (@aplavin) < gitlab@mg.gitlab.com> wrote: > Alexander Plavin commented > : > > Sure, I would've made a breaking version or bumped compat, if that > integration was a part of the guaranteed public interface. Meanwhile, it's > specifically described as experimental and subject to change. It still > works without any changes as long as one uses the latest Julia. Actual > stable functionality (everything aside from DF.jl integration) continues to > work on Julia 1.8 (tests pass). > > — > Reply to this email directly or view it on GitLab > . > You're receiving this email because of your account on gitlab.com. > Unsubscribe > > from this thread · Manage all notifications > · Help > >
aplavin commented 1 year ago

FlexiJoins supports anything following Julia collection interface: regular arrays (of namedtuples or arbitrary objects), a wide variety of abstractarrays, tuples, and so on. Support for stuff like arbitrary iterators or dictionaries is incomplete as of yet, but also in scope.

Neither "Tables.jl-tables" nor "Julian collections" include the other category, so there was this choice of which of them to support first-class. For me, the answer is clearly collections, they are fundamentally more general than flat tables. As usual, I recommend simply using tables that follow Julian collection interface, there are many. And maybe advocate for adding collection interface to tables that don't support it (:

Support for Tables.jl can be added to FlexiJoins, but only if it doesn't affect collection handling. See https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/ext/DataFramesExt.jl for how conversion from dataframes is handled for now.

aplavin commented 1 year ago

In GitLab by @ericphanson on Jun 6, 2023, 08:34

We’ve had this discussion before but I still don’t know what is not-“flat” about a collection. It sounds just like what Tables.jl calls a “row table”. In which case, you could support any Tables.jl table by calling Tables.rows on it (or, since iterable support is incomplete, Tables.rowtable to get a materialized vector of NamedTuples).

... On Tue, 6 Jun 2023 at 13:59, Alexander Plavin (@aplavin) < gitlab@mg.gitlab.com> wrote: > Alexander Plavin commented on a discussion > : > > FlexiJoins supports anything following Julia collection interface: regular > arrays (of namedtuples or arbitrary objects), a wide variety of > abstractarrays, tuples, and so on. Support for stuff like arbitrary > iterators or dictionaries is incomplete as of yet, but also in scope. > > Neither "Tables.jl-tables" nor "Julian collections" include the other > class, so there was this choice of which of them to support first-class. > For me, the answer is clearly collections, they are fundamentally more > general than flat tables. As usual, I recommend simply using tables that > follow Julian collection interface, there are many. And maybe advocate for > adding collection interface to tables that don't support it (: > > Support for Tables.jl can be added to FlexiJoins, but only if it doesn't > affect collection handling. See > https://gitlab.com/aplavin/FlexiJoins.jl/-/blob/master/ext/DataFramesExt.jl > for how conversion from dataframes is handled for now. > > — > Reply to this email directly or view it on GitLab > . > You're receiving this email because of your account on gitlab.com. > Unsubscribe > > from this thread · Manage all notifications > · Help > >
aplavin commented 1 year ago

I still don’t know what is not-“flat” about a collection

For example, specifically for flexijoins, it's a fundamental part of the design that after J = innerjoin((;A, B)), J[1].A is an element from A, and J.A is a collection of elements from A. This is possible and feels natural when both A,B,J are collections, but not so much with flat tables.

you could support any Tables.jl table by calling Tables.rows

I could, if Tables.jl-tables were a specific julia type. Otherwise, an object can be a collection and implement Tables.jl interface, but these two meanings can be different. And I really think it's important to keep base Julian interface (collection-based) always consistent. So, always calling Tables.rows doesn't seem possible.

Also, the question is what the result should be. Currently, joining DataFrames returns a DataFrame with flattened columns: J.x, J.x_1, J.y instead of J.A.x, J.B.x, J.B.y. I guess that's what DF.jl users expect, but it requires special handling compared to regular julian collection join.

aplavin commented 1 year ago

As v0.1.30 only brings minor improvements, and I myself only use julia 1.9 anyway, I think this version can be easily yanked and the newer v0.1.31 will be compatible with julia 1.9+. https://github.com/JuliaRegistries/General/pull/85716