JuliaData / SplitApplyCombine.jl

Split-apply-combine strategies for Julia
Other
149 stars 15 forks source link

Keyword arguments for joins #13

Closed bramtayl closed 5 years ago

bramtayl commented 5 years ago

For clarity from the user (which order do the functions go in again?)

bramtayl commented 5 years ago

Left default lkey at identity for now...

codecov-io commented 5 years ago

Codecov Report

Merging #13 into master will decrease coverage by 41.62%. The diff coverage is 78.57%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #13       +/-   ##
===========================================
- Coverage   97.89%   56.26%   -41.63%     
===========================================
  Files           9       10        +1     
  Lines         237      407      +170     
===========================================
- Hits          232      229        -3     
- Misses          5      178      +173
Impacted Files Coverage Δ
src/innerjoin.jl 20% <ø> (-73.34%) :arrow_down:
src/SplitApplyCombine.jl 0% <0%> (ø)
src/leftgroupjoin.jl 48.93% <84.61%> (-51.07%) :arrow_down:
src/group.jl 48.75% <0%> (-51.25%) :arrow_down:
src/combinedims.jl 52.17% <0%> (-47.83%) :arrow_down:
src/map.jl 60.71% <0%> (-39.29%) :arrow_down:
src/invert.jl 67.67% <0%> (-30.86%) :arrow_down:
src/product.jl 70.83% <0%> (-23.62%) :arrow_down:
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 529ced1...8d80e40. Read the comment docs.

andyferris commented 5 years ago

Cool, I like where this is going. Deprecations would be necessary. CC @c42f

@bramtayl I think it would be good practice to provide a motivation for changes in the OP, even if it is somewhat obvious, rather than leaving it blank.

Left default lkey at identity for now...

I wonder if we should make it mandatory for now? It's a breaking change anyway, and it would allow us to make natural joins work later on.

bramtayl commented 5 years ago

Ok, added a bit of motivation. If we're making breaking changes, making we could workshop the argument names a bit too? f is a bit too non-de-script, what about combine or together? Also, maybe by instead of lkey and by_right instead of rkey? Choosing by to mirror Base.sort(; by = identity)

c42f commented 5 years ago

Something like this will help a lot :+1:

On balance I prefer lkey and rkey, I think. You could rename f to merge I suppose?

(A semi-related thought strikes me — would it also be harmless and convenient to allow Symbols for those so that lkey=:age would mean lkey=getproperty(:age)?)

andyferris commented 5 years ago

Can we just define (a::Symbol)(x) = getproperty(x, a)?

andyferris commented 5 years ago

f exists in lots of places. I keep thinking mapreduce(f, op, itr) should be more like reduce(op, itr; map=f) or similar. There’s also group and product and so-on.

(Some of these have function arguments strategically placed for do, but apart from that I think keywords are better in general).

Also need to consider potentially weaker constant propagation for keyword args (maybe this is fixed on Julia 1.1?)

bramtayl commented 5 years ago

If you import LightQuery then Name(:a) and Names(:a, :b) will both work in place of functions and you won’t have to worry about constant propagation. Another thing to think about is that f would have to default to merge if you want natural joins to work out of the box

c42f commented 5 years ago

Can we just define (a::Symbol)(x) = getproperty(x, a)?

Well... obviously not in this package. But in Base... maybe?! It's kind of simultaneously great and worryingly surprising. Though what else it could mean, I'm not sure.

julia> (a::Symbol)(x) = getproperty(x,a)

julia> :im(Complex(1,2))
2

heh heh.

c42f commented 5 years ago
julia> data = [(a=1,b=2), (a=10,b=20)]
2-element Array{NamedTuple{(:a, :b),Tuple{Int64,Int64}},1}:
 (a = 1, b = 2)  
 (a = 10, b = 20)

julia> :a.(data)
2-element Array{Int64,1}:
  1
 10

very amusing :-)

andyferris commented 5 years ago

Right - it's quite funny, not sure if it is a "serious" thing but this is almost what you mean when you start accepting symbols in the place of functions, so why not take that to it's logical conclusion? :rofl:

I'm thinking a nicer interface would be something like how broadcast uses broadcastable:

callable(f::Callable) = f # Callable === Union{Type, Function}
callable(s::Symbol) = x -> getproperty(x, s)
...

innerjoin(l, r; lkey, rkey, f, compare = isequal) = _innerjoin(l, r, callable(lkey), callable(rkey), callable(f), compare)
function _innerjoin(l, r, lkey, rkey, f, compare)
    ...
end

And do the same thing for group, leftgroupjoin, mapview, etc (the pattern might even be just as nice in Base for map, filter, findall, mapreduce).

EDIT: the reason for this version is this pattern is easily extensible for new types by new packages by overloading callable.

EDIT: @c42f callable could become the mechanism for halting the expansion of _ in the crazy _ function-making scheme like you have been speculating about :laughing:

andyferris commented 5 years ago

A generic version of this could become "select":

callable(x::NamedTuple{(:a, :b)}) = y -> (a = callable(x.a)(y), b = callable(x.b)(y))

map((a = :a, b = row -> 2*row.b), table)
 # or
(a = :a, b = row -> 2*row.b).(table)
bramtayl commented 5 years ago

I’ve been thinking for a long time that .a should be shorthand for Val{:a}(). Getting the symbols into the type domain asap saves having to ever worry about constant prop. Then you could call overload. (::Val{T})(x) where T = getproperty(x, T)

andyferris commented 5 years ago

I think _.a will soon mean what we want.

c42f commented 5 years ago

A generic version of this could become "select"

Ok, that's pretty clever... and a lot less crazy than some of the other ideas we're speculating about here. But can callable have a sensible default for values which are intended to be used as function objects? I think not?

bramtayl commented 5 years ago

Also yeah callable seems like a good idea, especially because LightQuery packages anonymous functions along with the expressions that generated them (for potential use in SQL translation).

bramtayl commented 5 years ago

Also yeah I really like the idea of calling NamedTuples with functions on other objects

nalimilan commented 5 years ago

While you're at it, in DataFrames we chose the lkey => rkey syntax instead of two separate arguments. That also allows passing a single value if the keys are the same in both inputs. It would be nice to use the same API here.

andyferris commented 5 years ago

That is pretty neat, @nalimilan. I was thinking maybe rkey = lkey as default keyword arg value would solve the same problem but without involving Pair.

c42f commented 5 years ago

Yes, that's pretty neat and allows for some interesting syntax for the common case, for example

innerjoin(customers, :id => :customer_id, orders)

without loosing the ability to take apart the lkey, rkey and comparison for accelerated queries (which is the underlying reason this interface is currently a bit confusing... right Andy?)

Hmm.. but on that note where is this package going? If I understood you right @andyferris part of what you're trying to do is to make a functional interface in this package which can act as a backend for various syntaxes? From that point of view the positional arguments seem like the right way to go, but that seems to conflict with making this package more usable in its own right (aka, this PR).

bramtayl commented 5 years ago

Ok... more thoughts. What about:

struct Match{Comparison, LeftKey, RightKey}
    comparision::Comparison
    left_key::LeftKey
    right_key::RightKey
end

Match(f) = Match(isequal, f, f)
Match(s::Symbol) = Match(x -> getproperty(x, s))

(m::Match)(t::Tuple{A, B}) where {A, B} = m.comparison(m.left_key(t[1]), m.right_key(t[2]))

Then really, what we got, is a way to decompose this into three lazy operations:

@> Product(left, right) |>
    Filter(Match(:a), _) |>
    Map(merge, _)

With the right optimizations, inner_join wouldn't even be needed any more.

Sorry, I'm doing that thing again where I propose not very well thought out very breaking changes

c42f commented 5 years ago

Such a lot of great ideas :-)

innerjoin(customers, Match(:id), staff)

innerjoin(customers, Match(:id, :customer_id), orders)

or in analogy to @Select and @Compute:

innerjoin(customers, @Match($id == $customer_id), orders)

Admittedly in that last one the rules for inferring which side we're pulling the fields from could get a little ropey, though restricted to simple cases it could be quite nice. A convention for naming table placeholders could disambiguate:

# Orders in first year... a bit contrived I know
innerjoin(customers, @Match($_2.date - $_1.date < Year(1)), orders)

with _1 and _2 being the left and right tables to be joined. Hmm, a bit ugly. Btw @andyferris have you thought about what happens when joining more than two tables together?

Speculating about all this, I feel I should know more about Query.

bramtayl commented 5 years ago

Ok, even more... I've been thinking for a long time about very lazy and very factored interfaces, and looking through the rest of the functions here I'm getting quite a few ideas... Here's a few...

groupreduce: map + reduce + group innerjoin: product + filter + map combinedims: flatten mapmany: map + flatten

So might it be fair to say, then, that given the right what I call function optimizations, e.g. Match is an example of a function optimization, another one would be Reduce(+) instead of sum, that really a lot of what is here could be simply optimizations of Base iterators that could some day make it into Base?

c42f commented 5 years ago

@bramtayl I think this is what Andy had in mind, the design of Andy's several data related packages is to be composable. For example, putting arrays from AcceleratedArrays into a Table from TypedTables should accelerate joins done with innerjoin from SplitApplyCombine.

c42f commented 5 years ago

I'm just an observer of what Andy's been doing btw, he can tell you the full story.