bramtayl / LightQuery.jl

One query to rule them all

Feedback from package authors #2

Closed bramtayl closed 5 years ago

bramtayl commented 5 years ago

So this is still experimental, but I'm dying for feedback from package authors. I hope this isn't rude. @davidanthoff @nalimilan I updated the readme to show how one macro can replace all of DataFramesMeta and Query. The syntax is slightly less terse but much more flexible.

bramtayl commented 5 years ago

Bunch of updates, new examples in the README. Added support for SplitApplyCombine @andyferris (though not tested yet)

bramtayl commented 5 years ago

Again sorry for the pings I just get overly excited about programming lol

bramtayl commented 5 years ago

Ok so I'm basically happy with what's here. I've got a chaining mechanism, a lazy call mechanism, and several model query verbs. I've got extremely unoptimized methods for these verbs on NamedTables. My hope is that people will import these methods and optimize them for various data structures, so that essentially DataFramesMeta could just hold the DataFrames methods and QueryOperators could just hold the Enumerable methods.

andyferris commented 5 years ago

Hi @bramtayl,

It's reassuring, in many ways, to see so many of us interested in the same set of problems. :) And it's great to be excited!

What I didn't really get (until maybe a hint in your most recent comment) was what the overall design philosophy was. Some kind of introduction in the README or documentation would probably be quite useful. Similarly, an example that shows how it all comes together. I'm taking this package as a convenient interface to manipulate NamedTuples and iterables of NamedTuples? I guess I'm asking what makes this approach unique compared to others? (Or, put bluntly, what itch are you trying to scratch?)

Anyway, I'm happy to discuss - let us know how we can be useful.

> Added support for SplitApplyCombine

Can you elaborate on this?

> I've got extremely unoptimized methods for these verbs on NamedTables.

I'm not sure what NamedTables refers to, exactly?

(I do love that you are thinking of these functions as "verbs". I've come to think of SplitApplyCombine as providing verbs, TypedTables as providing nouns and AcceleratedArrays as providing "adverbs". I'm not sure if that is silly?)

bramtayl commented 5 years ago

NamedTables was a typo for NamedTuples. So basically this is just a straightforward port of dplyr from R. The methods only work on NamedTuples right now, but I think that they could work on just about anything vaguely tabular. QueryOperators has a set of unexported verbs with methods on Enumerables, and DataFramesMeta has a set of unexported verbs with methods on DataFrames. If instead both packages just extended the methods in this package, then we could have one uniform tabular query interface?

Previous iterations of this package had actually sat down and created these bridges. So for example, I had

    # Bridge one `group` verb to each backend by dispatching on the data type.
    group(data::Enumerable, n::Nameless) =
        QueryOperators.groupby(data, n.f, n.expression)
    group(data::AbstractDataFrame, n) =
        DataFrames.groupby(data, n)
    # Fallback for any other collection.
    group(data, n) =
        SplitApplyCombine.group(n, data)

These are gone from the package at the moment by Occam's razor. But, for example, with a bit of elbow grease, I think most if not all of QueryOperators could be refactored as methods of LightQuery verbs. SplitApplyCombine is a bit trickier cause it's not explicitly indicated for tabular data...

I haven't taken a look at JuliaDB yet cause last time I checked it wasn't working on 1.0, but I'm pretty sure it could be integrated in a similar way.

andyferris commented 5 years ago

OK - thanks for the explanation.

> If instead both packages just extended the methods in this package, then we could have one uniform tabular query interface?

It would be good to share the same fundamental operations, yes. I strongly feel that these should eventually become operations in Base (for things that are roughly equivalent in functionality to say map, reduce or filter) or a standard library (we have LinearAlgebra, imagine a RelationalAlgebra stdlib).

> SplitApplyCombine is a bit trickier cause it's not explicitly indicated for tabular data...

Yes, indeed, my personal approach so far has been to see what generic operations are good for doing relational algebra. For example, if you take the textbook definition of a relation as a collection of (named) tuples, port that to Julia and say that this is any object that supports iterate and gives NamedTuple{names} (for the same names in each row), then we already have a whole bunch of functionality in Base. We can map rows to give new rows, we can filter rows, we can even make comprehensions and generators which even do an inner join say on column b of table1 and table2 along the lines of [(a = row1.a, b = row1.b, c = row2.c) for row1 in table1, row2 in table2 if row1.b == row2.b]. In other words, in Julia 1.0 I think we have a good chunk of this interface already.
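As a minimal sketch of the point above, using only Base: the tables `table1` and `table2` and their columns are made-up data for illustration, and a "relation" here is just a vector of NamedTuples.

```julia
# Two small relations: plain vectors of NamedTuples (illustrative data).
table1 = [(a = 1, b = 10), (a = 2, b = 20), (a = 3, b = 30)]
table2 = [(b = 10, c = "x"), (b = 30, c = "y")]

# Map rows to new rows.
doubled = map(row -> (a = row.a, twice = 2 * row.a), table1)

# Filter rows.
big = filter(row -> row.b > 15, table1)

# Inner join on column b, written as a comprehension over both tables.
joined = [(a = row1.a, b = row1.b, c = row2.c)
          for row1 in table1, row2 in table2 if row1.b == row2.b]
```

Nothing here knows about tables; it is the ordinary container interface applied to rows that happen to be NamedTuples.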

If we can fill this out with all the groups and innerjoins and so-on that we need, I was hoping we wouldn't need a specific interface for tabular data, just simply rely on the standard Julia interface for containers. Really, there's a bit of trickery to get things like optimizations for column-based storage working, have lazy, chainable operations, and so-on, but IMO that's actually not too bad.

bramtayl commented 5 years ago

I'm with you that row-wise operations are basically reducible to Base iterators.

I'm still pretty sure you need specific interfaces for tabular data (or at least, they make things easier). Certainly select, remove, transform, and based_on are all specific to tabular data. And if you want to be able to just say join_by(data1, data2, :b) instead of join_by(data1, data2, (i, j) -> i.b == j.b), you need to assume a tabular structure.

So bottom line: relational algebra standard library: yes! This is kinda what I tried to set out to do here but I think it would be better to be a stdlib (or at least something more official)

andyferris commented 5 years ago

Yes, indeed - specific interfaces for tabular data will make things much, much more usable.

I guess I am thinking these interfaces should act as syntax sugar. Functions which take symbol names to identify columns would simply create closures or whatever and call higher-order functions like map, filter, reduce, group and innerjoin (that expect functions). Those higher-order functions should be smart enough to introspect the inputs and take advantage of columnar storage and so-on. For example, I curried getproperty(::Symbol) in TypedTables for exactly this purpose (a more general, multi-column selector and transformer is also necessary but I haven't got to that yet!). Of course, I don't particularly want users to be typing getproperty(:b) everywhere (well, one day we might use the syntax _.b which isn't too bad). But IMO this is where we can build friendlier interfaces and macros on top of all that.
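A rough sketch of the "sugar over higher-order functions" idea, in Base only. The names `column` and `keep_where` are hypothetical helpers invented for this illustration; they are not the TypedTables or LightQuery API.

```julia
# Hypothetical helper: turn a column name into a function from a row to
# that column's value (a stand-in for a curried getproperty).
column(name::Symbol) = row -> getproperty(row, name)

# Hypothetical sugar: keep rows whose named column satisfies `predicate`.
# It only builds a closure and delegates to Base's higher-order `filter`.
keep_where(predicate, table, name::Symbol) =
    filter(predicate ∘ column(name), table)

table = [(a = 1, b = 2.0), (a = 2, b = 4.0), (a = 3, b = 6.0)]

bs = map(column(:b), table)
kept = keep_where(b -> b > 2.0, table, :b)
```

Any columnar optimization would then live in `filter` and `map` for the table type, not in the sugar.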

bramtayl commented 5 years ago

Ok, in that case, how about this for a proposal:

    struct Keys{Names} end
    Keys(names::Symbol...) = Keys{names}()
    Keys(:a, :b, :c)

You could pass in Keys instead of an anonymous function into innerjoin, group, orderby, etc. and use dispatch to get the desired tabular data specific method?
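To make the dispatch idea concrete, here is a small sketch. `Keys` is repeated from the proposal so the block runs on its own; the verb `order` is a hypothetical name used only for this example.

```julia
struct Keys{Names} end
Keys(names::Symbol...) = Keys{names}()

# Generic method: `by` is any function of a row.
order(table, by) = sort(table; by = by)

# Keys method, chosen by dispatch: sort on the tuple of named columns.
# A columnar table type could specialize this method further.
order(table, ::Keys{Names}) where {Names} =
    sort(table; by = row -> map(name -> getproperty(row, name), Names))

table = [(a = 2, b = 1), (a = 1, b = 2), (a = 1, b = 1)]

by_keys = order(table, Keys(:a, :b))
by_function = order(table, row -> row.b)
```

Because the column names are type parameters, a method for `Keys` can see them at compile time, which an anonymous function would hide.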

bramtayl commented 5 years ago

So then here's what happens if you start trying to delete functions from LightQuery:

Don't really need any more:

- as_rows/as_columns/pretty: probably better suited to a specifically built tabular data interface like TypedTables
- where: filter + columnwise optimization
- orderby: sort + Keys + columnwise optimization
- group_by: SplitApplyCombine.group + Keys + columnwise optimization
- chunk_by: SplitApplyCombine.group + Keys
- inner_join: SplitApplyCombine.innerjoin + Keys

I think still useful:

- select/remove: really functions which should exist in Base, but can stay here for now? rename would be great here too.
- based_on/transform: really belong here or in a relational data stdlib
- `@>`: one of several ways of chaining, still my favorite at the moment
- `@_`: still pretty indispensable, I think. We're going to need a meta copy of the expression for SQL translation if that ever comes about, and the PR in Base on _ anonymous functions seems like it got stuck in the mud

andyferris commented 5 years ago

Sure, something exactly like that.

I have f = GetProperty{:a}() as the function that does f(x) = x.a. So I was thinking something like f2 = GetProperties{(:a, :b, :c)}() being the function that makes a named tuple f2(x) = (a = x.a, b = x.b, c = x.c).

Ideally we'd make a more powerful Select thing that can not only project columns but also transform and combine information from different columns, all in the one step.
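A minimal sketch of the two callable types described above. The type names follow the comment, but the definitions are written from scratch for illustration and are not copied from TypedTables; the vararg constructor is an added assumption.

```julia
# f = GetProperty{:a}() behaves like x -> x.a
struct GetProperty{Name} end
(::GetProperty{Name})(x) where {Name} = getproperty(x, Name)

# f2 = GetProperties{(:a, :b)}() behaves like x -> (a = x.a, b = x.b)
struct GetProperties{Names} end
GetProperties(names::Symbol...) = GetProperties{names}()  # assumed convenience constructor
(::GetProperties{Names})(x) where {Names} =
    NamedTuple{Names}(map(name -> getproperty(x, name), Names))

row = (a = 1, b = 2.0, c = "three")

f = GetProperty{:a}()
f2 = GetProperties(:a, :c)
```

Here `f(row)` returns the value of `a`, and `f2(row)` returns a smaller named tuple with only `a` and `c`.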

bramtayl commented 5 years ago

True, but different functions would want to do different things with different columns. Like groupby would just want to select the columns. orderby would want to select the columns and then run isless. inner_join would want to select the columns and then test for equality. So I think a dedicated Keys struct makes sense?

andyferris commented 5 years ago

Yes. (Note that I think TypedTables and SplitApplyCombine currently have the mechanics for columnar optimization for all the items on your list, so long as you are grouping or joining by just one column).

andyferris commented 5 years ago

I think the different steps could be composed? For example SplitApplyCombine.innerjoin takes 4 different functions for doing the various things. This was unfortunately more complex than I first wanted but you need this level of flexibility.

    innerjoin(lkey, rkey, f, comparison, left, right)

Performs a relational-style join operation between iterables `left` and `right`, returning a collection of elements `f(l, r)` for which `comparison(lkey(l), rkey(r))` is `true` where `l ∈ left`, `r ∈ right`

So you do something like innerjoin(GetProperties(:a, :b), GetProperties(:a, :b), merge, isequal, table1, table2) to do an inner join where columns a and b must match.
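To show how the four function arguments fit together, here is a naive nested-loop sketch with the same argument order as the docstring above. `naive_innerjoin` and `key_ab` are hypothetical names; this is not the SplitApplyCombine implementation, which is free to use hashing or sorting instead.

```julia
# Naive O(length(left) * length(right)) join: keep f(l, r) whenever the keys match.
function naive_innerjoin(lkey, rkey, f, comparison, left, right)
    [f(l, r) for l in left for r in right if comparison(lkey(l), rkey(r))]
end

# Key function selecting columns a and b (a stand-in for GetProperties(:a, :b)).
key_ab(row) = (a = row.a, b = row.b)

table1 = [(a = 1, b = 1, x = "p"), (a = 1, b = 2, x = "q")]
table2 = [(a = 1, b = 2, y = 10.0), (a = 2, b = 2, y = 20.0)]

joined = naive_innerjoin(key_ab, key_ab, merge, isequal, table1, table2)
```

Only the rows where both `a` and `b` agree survive, and `merge` combines the matching pair into one row.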

bramtayl commented 5 years ago

Ok, well I've got a (probably not constant inferable) version of rename now:

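    export rename
    """
        rename(data, renames...)

    ```jldoctest
    julia> using LightQuery

    julia> rename((a = 1, b = 2), :a => :c)
    (b = 2, c = 1)
    ```
    """
    function rename(data::NamedTuple, renames...)
        # Old names to drop from `data`.
        olds = map(pair -> pair.first, renames)
        merge(
            remove(data, olds...),
            # Rebuild the selected values under their new names. `values` passes
            # them as a single tuple, so more than one rename works (this
            # assumes `select` returns a NamedTuple).
            NamedTuple{map(pair -> pair.second, renames)}(values(select(data, olds...)))
        )
    end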

So what would you think about registering a request in Base for dedicated select, delete, and rename methods on NamedTuples (with someone over there working some constant propagation magic)

Then I can just keep transform, based_on, and the two macros here as a tiny package?
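For reference, the three operations can be sketched in Base alone. The names `select_names`, `remove_names`, and `rename_names` are hypothetical, chosen to avoid clashing with LightQuery's verbs, and nothing here is tuned for constant propagation.

```julia
# Keep only the named fields.
select_names(data::NamedTuple, names::Symbol...) = NamedTuple{names}(data)

# Drop the named fields, keeping the rest in their original order.
function remove_names(data::NamedTuple, names::Symbol...)
    kept = Tuple(name for name in keys(data) if !(name in names))
    NamedTuple{kept}(data)
end

# Rename fields given as old => new pairs; renamed fields move to the end.
function rename_names(data::NamedTuple, renames::Pair{Symbol, Symbol}...)
    olds = map(first, renames)
    news = map(last, renames)
    renamed = NamedTuple{news}(values(select_names(data, olds...)))
    merge(remove_names(data, olds...), renamed)
end

data = (a = 1, b = 2, c = 3)
```

With this sketch, `rename_names((a = 1, b = 2), :a => :c)` gives the same result as the doctest above.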

bramtayl commented 5 years ago

innerjoin(GetProperties(:a, :b), GetProperties(:a, :b), merge, isequal, table1, table2) seems reasonable as long as there's reasonable defaults, like

    y_selector = x_selector
    match_function = isequal
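One way those defaults could look, as a sketch: `join_rows` and the `combine` keyword are hypothetical names, while `y_selector` and `match_function` follow the wording above. The join itself is a naive nested loop, not SplitApplyCombine's method.

```julia
# Keyword defaults: the right-hand selector falls back to the left-hand one,
# and keys are compared with isequal unless stated otherwise.
function join_rows(x_selector, left, right;
                   y_selector = x_selector,
                   match_function = isequal,
                   combine = merge)
    [combine(l, r) for l in left for r in right
     if match_function(x_selector(l), y_selector(r))]
end

table1 = [(a = 1, x = "p"), (a = 2, x = "q")]
table2 = [(a = 2, y = 20.0), (a = 3, y = 30.0)]

joined = join_rows(row -> row.a, table1, table2)
```

The common case then needs only one selector, and the full four-function form stays available through the keywords.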

andyferris commented 5 years ago

> So what would you think about registering a request in Base

Ultimately, yes. I suggest we first make a mini-interface for manipulation of objects with properties - what @quinnj calls the "PropertyAccessible" interface. We can do this in a small package with lots of prototyping to reduce churn in Base. Or make a Julep. Or whatever.

> as long as there's reasonable defaults

Yeah, I'd love to work more on that... for example a natural inner join should be very easy to write (as easy as matrix multiplication).
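As a sketch of what a natural inner join means in this setting: rows are matched on every property name they share. `shared_columns_match` and `natural_join` are hypothetical names, and the nested loop is for clarity rather than speed.

```julia
# True when every property name present in both rows has equal values.
# With no shared names this is vacuously true (a cross join).
shared_columns_match(l, r) =
    all(isequal(getproperty(l, name), getproperty(r, name))
        for name in propertynames(l) if name in propertynames(r))

natural_join(left, right) =
    [merge(l, r) for l in left for r in right if shared_columns_match(l, r)]

t1 = [(a = 1, b = 10), (a = 2, b = 20)]
t2 = [(b = 10, c = "x"), (b = 30, c = "y")]

t3 = natural_join(t1, t2)
```

Here the only shared column is `b`, so the result contains the single row where `b` agrees.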


To give an idea of what I want, LINQ has a pretty front-end syntax that slightly resembles SQL (that we can implement with macros, as you want here, and as done in Query.jl and in DataFramesMeta.jl) and C# lowering just transforms these to normal method calls. My innerjoin method is similar to the Microsoft one:

https://docs.microsoft.com/en-us/dotnet/api/system.linq.enumerable.join?view=netframework-4.7.2#System_Linq_Enumerable_Join__4_System_Collections_Generic_IEnumerable___0__System_Collections_Generic_IEnumerable___1__System_Func___0___2__System_Func___1___2__System_Func___0___1___3__System_Collections_Generic_IEqualityComparer___2__

The important thing about LINQ is that it also works on non-tabular data... you can use all these methods to traverse XML and JSON and whatever data structures you have at hand. There's no assumptions of columns, or of named tuples, or any of that, and hence the methods in SplitApplyCombine.

bramtayl commented 5 years ago

Ok cool I'm on board

andyferris commented 5 years ago

:smile:

bramtayl commented 5 years ago

Also +1 on natural joins

bramtayl commented 5 years ago

Ok, so based off of this feedback, I've:

- removed all the rowwise functions (where, order_by, chunk_by, ungroup, inner_join)
- added some new functions specifically to fill out the "PropertyAccessible" interface:
  - name, rename
  - gather, spread
  - match_at, in_common, same_at, same
  - curried versions of select, same_at, and same (for use with groups, joins, and natural joins respectively)

How does that look? What else would a property accessible interface need?

andyferris commented 5 years ago

> Also +1 on natural joins

I really want t3 = t1 ⨝ t2 :)

> How does that look?

I'm going to be swamped today, I'll have a dive in when I'm able.

davidanthoff commented 5 years ago

I think this all looks great, but what I don't understand is how this is different from the Query.jl/QueryOperators.jl design, in a broad sense? For example, https://github.com/queryverse/QueryOperators.jl/blob/master/src/operators.jl is where I've defined the basic query operators (or verbs) for quite a while, and then the whole idea of having different backends, an iterator based fallback implementation that works with not just tables but anything etc. is all what has been the core design of Query.jl for a couple of years now.

I would really love to collaborate on all of this, but at the same time I would also very much not like to start from scratch, but ideally just evolve the existing implementations in Query/QueryOperators to gain new functionality. If there are some fundamental limitations in the design over there, it would be great to hear about them.

bramtayl commented 5 years ago

I've greatly reduced the scope of the package at Andy's suggestion. At this point it is just 1) a basic interface for operations on a single NamedTuple and 2) a couple of useful macros for Query-ing. So this package is perfect for interfacing with QueryOperators; for example, this could work:

    using DataFrames: DataFrame
    using Query: query
    import QueryOperators

    using LightQuery

    @> DataFrame(a = [1, 2, 3], b = [1.0, 2.0, 3.0]) |>
        query(_) |>
        QueryOperators.map(
            # Row-wise function: add a column c to each row.
            (@_ transform(_, c = @_ _.a + _.b)),
            _
        ) |>
        collect(_)

QueryOperators.map will just need a map(::Nameless, ::Enumerable) wrapper method.

bramtayl commented 5 years ago

And if you wanted to simplify the syntax a bit, you could just add a QueryOperators.transform convenience function which does the above:

    @> DataFrame(a = [1, 2, 3], b = [1.0, 2.0, 3.0]) |>
        query(_) |>
        QueryOperators.transform(_, c = @_ _.a + _.b) |>
        collect(_)

Or just overload and reexport the transform that's here for ::Enumerable
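Stripped of the query machinery, the row-wise step amounts to merging new fields into each row. A Base-only sketch, where `transform_row` is a hypothetical stand-in and not LightQuery's `transform`:

```julia
# Add (or overwrite) fields on a single NamedTuple row.
transform_row(row::NamedTuple; assignments...) = merge(row, (; assignments...))

table = [(a = 1, b = 1.0), (a = 2, b = 2.0), (a = 3, b = 3.0)]

# Apply the row-wise function with an ordinary map.
with_c = map(row -> transform_row(row, c = row.a + row.b), table)
```

A backend such as QueryOperators would only need to supply its own `map` for the same row-wise function.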

bramtayl commented 5 years ago

I've updated the package so it does constant propagation (mostly with a bunch of @inline calls). The constant propagation only works on master due to recent compiler improvements. Still stuck on getting rename, I think because we can't constant propagate through keyword arguments (yet).

nalimilan commented 5 years ago

I fully agree we should agree on a common minimal API for these operations so that they can be used with any data structure.

> I haven't taken a look at JuliaDB yet cause last time I checked it wasn't working on 1.0, but I'm pretty sure it could be integrated in a similar way.

Cc: @piever for JuliaDBMeta

piever commented 5 years ago

JuliaDB and JuliaDBMeta have just been ported to Julia 1.0. In general most things are row wise there, so macros for working with NamedTuples are quite useful there.

You may want to check https://github.com/JuliaData/TableOperations.jl, which is an attempt at implementing queries directly in terms of the getproperty interface of rows in Tables.jl. In particular this allows several nice tricks in that you don't need to materialize the whole row but can simply create a custom object (not necessarily a NamedTuple) where getproperty does the right thing. For example I imagine that select(t, :x) where t is an iterator of "property accessible" objects, could be implemented by simply changing propertynames and tricks like that.
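A sketch of that trick, under the assumption that a row is anything supporting `getproperty` and `propertynames`. `SelectedRow` and `select_lazy` are hypothetical names for this illustration and are not taken from TableOperations or Tables.jl.

```julia
# Lazy wrapper: exposes only `Names` without copying the underlying row.
struct SelectedRow{Names, Row}
    row::Row
end
SelectedRow{Names}(row) where {Names} = SelectedRow{Names, typeof(row)}(row)

Base.propertynames(::SelectedRow{Names}) where {Names} = Names

function Base.getproperty(selected::SelectedRow{Names}, name::Symbol) where {Names}
    name in Names || throw(ArgumentError("column $name was not selected"))
    # getfield bypasses the getproperty overload to reach the wrapped row.
    getproperty(getfield(selected, :row), name)
end

# Selecting columns is then a lazy generator over wrapped rows.
select_lazy(table, names::Symbol...) = (SelectedRow{names}(row) for row in table)

table = [(x = 1, y = "a"), (x = 2, y = "b")]
xs = [row.x for row in select_lazy(table, :x)]
```

No row is materialized; the wrapper only changes which property names are visible.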

andyferris commented 5 years ago

> I think because we can't constant propagate through keyword arguments (yet).

I'm really looking forward to that one.

bramtayl commented 5 years ago

Oooh TableOperations looks exciting. It looks a little less fully featured than I would like, though. Are there more functions planned (e.g. remove, rename, based_on, gather, spread, etc.)?

piever commented 5 years ago

> Are there more functions planned (e.g. remove, rename, based_on, gather, spread, etc.)?

I think that's where you come in :)

More seriously, I planned to contribute some things but don't really have the resources right now. From what I understand @quinnj just put up a proof of concept to have a place where we can gather all the various implementations of things that can be expressed purely in this getproperty interface.

bramtayl commented 5 years ago

I mean that's kinda what I did here too. I'm happy to pitch in wherever I can. Down with dplyr.

davidanthoff commented 5 years ago

We've been working for a while on adding similar things to Query.jl/QueryOperators.jl, see https://github.com/queryverse/Query.jl/pull/209 and https://github.com/queryverse/Query.jl/pull/213.

We originally created NamedTupleUtilities to hold all the utilities that make NamedTuple manipulation easier. For now we've decided to have the code in QueryOperators.jl, though, until we have settled on a final, more stable interface. Just makes it easier to iterate. But we do plan to move these into their own package eventually, so I'm generally very interested in a package that holds named tuples helpers.

bramtayl commented 5 years ago

Then maybe it makes sense to merge the NamedTuples stuff from here into NamedTupleUtilities? Then this package would sink back into holding just two macros.

bramtayl commented 5 years ago

@andyferris did you get a chance to take another look? @piever I took a second look at TableOperations and it seems like it doesn't really quite pass Occam's razor. Why can't named-tuple-like structures simply overload the methods here (or if not here, then wherever standardized named tuple operations will live)?

piever commented 5 years ago

> Why can't named-tuple-like structures simply overload the methods here (or if not here, then wherever standardized named tuple operations will live)

I'm not sure I follow, but I think the idea is that a row of a table is whatever object implements getproperty and propertynames. The entries are defined by what getproperty returns for the various propertynames. The basic idea is that to implement something that is compliant with this interface, you shouldn't assume a specific type but should write everything in terms of getproperty and propertynames and return any type of object that also implements the interface. The idea of an interface (think AbstractArray interface or iterator interface) is that by overloading a minimum number of methods you get a lot of functionality.
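To illustrate writing against the interface rather than a concrete type, here is a small sketch. `to_namedtuple` and `Point` are hypothetical names; the function uses only `propertynames` and `getproperty`, so it accepts a NamedTuple and a plain struct alike.

```julia
# Works for any object implementing propertynames and getproperty.
function to_namedtuple(row)
    names = propertynames(row)
    NamedTuple{names}(map(name -> getproperty(row, name), names))
end

# A plain struct gets both functions from Base's default definitions.
struct Point
    x::Int
    y::Int
end

from_struct = to_namedtuple(Point(1, 2))
from_namedtuple = to_namedtuple((x = 1, y = 2))
```

Both calls produce the same named tuple, which is the sense in which the two row types are interchangeable under this interface.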

You may choose as a return type for your functions whichever object implements the interface (it can always be a NamedTuple but also something else if for whatever reason there are more convenient alternatives).

bramtayl commented 5 years ago

Ok, got it, so then what I really need to do is go through the NamedTuple operations in LightQuery and see if I can reduce them to only use getproperty and propertynames, then PR the results to TableOperations?

bramtayl commented 5 years ago

Ok, well I did a bunch of refactoring to get LightQuery to only use getproperty and propertynames. Still having two constant prop issues (one in rename, one in unname for structs) that I think are really things for Base to work on.

bramtayl commented 5 years ago

Oops, forgot to push. It's up now.

bramtayl commented 5 years ago

Got stalled here https://github.com/JuliaData/Tables.jl/issues/47

bramtayl commented 5 years ago

@andyferris I put up a fuller version and I'm excited about it

andyferris commented 5 years ago

@bramtayl Sorry, I unfortunately haven't had much time for Julia in the last month. I will say that what you've got looks useful. I'm sure it's a collection of tools that help you get stuff done :) For example, gather and spread seem pretty useful!

(To explain why I emphasised useful: I've been avoiding writing anything too useful because I'm mostly trying to pursue/understand the right abstractions, which for me at least is a very slow process...)

bramtayl commented 5 years ago

Useful sounds good to me. Hopefully the experiment here can help you figure out what the "right" abstractions are.