davidagold / StructuredQueries.jl

Query representations for Julia

Take a look at LINQ.jl #9

Closed davidanthoff closed 8 years ago

davidanthoff commented 8 years ago

I had started some work on a LINQ port for Julia a while ago. The goals for the package are essentially the same as jlplyr.jl. I'm not very far along, but there is probably enough code to get a sense of where the package is going and to figure out whether there is any chance of ever getting the performance and syntax to work.

The repo for this is here.

It would be great to get some feedback. I'm not sure the path in LINQ.jl is any better than what is attempted here, but it certainly seems worthwhile to compare the two approaches.

davidagold commented 8 years ago

Thank you for sharing your work! I was sorry not to be able to see your presentation in person.

One of the goals I'm trying to achieve is to disentangle the manipulation interface from any implementation that actually manipulates an in-memory Julia table (be it a DataFrame or otherwise). My strategy for achieving this is to lower user manipulation commands to a graph and then, in turn, lower the graph based on the type of the underlying data source. It's not clear to me to what extent you envision a similar objective for LINQ.jl. Could you elaborate on this?
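To make the two-stage idea concrete, here is a minimal sketch in Python rather than Julia. It is not the code of either package: the node types `Source`, `Filter`, `Select`, the `ColumnTable` class, and the `collect` function are all hypothetical names invented for illustration. The user-facing layer only builds a graph of nodes; a separate function dispatches on the type of the data source to decide how the graph is executed.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Callable


# --- Stage 1: the query graph (hypothetical node types, independent of any data source) ---

@dataclass(frozen=True)
class Source:
    """Leaf node: a placeholder for whatever data the query is later run against."""


@dataclass(frozen=True)
class Filter:
    """Keep the rows for which predicate(row) is true; a row is a dict of column -> value."""
    parent: object
    predicate: Callable[[dict], bool]


@dataclass(frozen=True)
class Select:
    """Keep only the named columns."""
    parent: object
    columns: tuple


# --- Stage 2: lowering, chosen by the type of the data source ---

class ColumnTable:
    """A toy column-oriented table (hypothetical): a dict of column name -> list of values."""

    def __init__(self, **columns):
        self.columns = columns


@singledispatch
def collect(data, node):
    """Execute the graph rooted at `node` against `data`."""
    raise NotImplementedError(f"no lowering defined for {type(data).__name__}")


@collect.register(list)
def _collect_rows(data, node):
    # Row-oriented source: a list of dicts.
    if isinstance(node, Source):
        return list(data)
    rows = collect(data, node.parent)
    if isinstance(node, Filter):
        return [row for row in rows if node.predicate(row)]
    if isinstance(node, Select):
        return [{c: row[c] for c in node.columns} for row in rows]
    raise TypeError(f"unknown node {node!r}")


@collect.register(ColumnTable)
def _collect_columns(data, node):
    # Column-oriented source: same graph, different execution strategy.
    if isinstance(node, Source):
        return data
    table = collect(data, node.parent)
    if isinstance(node, Select):
        return ColumnTable(**{c: table.columns[c] for c in node.columns})
    if isinstance(node, Filter):
        names = list(table.columns)
        nrows = len(next(iter(table.columns.values()), []))
        keep = [i for i in range(nrows)
                if node.predicate({c: table.columns[c][i] for c in names})]
        return ColumnTable(**{c: [table.columns[c][i] for i in keep] for c in names})
    raise TypeError(f"unknown node {node!r}")


# The same graph can be run against either kind of source.
query = Select(Filter(Source(), lambda row: row["x"] > 1), ("y",))
```

In this sketch the graph `query` never mentions a concrete table type, so adding support for a new data source means registering one more `collect` method and leaving the graph-building layer alone. In Julia the same separation would more naturally be expressed with multiple dispatch.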

Insofar as both of our approaches involve or will involve "lowering" user commands to a graph layer (or some other internal representation of the structure of the query), a comparison of our approaches seems in part a matter of taste. Does the user prefer a LINQ-style or dplyr-style API? It certainly seems like we could each offer a different approach to the user-facing API, and that these can and should co-exist -- though it'd be especially nice if we could coordinate on how our user-facing APIs lower to an internal representation of a query, so that neither of us has to reinvent the wheel and so that the two packages can interoperate.

As for comparisons of implementations for actual table objects, we can let benchmarks decide. Maybe it's best to lower the graph to an iterator involving named tuples. Maybe it's best to lower to something else, e.g. bitbroadcasting a kernel for filter. (Your tests suggest that the latter is more performant, at least for DataFrames.) I'm not wedded to any particular implementation -- my goal is to make everything modular enough so that it's relatively trivial to switch out implementations for a given data source type without touching the user-facing macros.
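The two lowering targets mentioned above can be contrasted in a small sketch, again in Python and again with hypothetical names (`filter_by_rows`, `filter_by_kernel`) that do not come from either package. The first function iterates over named-tuple rows; the second applies a kernel elementwise to only the columns it needs, builds a boolean mask, and then indexes every column with that mask. This is meant to show the shape of the two strategies only, and says nothing about their relative speed in Julia.

```python
from collections import namedtuple


def filter_by_rows(columns, predicate):
    """Row strategy: walk the table as named tuples and keep the rows that match."""
    names = list(columns)
    Row = namedtuple("Row", names)
    kept = [row for row in map(Row._make, zip(*columns.values())) if predicate(row)]
    # Rebuild the column layout from the surviving rows.
    return {name: [getattr(row, name) for row in kept] for name in names}


def filter_by_kernel(columns, arg_names, kernel):
    """Column strategy: evaluate the kernel over the needed columns, then mask each column."""
    mask = [bool(kernel(*values)) for values in zip(*(columns[n] for n in arg_names))]
    return {name: [v for v, keep in zip(col, mask) if keep]
            for name, col in columns.items()}


table = {"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]}
by_rows = filter_by_rows(table, lambda row: row.x % 2 == 0)
by_kernel = filter_by_kernel(table, ("x",), lambda x: x % 2 == 0)
```

Both calls produce the same filtered table. The difference is that the kernel version only touches the `x` column while computing the mask, which is the property that makes a broadcast-style implementation attractive for columnar sources.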

davidanthoff commented 8 years ago

Let's discuss this in JuliaStats/DataFrames.jl#1025.