davidagold / StructuredQueries.jl

Query representations for Julia

Show methods for graphs #15

Open davidagold opened 8 years ago

davidagold commented 8 years ago

What information is best to present, and what format is most readable, when showing the structure and contents of a query graph? I've been playing around with this a bit and have come up with something like:

julia> using jplyr

julia> type MyDataSource end

julia> mydata = MyDataSource()
MyDataSource()

julia> qry = @query mydata |>
           filter(a > .5) |>
           select(c = a * b) |>
           summarize(avg_c = mean(c))
jplyr.SummarizeNode with arguments:
    avg_c=mean(c)

→→  jplyr.SelectNode with arguments:
      c=a * b

   →→  jplyr.FilterNode with arguments:
        a > 0.5

      →→  Data source: MyDataSource()

I think it's on its way, but it needs some work. Suggestions are welcome in this issue.

richardreeve commented 8 years ago

I can understand why you'd have things in that order, given the resulting nesting of the function calls, but it does look very strange. It would be much better to reverse the order so that the MyDataSource feeds into the FilterNode, into the SelectNode and then the SummarizeNode. Then it would match the order of the pipes you started with, and also the natural order of operations.

The other issue is that people are very likely to want to see some output from the query, and not just the description of it... I assume the current plan is that you need to call collect() or some equivalent to generate the result? That makes sense in a lazy execution kind of way, but one thing I do like about dplyr is that it generates partial results for lazily evaluated results if you output them, so you can check you're doing the right things. e.g.:

> library(tibble)
> library(nycflights13)
> library(dplyr)
> fl <- nycflights13_sqlite()
> bad.ones <- fl %>% tbl("flights") %>% filter(arr_delay > 800)
> bad.ones
Source:   query [?? x 19]
Database: sqlite 3.8.6 [/var/folders/fv/1rqrvwq14ssggm_gc_x4v1cw0000gq/T//RtmpVmLND7/nycflights13.sqlite]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      848           1835       853     1001           1950
2   2013     1     9      641            900      1301     1242           1530
3   2013     1    10     1121           1635      1126     1239           1810
4   2013    12     5      756           1700       896     1058           2020
5   2013    12    14      830           1845       825     1210           2154
6   2013    12    17      705           1700       845     1026           2020
7   2013    12    19      734           1725       849     1046           2039
8   2013     2    10     2243            830       853      100           1106
9   2013     3    17     2321            810       911      135           1020
10  2013     4    10     1100           1900       960     1342           2211
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

I think this is a very useful feature to assist correct construction of queries especially when the full calculation is computationally expensive.

johnmyleswhite commented 8 years ago

The dplyr-style display could be handled by defining show conditional on the presence of a data source. I believe David has in mind the problem of showing a query without a data source, where dplyr doesn't (as far as I can see) have any analogue.
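
In sketch form, the conditional show John describes could look roughly like this. Everything here is hypothetical and illustrative (the names `Query`, `hassource`, `show_preview`, `show_graph` are not part of jplyr); it just assumes a wrapper that records whether a source is attached:

```julia
# Hypothetical sketch only; none of these names are part of jplyr.
abstract type QueryNode end

struct Query
    graph::QueryNode
    source::Any          # `nothing` when the query has no data source
end

hassource(q::Query) = q.source !== nothing

function Base.show(io::IO, q::Query)
    if hassource(q)
        show_preview(io, q)        # dplyr-style: print the first few result rows
    else
        show_graph(io, q.graph)    # no source: fall back to the graph structure
    end
end
```

With this dispatch, the same qry prints a result preview once a source is attached and a graph otherwise, without the user calling anything explicitly.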

davidagold commented 8 years ago

Actually, I've been assuming a data source is present in the above (even though its triviality in the example makes it seem otherwise). I suppose the tension is more between showing the return value of an @query as a graph of QueryNode objects (this is interesting to me because I'm developing the package and work with that structure) and showing the return value as a preview of what collect will yield when applied to it (this is probably more interesting to a typical user).

Slightly independently of this discussion, I've been considering introducing a Query type to wrap the graph of QueryNodes so that @query always returns a Query object (rather than one of the various leaf QueryNodes). What about something like the following mock-up?

julia> qry = @query mydata |>
           filter(a > .5)
A Query. Call 
    graph(::Query) to inspect the query graph; 
    preview(::Query) to preview the results; 
    collect(::Query) to return all results.

Or maybe showing a Query should automatically preview results and remind users of the graph and collect options?

EDIT: Though John makes a good point that I hadn't really thought much about, which is the ability to produce queries without a data source. Currently, that's not supported, but it should be, and then probably show should indeed depend on whether or not a source is present.

johnmyleswhite commented 8 years ago

I see. If you have a data source, I would show the result of collect, limited to at most 10 rows. I would add something like showgraph for developers to use.
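
A 10-row cap could be sketched like so, assuming some lazy iterator over result rows; `rows` and `show_preview` are illustrative names, not an actual jplyr API:

```julia
# Sketch only: `rows` is a hypothetical function yielding result rows lazily.
const SHOW_LIMIT = 10

function show_preview(io::IO, qry)
    # Take one extra row so we can tell whether output was truncated.
    taken = collect(Iterators.take(rows(qry), SHOW_LIMIT + 1))
    for row in taken[1:min(end, SHOW_LIMIT)]
        println(io, row)
    end
    length(taken) > SHOW_LIMIT && println(io, "# ... with more rows")
end
```

Because Iterators.take only pulls what it needs, a backend that produces rows lazily would never materialize more than 11 rows for the preview.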

yeesian commented 8 years ago

I believe David has in mind the problem of showing a query without a data source, where dplyr doesn't (as far as I can see) have any analogue.

It can be done, if we think of a Union{Symbol, QueryNode} as an abstract datasource (for the purposes of manipulation). Some experiments over at https://github.com/yeesian/SQLQuery.jl

I would add something like showgraph for developers to use.

We can call dump(::QueryNode), but it gets verbose fast. There's https://github.com/toivoh/julia-prettyshow#pretty-printing-new-types for some prior art.

richardreeve commented 8 years ago

I like the idea of a Query wrapper for a complete query. I wonder whether there isn't an opportunity here to separate a query without a data source from a query with one. The latter could produce a Query, and show() would produce a preview (with showgraph() showing the graph for developers), whereas without a data source you would just produce a graph of QueryNodes, and show() would give you the graph. You could then assemble a Query from the data source and the query graph.

davidagold commented 8 years ago

@richardreeve I think they should be separated as far as show is concerned, but the more I think about it, the more I think a Query should be returned in both cases (i.e., when there is a data source and when there isn't). I think most end users shouldn't have to touch any of the QueryNode objects. I'll make it so that if you have an incomplete (i.e., sourceless) Query, you can combine it with a source in preview and collect. So, preview(src, qry) would preview the result of collecting a sourceless query qry over a source src, and likewise collect(src, qry) would actually collect a sourceless query qry against src.
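
That source-binding could be sketched as follows, assuming (hypothetically; these names are not the actual jplyr API) a Query that stores its graph and an optional source:

```julia
# Hypothetical sketch, not the actual jplyr API.
struct Query
    graph           # root QueryNode of the query graph
    source          # `nothing` for a sourceless query
end

# Bind a source to a sourceless query at the call site.
withsource(src, qry::Query) = Query(qry.graph, src)

# Two-argument forms delegate to the one-argument forms on the bound query.
preview(src, qry::Query) = preview(withsource(src, qry))
Base.collect(src, qry::Query) = collect(withsource(src, qry))
```

The delegation keeps a single code path: the two-argument methods exist purely for ergonomics, and all the real work happens on a Query that has a source.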

richardreeve commented 8 years ago

There's no problem with that, and I agree that

most end users shouldn't have to touch any of the QueryNode objects.

However, my counter to that is that most end users will never think of creating queries without an attached data source. That seems (to me) to be a pretty specialised thing, and so it's appropriate that it exposes some of the underlying mechanics when you do that. It also makes it crystal clear that you're handling a conceptually different thing - an isolated query graph rather than a fully implemented query.

davidanthoff commented 8 years ago

Running a query, and even just getting the first 10 elements, is potentially really costly or slow. If so, displaying even just the first 10 results might not be a good idea. Unless you can guarantee that collecting the first n elements is more or less instantaneous, I would not show results in show.

richardreeve commented 8 years ago

I understand the concern, but I think the act of applying show to a fully instantiated query (i.e. one with a data source) is a strong sign that you'd like a display of the output. The dplyr/tibble approach in R (computing the output for a data frame, but for a database only sending the SQL query without collecting the full output, and in both cases displaying just the first page of results) seems to me a good compromise between computational cost and helpfulness. After all, you never have to call show if you know the output will be costly.

davidanthoff commented 8 years ago

After all, you never have to call show if you know the output will be costly.

I thought this was mainly for REPL output? I just generally feel that if you have a var x and just want to see what it is on the REPL, that should never trigger anything that might block the REPL for a long time.

richardreeve commented 8 years ago

Absolutely, it is. When I said "call show", I meant type the name of the variable at the REPL. I understand the desire not to block the REPL for long periods of time, but the truth is we also want the REPL to be useful, and I don't think it's useful for the REPL to just spit out a tidied version of what you typed. For instance, if I type:

julia> type MyDataSource end
julia> mydata = MyDataSource()
MyDataSource()
julia> @query mydata |>
           filter(a > .5)

I wouldn't find it helpful for it to return:

Data source: MyDataSource()
→→  jplyr.FilterNode with arguments:
        a > 0.5

As discussed above, you can always call showgraph on a Query if you want to see the graph structure without enforcing any computation.

It's quite possible I'm wrong about this - maybe most people have (a) hugely computationally expensive queries (b) where the whole result has to be calculated and can't be cached when you want a preview and (c) call show (at the REPL) when they don't want to see any results. However, my feeling is that most people at the REPL will want a preview, and those with a computationally expensive task will likely be using SQL anyway. In that case you're fine: SQL database interfaces ordinarily let you send a query to the database and start retrieving output from it incrementally, so there is caching and therefore little computational wastage.

I myself have cases where this argument might fail - in particular where I intend to use Dagger for distributed processing of massive arrays, which could potentially be formed into a query at some point - but I see this as a corner case which should not interfere with the general utility of the REPL.