davidagold opened 8 years ago
I can understand why you'd have things in that order, given the resulting nesting of the function calls, but it does look very strange. It would be much better to reverse the order so that the `MyDataSource` feeds into the `FilterNode`, into the `SelectNode`, and then the `SummarizeNode`. Then it would match the order of the pipes you started with, and also the natural order of operations.
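To make the nesting concrete, here is a toy sketch (the node types and fields are illustrative stand-ins, not jplyr's actual definitions); building source-first makes the construction order match the pipes:

```julia
# Toy node types; fields are illustrative, not jplyr's actual definitions.
abstract type QueryNode end

struct DataNode <: QueryNode
    source::Symbol
end

struct FilterNode <: QueryNode
    input::QueryNode
    args::Vector{Expr}
end

struct SelectNode <: QueryNode
    input::QueryNode
    fields::Vector{Symbol}
end

struct SummarizeNode <: QueryNode
    input::QueryNode
    args::Vector{Expr}
end

# mydata |> filter(a > .5) |> select(a) |> summarize(mean(a))
# built up source-first, so the nesting mirrors the pipe order:
q = SummarizeNode(
        SelectNode(
            FilterNode(DataNode(:mydata), [:(a > 0.5)]),
            [:a]),
        [:(mean(a))])
```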
The other issue is that people are very likely to want to see some output from the query, and not just a description of it... I assume the current plan is that you need to call `collect()` or some equivalent to generate the result? That makes sense in a lazy execution kind of way, but one thing I do like about dplyr is that it generates partial results for lazily evaluated queries if you output them, so you can check you're doing the right thing. E.g.:
```r
> library(tibble)
> library(nycflights13)
> library(dplyr)
> fl <- nycflights13_sqlite()
> bad.ones <- fl %>% tbl("flights") %>% filter(arr_delay > 800)
> bad.ones
Source:   query [?? x 19]
Database: sqlite 3.8.6 [/var/folders/fv/1rqrvwq14ssggm_gc_x4v1cw0000gq/T//RtmpVmLND7/nycflights13.sqlite]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      848           1835       853     1001           1950
2   2013     1     9      641            900      1301     1242           1530
3   2013     1    10     1121           1635      1126     1239           1810
4   2013    12     5      756           1700       896     1058           2020
5   2013    12    14      830           1845       825     1210           2154
6   2013    12    17      705           1700       845     1026           2020
7   2013    12    19      734           1725       849     1046           2039
8   2013     2    10     2243            830       853      100           1106
9   2013     3    17     2321            810       911      135           1020
10  2013     4    10     1100           1900       960     1342           2211
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>
```
I think this is a very useful feature to assist correct construction of queries, especially when the full calculation is computationally expensive.
The dplyr-style display could be handled by defining `show` conditional on the presence of a data source. I believe David has in mind the problem of showing a query without a data source, where dplyr doesn't (as far as I can see) have any analogue.
Actually, I've been assuming a present data source in the above (even though the fact that it's trivial in the example makes it seem otherwise). I suppose the tension is more between `show`ing the return value of an `@query` as a graph of `QueryNode` objects (this is interesting to me because I'm developing the package and work with that structure) or `show`ing the return value as a preview of what `collect` will yield when applied to it (this is probably more interesting to a typical user).
Slightly independently of this discussion, I've been considering introducing a `Query` type to wrap the graph of `QueryNode`s so that `@query` always returns a `Query` object (rather than one of the various leaf `QueryNode`s). What about something like the following mock-up?
```julia
julia> qry = @query mydata |>
           filter(a > .5)
A Query. Call
  graph(::Query)   to inspect the query graph;
  preview(::Query) to preview the results;
  collect(::Query) to return all results.
```
Or maybe `show`ing a `Query` should automatically preview results and remind users of the `graph` and `collect` options?
EDIT: Though John makes a good point that I hadn't really thought much about, which is the ability to produce queries without a data source. Currently, that's not supported, but it should be, and then probably `show` should indeed depend on whether or not a source is present.
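A minimal sketch of what a source-dependent `show` could look like (the `Query` struct, `has_source`, and the printed strings are all hypothetical, not jplyr's actual API; the graph is stood in for by an `Expr`):

```julia
# Hypothetical Query type: `source === nothing` marks a sourceless query.
struct Query
    source::Union{Nothing,Symbol}
    graph::Expr
end

has_source(q::Query) = q.source !== nothing

function Base.show(io::IO, q::Query)
    if has_source(q)
        # with a source: show a (stand-in) preview of the results
        println(io, "Query over source $(q.source); first rows would be shown here")
    else
        # without a source: show the raw query graph
        println(io, "Sourceless query with graph: $(q.graph)")
    end
end
```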
I see. If you have a data source, I would show the result of `collect`, limited to at most 10 rows. I would add something like `showgraph` for developers to use.
> I believe David has in mind the problem of showing a query without a data source, where dplyr doesn't (as far as I can see) have any analogue.
It can be done, if we think of a `Union{Symbol, QueryNode}` as an abstract datasource (for the purposes of manipulation). Some experiments over at https://github.com/yeesian/SQLQuery.jl
> I would add something like `showgraph` for developers to use.
We can call `dump(::QueryNode)`, but it gets verbose fast. There's https://github.com/toivoh/julia-prettyshow#pretty-printing-new-types for some prior art.
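As a sketch of the kind of compact printer `showgraph` might be (node types and field names here are illustrative, not jplyr's actual definitions): an indented, one-line-per-node traversal.

```julia
# Illustrative node types (not jplyr's actual definitions):
abstract type QueryNode end

struct DataNode <: QueryNode
    source::Symbol
end

struct FilterNode <: QueryNode
    input::QueryNode
    args::Vector{Expr}
end

# One line per node, indented by depth: far terser than dump().
showgraph(io::IO, n::DataNode, depth::Int = 0) =
    println(io, "  "^depth, "Data source: ", n.source)

function showgraph(io::IO, n::FilterNode, depth::Int = 0)
    println(io, "  "^depth, "FilterNode with arguments: ", join(n.args, ", "))
    showgraph(io, n.input, depth + 1)
end

showgraph(n::QueryNode) = showgraph(stdout, n)
```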
I like the idea of a `Query` wrapper for a complete query. I wonder whether there isn't an opportunity here to separate a query without a data source and a query with one. The latter could produce a `Query`, and `show()` would produce a preview (with `showgraph()` showing the graph for developers), whereas without a data source, you just produce a graph of `QueryNode`s, and `show()` gives you the graph. You can then assemble a `Query` from the data source and the query graph.
@richardreeve I think they should be separated as far as `show` is concerned, but the more I think about it, the more I think a `Query` should be returned in both cases (i.e., when there is a data source and when there isn't). I think most end users shouldn't have to touch any of the `QueryNode` objects. I'll make it so that if you have an incomplete (i.e., sourceless) `Query`, you can combine it with a source in `preview` and `collect`. So, `preview(src, qry)` would preview the result of `collect`ing a sourceless query `qry` over a source `src`, and likewise `collect(src, qry)` would actually collect a sourceless query `qry` against `src`.
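A rough sketch of that API, with a hypothetical `Query` type whose graph is stood in for by an `Expr` (none of these names are jplyr's actual definitions; `preview` and `collect` return strings here purely to show how the source and graph come together):

```julia
# Hypothetical Query type; the graph is stood in for by an Expr.
struct Query
    source::Union{Nothing,Symbol}
    graph::Expr
end

# Attach a source to a sourceless query.
attach(src::Symbol, q::Query) =
    q.source === nothing ? Query(src, q.graph) : error("query already has a source")

# Stand-ins: a real implementation would run the query; these just show
# the two-argument forms combining a source with a sourceless query.
preview(src::Symbol, q::Query) = "first rows of $(attach(src, q).graph) over $src"
Base.collect(src::Symbol, q::Query) = "all rows of $(attach(src, q).graph) over $src"

qry = Query(nothing, :(filter(a > 0.5)))
preview(:mydata, qry)
```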
There's no problem with that, and I agree that

> most end users shouldn't have to touch any of the `QueryNode` objects.
However, my counter to that is that most end users will never think of creating queries without an attached data source. That seems (to me) to be a pretty specialised thing, and so it's appropriate that it exposes some of the underlying mechanics when you do that. It also makes it crystal clear that you're handling a conceptually different thing - an isolated query graph rather than a fully implemented query.
Couldn't running a query, even just to get the first 10 elements, be really costly/slow? If so, displaying even just the first 10 results might not be a good idea. Unless you can guarantee that collecting the first n elements is more or less instantaneous, I would not show results in `show`.
I understand the concern, but I think the act of applying `show` to a fully instantiated query (i.e. with a data source) is a strong sign that you'd like a display of the output. The `dplyr`/`tibble` solution in R is to calculate the output for a local data frame, or to send the SQL query to the database without `collect`ing the full output, and in both cases to display only the first page of results; that seems to me a good compromise between computational cost and helpfulness. After all, you never have to call `show` if you know the output will be costly.
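The first-page compromise can be sketched generically: assuming the backend hands back a lazy row iterator (the generator below is a stand-in for that), a hypothetical `preview` realizes only a bounded number of rows, so the cost of `show` is independent of the full result size.

```julia
# Stand-in for a lazily evaluated query result: nothing is computed
# until rows are demanded.
lazy_rows(n) = ((i, 2i) for i in 1:n)

# Hypothetical preview: realize at most `limit` rows, however large the
# full result would be.
preview(rows; limit = 10) = collect(Iterators.take(rows, limit))

page = preview(lazy_rows(10^9))   # touches only the first 10 rows
```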
> After all, you never have to call show if you know the output will be costly.
I thought this was mainly for REPL output? I just generally feel that if you have a var `x` and just want to see what it is on the REPL, that should never trigger anything that might block the REPL for a long time.
Absolutely, it is. When I said "call `show`", I meant type the name of the variable at the REPL. I understand the desire not to block the REPL for long periods of time, but the truth is we also want the REPL to be useful, and I don't think it's useful for the REPL to just spit out a tidied version of what you typed. For instance, if I type:
```julia
julia> type MyDataSource end

julia> mydata = MyDataSource()
MyDataSource()

julia> @query mydata |>
           filter(a > .5)
```
I wouldn't find it helpful for it to return:
```
Data source: MyDataSource()
→→ jplyr.FilterNode with arguments:
    a > 0.5
```
As discussed above, you can always call `showgraph` on a `Query` if you want to see the graph structure without forcing any computation.
It's quite possible I'm wrong about this. Maybe most people have (a) hugely computationally expensive queries (b) where the whole result has to be calculated and can't be cached when you want a preview, and (c) call `show` (at the REPL) when they don't want to see any results. However, my feeling is that most people at the REPL will want a preview, and those with a computationally expensive task will likely use SQL, in which case you're fine: SQL database interfaces ordinarily let you send a query to the database and start retrieving output from it as it arrives, so there is caching and therefore little computational waste.
I have cases where this argument might fail myself - in particular where I intend to use Dagger for distributed processing of massive arrays, which could potentially be formed into a query at some point - but I see this as a corner case which should not interfere with the general utility of the REPL.
What information is most useful to present, and what format is most readable, when showing the structure and contents of a query graph? I've been playing around with this a bit and have come up with something like:
I think it's on its way, but it needs some work. Suggestions are welcome in this issue.