davidagold / StructuredQueries.jl

Query representations for Julia

at-querying a Query #20

Open davidagold opened 8 years ago

davidagold commented 8 years ago

Suppose a user produces a Query:

qry = @query filter(:src, A > .5) |>
    select(B, C)

It seems reasonable that a user ought to be able to extend this query by using it as the source of another query:

qry2 = @query groupby(qry, C)

More specifically, collecting against qry2 should have the same result as collecting against

@query filter(:src, A > .5) |>
    select(B, C) |>
    groupby(C)

I can see two ways to achieve this desired behavior:

  1. At the level of the Query object itself, via a constructor:
function Query(source::Query, graph::QueryNode)
    graph.input = source.graph
    Query(source.source, graph)
end
  2. At the level of collect:
collect(qry::Query, q::QueryNode) = collect(collect(qry), q)

I think I slightly prefer the first way.
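
A minimal sketch of the first option, using stand-in types. The `QueryNode` and `Query` definitions and the `verbs` helper below are assumptions made for illustration, not the package's actual definitions. Unlike the one-line version above, the constructor here walks to the root of the new graph before attaching the existing one, so it also works when the extension has more than one node.

```julia
# Hypothetical stand-ins; the real StructuredQueries types may differ.
mutable struct QueryNode
    verb::Symbol
    args::Vector{Any}
    input::Union{QueryNode, Nothing}
end

struct Query
    source::Any
    graph::QueryNode
end

# Option 1: graft the new graph onto the graph of the existing query.
function Query(source::Query, graph::QueryNode)
    root = graph
    while root.input !== nothing   # find the node that still lacks an input
        root = root.input
    end
    root.input = source.graph
    return Query(source.source, graph)
end

# List the verbs in execution order (source side first).
function verbs(q::Query)
    out = Symbol[]
    node = q.graph
    while node !== nothing
        pushfirst!(out, node.verb)
        node = node.input
    end
    return out
end

filter_node = QueryNode(:filter, Any[:(A > 0.5)], nothing)
select_node = QueryNode(:select, Any[:B, :C], filter_node)
qry  = Query(:src, select_node)
qry2 = Query(qry, QueryNode(:groupby, Any[:C], nothing))

verbs(qry2)   # [:filter, :select, :groupby]
```

Note that in this sketch `qry2` shares its `filter` and `select` nodes with `qry` rather than copying them; whether sharing or copying is preferable is a separate design question.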

yeesian commented 8 years ago

would it be qry2 = @query groupby(qry, C) or qry2 = @query groupby($qry, C)?

davidagold commented 8 years ago

It would be the former. If @query sees that a manipulation command, e.g. groupby, is not piped an argument, then the macro assumes that the first argument names a data source rather than being a query argument. Interpolation is only necessary if the value appears in the context of a query argument, e.g. an expression to be mapped over columns.
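
To make the rule concrete, here is a rough sketch of how a quoted query expression could be split into its source and its query arguments. `parse_pipeline` and `PIPE` are hypothetical names, not part of StructuredQueries, and the real macro may well work differently.

```julia
# Hypothetical helper: split a quoted query expression into its data source
# and an ordered list of (verb, query arguments).
const PIPE = Symbol("|>")

function parse_pipeline(ex)
    if ex isa Expr && ex.head == :call && ex.args[1] == PIPE
        # lhs |> verb(args...): the verb is piped its input, so every
        # argument of the call is a query argument.
        source, steps = parse_pipeline(ex.args[2])
        call = ex.args[3]
        push!(steps, (call.args[1], call.args[2:end]))
        return source, steps
    elseif ex isa Expr && ex.head == :call
        # Unpiped verb: the first argument names the data source.
        return ex.args[2], Any[(ex.args[1], ex.args[3:end])]
    else
        # A bare source at the head of a pipeline.
        return ex, Any[]
    end
end

source, steps = parse_pipeline(:(filter(src, A > 0.5) |> select(B, C)))
source2, steps2 = parse_pipeline(:(groupby(qry, C)))
```

In the second call `qry` comes back as the source and `C` as the only query argument, which is why no `$` is needed on `qry`.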

yeesian commented 8 years ago

Interpolation is only necessary if the value appears in the context of a query argument, e.g. an expression to be mapped over columns.

Might that not be supported in the future?

davidagold commented 8 years ago

It will be supported, either as interpolation or something of a "prepared statements" API.

yeesian commented 8 years ago

It will be supported, either as interpolation or something of a "prepared statements" API.

Thanks for clarifying, so

qry2 = @query groupby($qry, C)

will result in a "prepared statement", whereas

qry2 = @query groupby(qry, C)

might result in something different?

davidagold commented 8 years ago

Oh, I see what you were asking. No, there won't be interpolation or prepared statements for data sources. I see both interpolation and prepared statements as answers to the question, How do I refer to a value outside the "scope" of @query inside a query argument to a manipulation verb? In my mind, the query argument realm -- i.e., non-data source arguments passed to manipulation verbs like groupby -- is entirely agnostic about how a data source is specified. Interpolation and prepared statements belong to that realm.

In the case of extending a Query, one is treating the Query object as a data source, and so mention of it within @query does not belong to the realm with which interpolation and prepared statements are concerned. The analogue to "interpolation" behavior for sources is the dummy source functionality.

Maybe a good way to summarize is: Interpolation/prepared statements let you use different values in the same query, e.g. different values for c in filter(tbl, A > $c). Dummy sources let you collect the same query against different backends.
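
A toy illustration of the two axes, in plain Julia. `ToyQuery` and `collect_toy` are made up for this sketch and are not the package's API; the "backends" are just vectors of named tuples.

```julia
# Hypothetical toy: a query that names its source with a Symbol placeholder
# and whose predicate takes a parameter c.
struct ToyQuery
    source::Symbol        # dummy source, e.g. :tbl
    predicate::Function   # (row, c) -> Bool
end

# Resolve the dummy source from the keyword arguments, then filter using c.
function collect_toy(q::ToyQuery, c; sources...)
    rows = sources[q.source]
    return [row for row in rows if q.predicate(row, c)]
end

q = ToyQuery(:tbl, (row, c) -> row.A > c)
small = [(A = 1, B = "x"), (A = 3, B = "y")]
large = [(A = 5, B = "z"), (A = 0, B = "w")]

r1 = collect_toy(q, 2; tbl = small)   # same query and value, one source
r2 = collect_toy(q, 2; tbl = large)   # same query and value, another source
r3 = collect_toy(q, 0; tbl = small)   # same source, different value for c
```

Swapping `tbl` changes where the rows come from; swapping `c` changes which rows pass the filter. The query itself stays the same in all three calls.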

yeesian commented 8 years ago

That's a good enough distinction for me, thanks!

So it should be

x = 15
tbl1 = # some datasource
tbl2 = # some datasource
...
qry = @query :table1 |> innerjoin(:table2, ...) |> where(table2.col1 > $x)
collect(qry, table1 = tbl1, table2 = tbl2)

rather than

qry = @query :table1 |> innerjoin($tbl2, ...) |> where(table2.col1 > $x)
collect(qry, table1 = tbl1)

?

yeesian commented 8 years ago

Interpolation and prepared statements belong to that realm. [...] The analogue to "interpolation" behavior for sources is the dummy source functionality.

How about the following proposal(s):

davidagold commented 8 years ago

Actually, I think all mentions of dummy sources within @query will require prepending with :. So it would be

qry = @query :table1 |> innerjoin(:table2, ...) |> where(:table2.col1 > $x)
collect(qry, table1 = tbl1, table2 = tbl2)

I take the table2 without prepending the : to be a direct reference to the object table2 in the scope in which @query is invoked. So the above would be equivalent to

@collect tbl1 |> innerjoin(tbl2, ...) |> where(tbl2.col1 > $x)

whereas

qry = @query :table1 |> innerjoin(:table2, ...) |> where(table2.col1 > $x)
collect(qry, table1 = tbl1, table2 = tbl2)

would be equivalent to

@collect tbl1 |> innerjoin(tbl2, ...) |> where(table2.col1 > $x)

An alternative to having to repeatedly prepend : would be using an alias:

qry = @query begin
    tbl = :table2
    :table1 |> innerjoin(tbl, ...) |> where(tbl.col1 > $x)
end

As for your proposals, here are my thoughts:

for the analog of prepared statements, we introduce "dummy" placeholders (via :) to be filled in via keyword args in collect() later on.

I can see why you want to unify the dummy source and prepared statements functionalities, but I do like having the syntax reflect the conceptual distinction between collecting a (fixed) query against different sources and collecting a prepared query with varying parameter values against a fixed source. One may want (I don't exactly know why, but I don't see why we shouldn't support it) to bind different values to a parametrized (prepared) Query without collecting it, in which case the binding mechanism ought to be different than collect -- e.g. something like

qry = @query tbl |>
    filter(A > c::Int) |>
    select(B)

for _c in [1, 2, 3]
    bind!(qry, c = _c)
    do_something(qry)
end
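
For concreteness, a rough sketch of what binding without collecting might look like. Everything here (`PreparedQuery`, `run_query`, and this particular `bind!` signature) is hypothetical; the declared parameter types stand in for the `c::Int` annotation above.

```julia
# Hypothetical sketch of a parametrized query with typed parameters.
struct PreparedQuery
    predicate::Function                   # (row, params) -> Bool
    param_types::Dict{Symbol, DataType}   # declared types, e.g. c::Int
    params::Dict{Symbol, Any}             # current bindings
end

# Bind values to parameters without collecting; reject unknown names and
# values of the wrong type.
function bind!(q::PreparedQuery; kwargs...)
    for (name, value) in kwargs
        T = get(q.param_types, name, nothing)
        T === nothing && throw(ArgumentError("unknown parameter $name"))
        value isa T || throw(ArgumentError("$name must be a $T"))
        q.params[name] = value
    end
    return q
end

# Stand-in for collecting against a fixed source.
run_query(q::PreparedQuery, rows) = [r for r in rows if q.predicate(r, q.params)]

qry = PreparedQuery((r, p) -> r.A > p[:c],
                    Dict{Symbol, DataType}(:c => Int),
                    Dict{Symbol, Any}())
tbl = [(A = 1, B = "x"), (A = 2, B = "y"), (A = 3, B = "z")]

counts = Int[]
for _c in [1, 2, 3]
    bind!(qry, c = _c)
    push!(counts, length(run_query(qry, tbl)))
end
```

Keeping `bind!` separate from the collect step means a bound query can be passed around or inspected before anything is executed.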

The second point concerning using : for both dummy sources and parametrized queries is that I think the syntax for the latter may need to include some way of specifying the type of values that the parameter will take. Though it's possible that this won't be necessary, and that we will be able to place function barriers inside the collect machinery for column-indexable tabular data structures in such a way that allows type inference to figure out what's going on when we map, say, a filtering lambda over not only a tuple of columns but also over query parameters.

Finally, there's an argument against using : to signify interpolation that applies equally to using : to designate query parameters, and that is that it renders the user unable to talk about Symbol literals in query arguments. For instance, tbl[:A] may be a column of Symbol objects, but if : denotes a query parameter then you can't naively express the query "select the subset of rows of tbl where the A attribute is equal to :a" with

@collect filter(tbl, A == :a)

-- you'd have to do

qry = @query filter(tbl, A == :a)
collect(qry, a = :a) # or, as I'd prefer, `bind`

which I'm not really a fan of.

Note that the dummy source functionality doesn't run into this problem because, within @query, sources don't appear within query arguments. Now, if we use : to designate dummy sources and allow :alias.attribute as an identifier, then :alias does appear within a query argument. However, it is easily distinguished from a literal Symbol argument: as an object in a Julia AST, :alias.attribute is not a Symbol literal but an Expr with head :., which can be handled separately during parsing.
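
The AST difference can be checked directly with `Meta.parse`. The `classify` function is a hypothetical helper written for this sketch, and the category names it returns are made up.

```julia
# Hypothetical classifier for a single node inside a query argument.
function classify(node)
    if node isa QuoteNode && node.value isa Symbol
        return :symbol_literal            # e.g. :a
    elseif node isa Expr && node.head == Symbol(".") && node.args[1] isa QuoteNode
        return :dummy_source_attribute    # e.g. :alias.attribute
    else
        return :other
    end
end

literal = Meta.parse(":a")
field   = Meta.parse(":alias.attribute")   # parses as (:alias).attribute
plain   = Meta.parse("tbl.attribute")
rhs     = Meta.parse("A == :a").args[3]    # the :a inside a filter condition

classify(literal)   # :symbol_literal
classify(field)     # :dummy_source_attribute
classify(plain)     # :other
classify(rhs)       # :symbol_literal
```

Because the quote binds more tightly than the dot, `:alias.attribute` reaches the macro as a field-access expression whose first argument is a quoted Symbol, so it cannot be confused with a bare `:a`.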

table.column_name and column_name are both allowed

Yes, and we will provide a definitive way to communicate to which source an un-prefixed column_name is to belong.

[If supported] for interpolation/splicing (via $) to retain the same meaning as they do in Julia MetaProgramming.

This tentatively sounds good, too -- I'll need to reason through this a bit more and see if it makes sense, since technically one is not interpolating into an Expr object as one does in Julia metaprogramming. But I agree with the spirit of this suggestion.

Also, I'll add that if I had to choose $ for use in either a prepared statements API or an interpolation API, I think I'd opt for the former, since I think it will accomplish what folks wish to do with the latter, but more efficiently.