MechanicalRabbit / FunSQL.jl

Julia library for compositional construction of SQL queries
https://mechanicalrabbit.github.io/FunSQL.jl
Other
146 stars 6 forks source link

bound references are not rebased #13

Closed xitology closed 6 months ago

xitology commented 2 years ago
using FunSQL: SQLTable, From, Define, Where, Get, render

person = SQLTable(:person, :person_id, :year_of_birth, :location_id)

ByYOB(yob) =
    (q = Define(:yob => Get.year_of_birth)) |>
    Where(q.yob .== yob)

q = From(person) |>
    ByYOB(1950)

print(render(q))
#=>
ERROR: GetError: cannot find yob in:
let person = SQLTable(:person, …),
    q1 = From(person),
    q2 = q1 |> Define(Get.year_of_birth |> As(:yob)),
    q3 = Define(Get.year_of_birth |> As(:yob)),
    q4 = q2 |> Where(Fun."=="(q3.yob, Lit(1950)))
    q4
end
=#
ananis25 commented 2 years ago

Just ran into this issue. If I understand correctly, ~binary operators in programming languages are typically left associative~ (associativity is not relevant, what matters is if we store relationships in the parent node, or the child node), so rebasing q >> Where(q.yob ...) produces a copy of the node q to put in the final query tree, while the original q is left detached.

It is an interesting conundrum. When the bound reference is created, q could also store a reference to Get.yob, to go back and update it whenever q is rebased. However, this makes tabular nodes stateful which might be problematic when reusing nodes. Do you see any other way around it?

ananis25 commented 2 years ago

So, this error is not limited to bound references. Consider the following example.

table = SQLTable(:table, columns=[:x, :y])
function self_join()
    q = Where(Get.x .== 100)
    return q |> Join(q |> As(:alias), on=Get.x .== Get.alias.x) |> Select(Get.y)
end

query = From(table) |> self_join()
print(render(query))

# ERROR: FunSQL.ReferenceError: cannot find x in:
# let table = SQLTable(:table, …),
#     q1 = From(table),
#     q2 = Get.x,
#     q3 = Lit(100),
#     q4 = Fun."=="(q2, q3),
#     q5 = q1 |> Where(q4),
#     q6 = Where(q4),
#     q7 = q5 |> Join(q6 |> As(:alias), Fun."=="(Get.x, Get.alias.x)),
#     q8 = q7 |> Select(Get.y)
#     q8
# end

So, the issue really seems to be that FunSQL queries are not trees but DAGs. When we rebase the query pipeline A-B-C, we leave node E detached. dags_not_trees

I haven't found a nice way to solve this yet.

  1. Tracking bound references with a tabular node as parent in the parent isn't sufficient, since Join nodes see the same error.

  2. The simple solution is to make the query tree mutable and make rebase an operation that just descends down the line of parent nodes, and swaps the null value at the end. That is, X |> (A |> B |> C) just replaces the parent node for A.

    However, now we need to ensure query fragments are never reused, which is not nice. The neat thing about FunSQL is that the data structures are kind of immutable, and can be freely composed.

  3. Visit and collect all SQLNodes in the query tree that need to be rebased; go over them in the topological order, i.e A, B, E, C in the diagram and make a copy of each.

    Given each SQLNode has multiple attributes that could be nodes themselves, this would be an expensive and tedious.

What do you think?

xitology commented 2 years ago

It is an interesting conundrum. When the bound reference is created, q could also store a reference to Get.yob, to go back and update it whenever q is rebased. However, this makes tabular nodes stateful which might be problematic when reusing nodes. Do you see any other way around it?

This is the place where FunSQL is a bit inconsistent. In general, FunSQL ignores the node identity, which means that any node in a query can be replaced with its structural copy. However, this is not valid when the node has a bound reference. In particular, a bound reference can be invalidated by query composition (|>) because composition rewrites its right argument.

q = a |> b |> Where(b.x)

Here, composition a |> b deconstructs the original object b, which invalidates the reference b.x.

This issue can be fixed by having composition preserve a reference to the original node, but is it worth it considering that the problem is pretty obscure? An easy workaround is to rewrite the query like this:

b = a |> b
q = b |> Where(b.x)
ananis25 commented 2 years ago

Hmm, that is a good point; it indeed is not a common occurrence. Though it goes against the FunSQL idea a bit, that sql queries written in modular functions can be composed freely.

For example, I can imagine people factoring out self-joins as a separate function. If the output node from self_join ever gets rebased, the right hand side of the join will be left detached.

function self_join(q, col_name)
    return q |> Join(q |> As(:alias), on=Get(col_name) .== Get.alias >> Get(col_name))
end

The solution is making sure users aware how rebase works, which I feel is an abstraction leak.

clarkevans commented 2 years ago

This issue can be fixed by having composition preserve a reference to the original node

The downside, besides implementation time, is added maintenance due to more complex code base?

xitology commented 2 years ago

So, this error is not limited to bound references. Consider the following example.

table = SQLTable(:table, columns=[:x, :y])
function self_join()
    q = Where(Get.x .== 100)
    return q |> Join(q |> As(:alias), on=Get.x .== Get.alias.x) |> Select(Get.y)
end

query = From(table) |> self_join()
print(render(query))

# ERROR: FunSQL.ReferenceError: cannot find x in:
# let table = SQLTable(:table, …),
#     q1 = From(table),
#     q2 = Get.x,
#     q3 = Lit(100),
#     q4 = Fun."=="(q2, q3),
#     q5 = q1 |> Where(q4),
#     q6 = Where(q4),
#     q7 = q5 |> Join(q6 |> As(:alias), Fun."=="(Get.x, Get.alias.x)),
#     q8 = q7 |> Select(Get.y)
#     q8
# end

So, the issue really seems to be that FunSQL queries are not trees but DAGs. When we rebase the query pipeline A-B-C, we leave node E detached. dags_not_trees

I haven't found a nice way to solve this yet.

1. Tracking bound references with a tabular node as parent in the parent isn't sufficient, since `Join` nodes see the same error.

2. The simple solution is to make the query tree _mutable_ and make `rebase` an operation that just descends down the line of parent nodes, and swaps the null value at the end. That is, `X |> (A |> B |> C)` just replaces the parent node for A.
   However, now we need to ensure query fragments are never reused, which is not nice. The neat thing about FunSQL is that the data structures are kind of immutable, and can be freely composed.

3. Visit and collect all SQLNodes in the query tree that need to be rebased; go over them in the topological order, i.e `A, B, E, C` in the diagram and make a copy of each.
   Given each SQLNode has multiple attributes that could be nodes themselves, this would be an expensive and tedious.

What do you think?

You cannot construct this DAG by reusing the object q. This is because FunSQL ignores the identity of the node objects (bound references are an exception), which means any node in a query could be replaced with a copy without changing the meaning of the query. In your example, it means you could equivalently write

q1 = Where(Get.x .== 100)
q2 = Where(Get.x .== 100)
return q1 |> Join(q2 |> As(:alias), on=Get.x .== Get.alias.x) |> Select(Get.y)

and it must produce exactly the same query.

One way to solve this is by moving the variable q to the FunSQL level, that is, using With to create a tabular "variable":

From(:q) |>
Join(From(:q) |> As(:alias), on = Get.x .== Get.alias.x)) |>
Select(Get.y) |>
With(From(table) |> Where(Get.x .== 100) |> As(:q))

This does not yet let you write a self_join() function. To make self_join() possible, we need a variant With′ (need a better name) of With that exchanges the order of arguments: q1 |> With(q2) is equivalent to q2 |> With′(q1). Then you can write

self_join() =
    Where(Get.x .== 100) |>
    As(:q) |>
    With′(
        From(:q) |>
        Join(From(:q) |> As(:alias), on = Get.x .== Get.alias.x) |>
        Select(Get.y))

From(table) |> self_join()
ananis25 commented 2 years ago

You cannot construct this DAG by reusing the object q. This is because FunSQL ignores the identity of the node objects (bound references are an exception), which means any node in a query could be replaced with a copy without changing the meaning of the query. In your example, it means you could equivalently write

Right, would you consider "a node can be replaced with a structural copy" to be an implementation detail?

It definitely makes query tree transformations easy and error surface smaller by making nodes immutable, though I would have liked the self-join query to work :).

xitology commented 2 years ago

Right, would you consider "a node can be replaced with a structural copy" to be an implementation detail?

This is just a different way to say algebraic or referentially transparent.

xitology commented 6 months ago

Bound references are no longer supported in FunSQL v0.13.0.