Open AntonioL opened 2 years ago
That's an interesting use of a database. The WITH-RECURSIVE
could maybe be expressed with our s-string, but in place of pipeline during a table definition:
table t = s"VALUES (1) UNION ALL SELECT n+1 FROM t WHERE n < 100"
which could be packed into a function for easier use.
This would require #376 and a table flag to add the RECURSIVE
keyword.
This would require #376 and a table flag to add the
RECURSIVE
keyword.
Possibly we could also have an s-string that doesn't bind to anything; it's just:
s"""
WITH RECURSIVE t(n) AS (
VALUES (1)
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
"""
...and then there's always an escape hatch.
I would probably put this feature into the "ensure compatible" bucket, rather than "support natively", but if there were lots of π s on the issue as PRQL becomes more used, then we would work on supporting it in PRQL itself.
What about adding the concept of a table-valued function which can be marked as explicitly recursive?
In FSharp if a function is recursive is defined with let rec
Example:
let rec fa x = if x = 0 then x else x + fa (x-1)
I guess it can be a rec table
. Since today you materialize table
with a CTE
expression. The only thing left would be to supply the starting items in the iteration.
In the above example it is the first part of the query VALUES(1)
.
But I want to stress that the above is not really "recursive", it is more iteration, wish it had a different term. However I like the escape hatch approach, this will allow us to defer any decisions on this matter.
I picked up another example today from someone interested in PRQL that calls for recursive CTEs:
with recursive dim_balance_sheet(account_key, parent_account_key, account_name) as (
values
(1221,null,'Balance Sheet'),
(1272,1221,'Assets'),
(1300,1272,'Current Assets'),
(1301,1300,'Cash & Cash Equivalent'),
(1302,1301,'Bank Accounts'),
(1307,1300,'Inventories'),
(1222,1221,'Equity and Liabilities'),
(1223,1222,'Capital and Reserves'),
(1236,1222,'Current Liabilities'),
(1266,1236,'Deferred Tax'),
(2093,1236,'Deposits'),
(1380,1236,'Short Term Loans'),
(1232,1222,'Non-Current Liabilities'),
(1662,1232,'Long-term Loans'),
(1224,1662,'Shareholders')),
bs as (
select account_key,
account_name
from dim_balance_sheet
where parent_account_key is null
union
select dim_balance_sheet.account_key,
bs.account_name || ' > ' || dim_balance_sheet.account_name
from dim_balance_sheet
join bs
on bs.account_key = dim_balance_sheet.parent_account_key
)
select *
from bs
order by account_name
These Recursive CTE tree traversals are quite common in my experience. Another common example is traversing organograms and employee management structures.
See this example for finding a Manager and Employee Hierarchy using the AdventureWorks database for example: SQL SERVER β Simple Example of Recursive CTE
Interesting that MS SQL Server doesn't need the recursive
keyword. I've been using Postgresql for so long that I had forgotten that.
Given the use cases for this, I think it would be good if we could support this.
@max-sixty @aljazerzen Does the compiler need to be smart about this or couldn't we simply have an optional rec
keyword before the table
function that makes it a rec table
as suggested above (with the dialect awareness that for MS SQL Server for example rec
becomes a no-op)?
I think we're currently also lacking VALUES
and UNION ALL
but that's kinda orthogonal to this and addressed elsewhere.
I think we could add a table recursive:true
parameter to table. I admittedly haven't ever used RECURSIVE
, but I can definitely understand the appeal.
I probably wouldn't prioritize it that highly without more signal from people who want to use it (good point re not having UNION
!), but would be happy to see it as a contribution.
Another example is in #716 in what appears to be a graph traversal algorithm.
Another example I have seen in my $dayjob is for security classification trees, so you start with portfolio at the top and then walk down through asset classes. industries, sectors, subsectors, ... . For multi-asset class funds you might want variable depths of the classification tree in different asset classes. It's very similar to the balance sheet example above just with different labels.
I think it's definitely lower priority than UNION
(especially since it's dependent on it) but sounds like it shouldn't be that hard to do. I'll see if I can take a stab at it (since I have some time until UNION
lands).
Hi! Just chiming in with an example query I use to retrieve threaded topics on a side project of mine.
I don't believe this is currently possible with prql
, but would be amazing if it did.
The following is sqlite, but I understand there are equivalents in most modern RDBMS
-- This is the recursive Common Table Expression
WITH RECURSIVE 'POSTS_CTE' AS (
SELECT 'p'.'id',
'p'.'author_id',
'p'.'parent_id',
'p'.'content_id',
(COALESCE(p.parent_id, "x") || ' > ' || p.id) AS path
FROM 'posts' AS 'p'
UNION
SELECT 'p'.'id',
'p'.'author_id',
'p'.'parent_id',
'p'.'content_id',
(
pr.path || ' > ' || p.parent_id || ' > ' || pr.id
) AS path
FROM 'POSTS_CTE' AS 'pr'
INNER JOIN 'posts' AS 'p' ON 'pr'.'id' = 'p'.'parent_id'
)
-- And this is a simple SELECT statement consuming it
SELECT 'pct'.'id' AS 'post_id',
'pu'.'id' AS 'post_author_id',
'pct'.'parent_id' AS 'post_parent_id',
'cu'.'id' AS 'content_owner_id',
'c'.'id' AS 'content_id',
'pct'.'path' AS 'post_path',
'cu'.'username' AS 'content_owner_username',
'pu'.'username' AS 'post_author_username',
'c'.'type' AS 'content_type',
'p'.'title' AS 'post_title',
'c'.'content' AS 'content_content'
FROM 'POSTS_CTE' AS 'pct'
INNER JOIN 'content' AS 'c' on 'pct'.'content_id' = 'c'.'id'
AND 'c'.'type' = 'topic'
INNER JOIN 'users' AS 'cu' on 'c'.'owner_id' = 'cu'.'id'
INNER JOIN 'users' AS 'pu' on 'pct'.'author_id' = 'pu'.'id'
INNER JOIN 'posts' AS 'p' on 'pct'.'id' = 'p'.'id'
WHERE post_path LIKE "x > %"
ORDER BY post_parent_id, LENGTH('post_path') ASC
I use it to produce records of this shape: | post_id | post_author_id | post_parent_id | content_owner_id | content_id | post_path | content_owner_username | post_author_username | content_type | post_title | content_content |
---|---|---|---|---|---|---|---|---|---|---|---|
4c5e55b2-59ae-42de-96a0-7ebc72827c7d | b1caac6b-8761-411c-93c4-e62d91f61174 | NULL | b1caac6b-8761-411c-93c4-e62d91f61174 | cdadc188-3bea-4799-a4a5-bf31b5ba63bd | x > 4c5e55b2-59ae-42de-96a0-7ebc72827c7d | moderator | moderator | topic | moderator | {"title":"moderator","personal":false} |
Which I then use as base structure for a forum type platform, through a simple sqlite view (CREATE VIEW AS that whole thing described above
).
As for the potential DX for such a feature, I'm not sure I'm in a position to choose? But I do find the S-string format proposed by max-sixty fairly interesting
Thank you for posting your use-case. This will be useful when designing recursive functions/CTEs/something else that will handle these cases.
Good blog post from @frankmcsherry on this a couple of days ago! https://github.com/frankmcsherry/blog/blob/master/posts/2022-12-25.md
Hey folks, I will be posting my PRQL syntax proposal soon. In the meantime, if you really want to go in-depth on Recursive CTEs, here's a link to an academic paper that appears to be hot off the press:
Tweet: https://twitter.com/Teggy/status/1612401148459159552 Paper: A Fix for the Fixation on Fixpoints
Thanks for the link to @frankmcsherry 's blog post @max-sixty. My mental model of Recursive CTEs much simpler and basically just involves iteration. I'm certainly not an expert on the matter and welcome any corrections. I haven't looked up the details for Recursive CTEs in years but I have written a couple over that timeframe and they always worked the way I expected, and to me it also makes sense why the first example in that article only produced powers of two.
To me the term "Recursive CTE" is actually a bit of a misnomer and it's really a type of iteration.
Let's first look at the basic form of a Recursive CTE in SQL and a basic example and then I will explain my mental model:
WITH [RECURSIVE] expression_name (column_list) AS (
-- Anchor member
initial_query
UNION ALL
-- Recursive member that references expression_name.
recursive_query
)
-- references expression name
SELECT * FROM expression_name
So the simplest example I can think of is pretty much the same as @AntonioL posted in the origin comment:
WITH RECURSIVE t AS (
SELECT 1 AS n
UNION ALL
SELECT n+1 FROM t WHERE n<3
)
SELECT * FROM t;
You can try this out right now in the DuckDB Web Shell. This produces | n |
---|---|
1 | |
2 | |
3 |
Basically you can think of this as:
n = 1
yield n
while True:
n = n + 1
yield n
if not n<3:
break
The general case is the same but rather than a loop variable n
you have a loop set and iteration stops when the loop set is empty.
So in my mental model in a Recursive CTE rows are moved through 3 sets: "new" -> "intermediate" -> "final".
initial_query
is run, the rows are produced in the "new" set and are then advanced to the "intermediate" set.recursive_query
is then run and any reference to expression_name
uses the rows from the "intermediate" set and any rows are produced are placed in the "new" set. Basically something like the following in Python (thanks @aljazerzen ):
def loop(recursive_call, initial_call):
final = []
new = initial_call()
while not empty(new):
intermediate = new
final = final + intermediate
new = recursive_call(intermediate)
return final
A straightforward translation of the SQL Recursive CTE syntax to PRQL would be the following:
table recursive relation_name = (
initial_pipeline
append (
recursive_pipeline # that references relation_name
# i.e. starts with
# from relation_name
)
)
from relation_name
However I think we can do better than that. In particular I don't like the following about the SQL inspired form above:
relation_name
inside the table
definition matches the name that it is being assigned to.That second point in particular I got from this comment on the recent HackerNews thread Microfeatures I'd like to see in more languages. Based on that comment I looked up Clojure's loop/recur
syntax. I found the description in this page Clojure Guide: How to use loop and recur quite informative and that inspired most of what follows.
I think that SQL's Recursive CTEs actually work quite similarly to Clojure's loop/recur construct which as the one HN comment points out gives you a loop construct with just expressions.
Clojure's loop construct has three parts:
In our case, the base case is that when the loop is restarted with an empty set as the rebinding then the iteration stops. Since this is always the same we don't need to specify it and can leave it out of our construct.
We therefore get a function signature for the loop
transform that is something like the following:
func loop distinct<bool>:false binding_name recursive_pipeline initial_pipeline -> pipeline
Our general Recursive CTE in PRQL would then look something like:
table relation_name = (
initial_pipeline
loop binding_name (
recursive_pipeline # that references **binding_name**
# i.e. starts with
# from binding_name
)
)
from relation_name
Given that there is now no longer any reference to relation_name
in the loop construct, I was thinking that we actually no longer need the table
definition in PRQL and could probably write this as:
initial_pipeline
loop binding_name (
recursive_pipeline # that references **binding_name**
)
So for example our basic sequence example could be written as:
select n=1
loop seq (
from seq
select n+1
filter n<3
)
which would get translated into the following SQL:
WITH RECURSIVE table_0 AS (
SELECT 1 AS n
UNION ALL
SELECT n+1 FROM table_0 WHERE n<3
)
SELECT * FROM table_0;
i.e. the compiler replaces the binding_name seq
with whatever name gets generated for the CTE, table_0
in this case.
Assume you have a table nodes
of the form
CREATE TABLE nodes (
id int,
name varchar,
parent_id int NULL
);
To walk the tree from the root node you would use something like the following:
from nodes
filter parent_id==null
select [parent_name=null, child_name=name]
loop parent (
from parent
join child=nodes [child.parent_id==parent.id]
select [parent_name=parent.name, child_name=name]
)
Let's rewrite @teggy 's example using this new syntax.
The original example is as follows which has a nested WITH
which I understand to be there to get around a restriction in PostgreSQL that allows only one reference to the recursive table name inside the recursive query part.
WITH RECURSIVE
t(n) AS (
VALUES (1)
UNION ALL
(
WITH t AS (SELECT * FROM t)
SELECT t1.n + t2.n AS n
FROM t AS t1, t AS t2
WHERE t1.n < 256
)
)
SELECT * FROM t;
which produces the following output: | n |
---|---|
1 | |
2 | |
4 | |
8 | |
16 | |
32 | |
64 | |
128 | |
256 |
DuckDB allows a simpler form without the nested WITH
clause which is what I will use as my example because I don't know how to represent the nested WITH
in PRQL.
WITH RECURSIVE
t(n) AS (
VALUES (1)
UNION ALL
SELECT t1.n + t2.n AS n
FROM t AS t1, t AS t2
WHERE t1.n < 256
)
SELECT * FROM t;
The proposed PRQL form would be:
select n=1
loop t (
from t1=t
join side:full t2=t
select n = t1.n + t2.n
filter t1.n<256
)
I think that makes it clearer why you only get powers of 2, since it is only looping and there is no recursion, i.e. both t1
and t2
refer to the same bound set of rows and since there is always only one row in each iteration they always refer to the same value.
Unfortunately I didn't click through to that Twitter thread until now and wasn't aware that some databases now allow "non-linear", i.e. true(?) recursion. My proposed syntax above probably won't work for that.
I wonder whether it's worth keeping this simpler form for looping constructs (as in Clojure) since that's also what's most broadly supported, and then think of a different syntax/construct for the non-linear case? That would also allow the compiler to error out for dialects that don't support non-linear recursion.
group
transformI drew some inspiration from the group
transform when thinking about this. To wit, the general form of how we write these currently is as follows:
from relation_name
group [cols] (
group_transformation
)
While this is similar to how SQL handles GROUP BY
, I would argue that it actually hides the set of rows that the group_transform
is actually acting on. What I mean by that is that at a logical level we can imagine that the rows for each group are bound to a relation, call it group_relation
, and the form above is a shorthand for the following:
from relation_name
group [cols] group_relation (
from group_relation
group_transformation # referencing group_binding
)
For example if you look at how Pandas does groupby(...).apply(...), you'll see that it's essentially of the form df.groupby(...).apply(lambda grp_df: ...)
, i.e. for each group, the function argument supplied to apply
is called and passed a DataFrame argument grp_df
which contains all the rows in that group.
Most of the time you don't need the explicit reference to group_relation
, so it's convenient and ergonomic to leave it out. I have two questions arising out of that:
loop
construct?
recursive_pipeline
wouldn't have to begin with from binding_name
and if the binding_name
isn't referenced in the recursive_pipeline
at all then perhaps it wouldn't need to be specified at all and could just take on some default value. The simple sequence example could then be written as:
select n=1
loop (
select n+1
filter n<3
)
from default_name
step during the transpilation process.group
not have the same syntax as loop
and force or allow specifying a group_relation
binding name?
group
since these might begin with a from source_name
step.The proposed PRQL syntax is as follows:
initial_pipeline
loop distinct:{false,true} binding_name (
recursive_pipeline # that references **binding_name**
)
binding_name
optional.Thanks a lot for the impressively thorough design @snth
I very much encourage those who use recursive CTEs to comment β either your views on the design, or a use-case which would / wouldn't work well. Thanks in advance
This is a very nice write-up and documentation of language design process.
I really like the suggested design, because it:
std.loop
function,Your proposal for unifying with group syntax makes sense:
select n=1
loop ( # this pipeline is missing the leading
select n+1 # value which means that is evaluates
filter n<3 # to a function relation -> relation
)
As I see it, loop
basically takes two values: initial function and iteration function. It's just that in your proposal, iteration function is expressed with a binding name and a pipeline.
We could make this more general, by introducing syntax for lambda functions:
select n=1
loop (t ->
from t1=t
join side:full t2=t
select n = t1.n + t2.n
filter t1.n<256
)
I read most of the A Fix for the Fixation on Fixpoints by @Teggy and @kryonix last night and think that my Python description of how SQL Recursive CTE's work is essentially correct.
The paper gives a pseudocode equivalent of the SQL Recursive CTE semantics as:
where
final
in my description),intermediate
in my description),new
in my description),initial query
, andrecursive query
Or rewritten in Pyhon:
def loop(q, q1):
u = q1()
w = u
while True:
i = q(w)
if not i:
break
u = u + i
w = i
return u
and then compare that to my original Python description with the new symbols:
def loop(q, q1):
u = []
i = q1()
while not empty(i):
w = i
u = u + w
i = q(w)
return u
The rest of the paper describes proposed extensions to the SQL Recursive CTE semantics and syntax to allow expressing some algorithms more ergonomically as well as allowing the database engine to process these more efficiently.
Regarding non-linear recursion, the most relevant part of the paper for me was the following:
My takeaway from that is that it doesn't sound like it changes anything syntactically so the proposed PRQL syntax should still be adequate.
Co-author of A Fix for the Fixation of Fixpoints here.
It is a joy to see that recursive (iterative, really) query constructs are being considered for inclusion in PRQL. That decision is one of major consequences for the underlying language: in the case of SQL, it equipped the language with Turing-completeness (from then on, any computable function could be expressed in SQL β whether it should ..., well, that's different question altogether ;-).
Regarding (non-)linear recursion:
Most existing RDBMSs restrict the iterated part of recursive CTE (what Tobias calls q
above and the paper refers to as qβ
or q loop) to refer to the working table w
(or intermediate
, in Tobias' writeup) only once. This is the linear recursion restriction. Why that restriction? Nothing would keep SQL or q
to refer to w
twice or many times syntactically. (So, nothing would need to be changed in the PRQL syntax proposals above.)
The outcome of such a non-linear computation, however, can be counter-intuitive. The root cause is the fact that q
only sees the most recent results of the immediately preceding iteration when it refers to table w
(this treatment of w
has been designed as an optimization, termed semi-naive evaluation). There is a folklore example that demonstrates the unexpected outcome of such a non-linear query which can be found in F. Bancilhon and R. Ramakrishnan, An Amateurβs Introduction to Recursive Query Processing Strategies, ACM SIGMOD Record, 15(2):16β52, June 1986, at the end of Section 3.2.2 (page 27). The ancestor
example there is formulated in Datalog β if you find that syntax or the example non-accessible, let me know and I'll try to cook up the SQL equivalent. In a nutshell, the ancestor
query result will skip over (exponentially growing numbers) of generations of ancestors, rendering the result somewhat useless. That behaviour would be different if w
would always contain the rows of all preceding iterations.
What could be the consequences for PRQL?
You could say that the outcome of such a non-linear query is not wrong, it is just unexpected (and query developers could thus potentially trip over it). If developers use non-linear recursion, they should be really aware of what they're doing.
PRQL could syntactically forbid repeated references to w
β that is the route that most SQL RDBMSs chose to take. Personally, I find that irritating.
When a recursive is non-linear, abandon semi-naive evaluation and instead hold all rows of preceding iterations in w
. That will make the non-linear query behave as it expected but it changes the semantics of w
"behind the back". Ugh. MariaDB took that route, if I am not mistaken.
[Our take in A Fix for the Fixation of Fixpoints] Provide the query with control over how many previous iterations are remembered in w
(or: how many future iterations will a result row remain visible for qβ
in w
). That's experimental and one salient point of the paper. Obviously I am a big fan but β no doubt β that would be regarded as an exotic feature.
Very interested to see how recursion/iteration finally find its way into PRQL. Keep up the good work, folks!
As I see it, loop basically takes two values: initial function and iteration function. It's just that in your proposal, iteration function is expressed with a binding name and a pipeline.
We could make this more general, by introducing syntax for lambda functions:
Thanks @aljazerzen . That sounds like a beautiful generalisation.
The same would then also apply to the group
transform right?
Recursive CTEs have been a requested feature in all posts we had to date -not because people need it, but rather because as @Teggy noted, it makes SQL turing complete. Even if not really useful for everyday tasks, this issue does check a mark when considering PRQL adoption for a project.
It's great get advice of someone who has spent a great deal of time researching databases.
It seems that we should stick to linear recursion (or rather iteration) because we do need to target SQL right now and not many SQL dialects support non-linear recursive queries.
We could provide non-linear recursion via some other function, and clearly state supported dialects in the documentation. This can be done at a later stage.
The same would then also apply to the group transform right?
Yes it would. But it would throw some kind of error or even panic if you tried using input relation more than once:
from tracks
group genre_id (chunk ->
from chunk
join chunk
)
Very much agree with your comments regarding sticking to linear recursion for now and our proposed syntax would also show that this is iteration and might demystify "Recursive" CTEs for many people.
Recursive CTEs have been a requested feature in all posts we had to date -not because people need it, but rather because as @Teggy noted, it makes SQL turing complete. Even if not really useful for everyday tasks
The Turing Completeness is a great property to have but as @Teggy also noted, whether you should use that and whether SQL is the right tool for the job for general problems is another question.
More importantly though, I think people requested Recursive CTEs because they need them to solve real problems.
So I think it very much is a real need for people and goes beyond a nice to have theoretical property. I therefore fully support providing a solution to the iterative case which people need right now and deferring worrying about the general recursive case which is not widely supported by RDBMs anyway.
That's great regarding the group
construct.
As @Teggy noted, we should have a syntactic restriction or error on multiple uses of the input relation in the loop
case as well since that is also not supported by most RDBMs for Recursive CTEs.
A potentially rich source of Recursive CTEs: https://hub.getdbt.com/jpmmcneill/dbt_graph_theory/latest/
Example from: https://twitter.com/matsonj/status/1621031377616646144 (in turn from https://youtu.be/pVklntbGiNo?t=220).
First attempt at rewriting this in the proposed PRQL syntax. I think we don't have two of the constructs yet (ARRAY and EXISTS) so I have capitalised those:
from follows
join p1=person [p1.id==folllows.p1id, p1.name=='Bob']
select [startNode=p1id, endNode=p2id, path=ARRAY[p1id, p2id],]
loop (paths ->
from paths
join follows [paths.endNode==follows.p1id]
filter ! EXISTS (
from previous_paths=paths
join p2=person [p2.id==follows.p2id]
filter p2.name=='Bob' or follows.p2id==previous_paths.endNode
)
select [startNode=paths.startNode, endNode=p2id, path=array_append path p2id]
)
join p1=person [p1.id==startNode]
join p2=person [p2.id==endNode]
join city [city.id==p2.livesIn, city.name=='Utrecht']
A problem with the query above is that paths
shouldn't actually be exposed in the scope of the main pipeline because it's only defined in the lambda argument of loop
. That's why in the last 3 joins above I actually surpressed the references to paths
from the original query (and because I don't think we actually need) them.
Following the same semantics for defining aliases that we use in from alias=relation
and join alias=relation
, I suggest a slight tweak to the syntax that we discussed above, i.e. introducing an alias name into the loop
syntax as loop alias=(recursive_pipeline)
. With this the query would be:
from follows
join p1=person [p1.id==folllows.p1id, p1.name=='Bob']
select [startNode=p1id, endNode=p2id, path=ARRAY[p1id, p2id],]
loop paths=(ps ->
from ps
join follows [ps.endNode==follows.p1id]
filter ! EXISTS (
from previous_paths=ps
join p2=person [p2.id==follows.p2id]
filter p2.name=='Bob' or follows.p2id==previous_paths.endNode
)
select [startNode=ps.startNode, endNode=p2id, path=array_append path p2id]
)
join p1=person [p1.id==paths.startNode]
join p2=person [p2.id==paths.endNode]
join city [city.id==p2.livesIn, city.name=='Utrecht']
@aljazerzen what do you think? I think this is actually better to what we had before because it also demonstrates how the paths
table/relation is built up from the loop
.
I don't see how scope of paths is a problem. The result of loop
contains the three columns that you need:
[startNode, endNode, path]
From what I gather, your new proposal would expose the same thing, columns would just be named:
[paths.startNode, paths.endNode, paths.path]
Same effect can be achieved with a select, no?
Generally I like the idea of the new proposal, but I don't think it fits here. loop name=pipeline
would assign a name to the pipeline (a function), and not the relation as it does with from and join.
The loop
example from the book
from_text format:json '[{"n": 1 }]'
loop (
filter n<4
select n = n+1
)
produces the following SQL
WITH table_0 AS (
SELECT
1 AS n
),
table_4 AS (
WITH RECURSIVE loop AS (
SELECT
n
FROM
table_0 AS table_1
UNION
ALL
SELECT
n + 1
FROM
loop AS table_2
WHERE
n < 4
)
SELECT
*
FROM
loop
)
SELECT
n
FROM
table_4 AS table_3
-- Generated by PRQL compiler version:0.6.0 (https://prql-lang.org)
The behaviour of the compiler that I have observed is that the Recursive CTE is always called "loop" and it is then wrapped in another CTE to provide a unique name. This seems inefficient in terms of the length of the SQL produced. Is the wrapping required in order to allow for the RECURSIVE? I noticed that in the DuckDB online shell some examples didn't work without it.
More importantly though it doesn't allow for referring back to the accumulating result set without referring to the _frame
which is an internal implementation detail that should be hidden from the user. (I'll try to produce a minimal example of this in a follow up.)
Therefore I propose that we implement a binding name as per my previous comment here: https://github.com/PRQL/prql/issues/407#issuecomment-1414402724
This can be optional (as in from
or join
) and default to "loop" in that case. Otherwise the name provided should be used.
With that the canonical example could be written as:
from_text format:json '[{"n": 1 }]'
loop loop_name=(
filter n<4
select n = n+1
)
and the SQL produced could be changed to:
WITH table_0 AS (
SELECT
1 AS n
),
table_4 AS (
WITH RECURSIVE loop_name AS (
SELECT
n
FROM
table_0 AS table_1
UNION
ALL
SELECT
n + 1
FROM
loop_name AS table_2
WHERE
n < 4
)
SELECT
*
FROM
loop_name
)
SELECT
n
FROM
table_4 AS table_3
-- Generated by PRQL compiler version:0.6.0 (https://prql-lang.org)
The behaviour of the compiler that I have observed is that the Recursive CTE is always called "loop" and it is then wrapped in another CTE to provide a unique name. This seems inefficient in terms of the length of the SQL produced. Is the wrapping required in order to allow for the RECURSIVE? I noticed that in the DuckDB online shell some examples didn't work without it.
In SQL, RECURSIVE be used only directly after WITH keyword. This means that loop should always translate to the first of the WITH CTEs, which is not guaranteed to happen and actually almost never happens with real queries. So I constructed a CTE that contains a nested WITH clause, which can be recursive, since it is always the first one.
Quite verbose on SQL side, but it does work. We could try to optimize it, but I'm not sure if it is worth the effort.
Therefore I propose that we implement a binding name as per my previous comment here: https://github.com/PRQL/prql/issues/407#issuecomment-1414402724
I still don't quite understand what loop_name=
would mean here.
Which one of these two would it allow?
from_text format:json '[{"n": 1 }]'
loop loop_name=(
filter n<4
select n = n+1
)
select [loop_name.n]
... or:
from_text format:json '[{"n": 1 }]'
loop loop_name=(
filter n<4
join other_table [loop_name.n == other_table.n]
select n = n+1
)
In SQL, RECURSIVE be used only directly after WITH keyword. ... So I constructed a CTE that contains a nested WITH clause, which can be recursive, since it is always the first one.
Thanks, I only learned that today. Clever workaround, very nice! I also learned on the DuckDB Discord today (link) though that while it's true that the RECURSIVE keyword has to immediately follow the WITH keyword if there is a Recursive CTE present, it doesn't actually matter which, if any, of the CTEs is recursive though (at least this holds for DuckDB and Postgres). Moreover there doesn't appear to be any negative performance impact to having the RECURSIVE keyword when it's not required. Therefore it seems to me that this presents us with two options:
loop
present and then start the CTE clause with WITH RECURSIVE
instead of just WITH
,WITH RECURSIVE
instead of WITH
since there seems to be no downside to having it.Then my modified form of the loop
example from the book:
from_text format:json '[{"n": 1 }]'
loop loop_name=(
filter n<4
select n = n+1
)
could produce the following SQL:
WITH RECURSIVE table_0 AS (
SELECT
1 AS n
),
loop_name AS (
SELECT
n
FROM
table_0
UNION
ALL
SELECT
n + 1
FROM
loop_name
WHERE
n < 4
)
SELECT
n
FROM
loop_name
I've tested that SQL in both DuckDB and PostgreSQL and it works in both!
I still don't quite understand what loop_name= would mean here.
Which one of these two would it allow?
Both!
If you look at the example SQL for our example
WITH RECURSIVE table_0 AS (
SELECT
1 AS n
),
loop_name AS (
SELECT
n
FROM
table_0
UNION
ALL
SELECT
n + 1
FROM
loop_name
WHERE
n < 4
)
SELECT
n
FROM
loop_name
you see that loop_name
is both the name of the CTE which is accessible downstream from the Recursive CTE as well as the name of the results from the previous iteration in the update part of the CTE, i.e. inside the argument of loop
in our syntax.
Therefore both the PRQL loop
with loop_name
examples that you provided should work.
In some of the examples that I'm busy looking at and writing up, this is required. For example in the moving average calculation
let stream = (
from data
select [step=row_number, x=`Adj Close`]
sort [step]
)
from stream
filter step==1
select [step, x, avg=x]
loop (
join stream [stream.step==_frame.step+1]
select [stream.step, stream.x, avg=_frame.avg + (stream.x - _frame.avg) / stream.step]
)
sort [-step]
take 5
I'm having to resort to accessing frame
whereas with the loop_name
syntax we could write this as:
let stream = (
from data
select [step=row_number, x=`Adj Close`]
sort [step]
)
from stream
filter step==1
select [step, x, avg=x]
loop state=(
join stream [stream.step==state.step+1]
select [stream.step, stream.x, avg=state.avg + (stream.x - state.avg) / stream.step]
)
sort [-step]
take 5
if there is a Recursive CTE present, it doesn't actually matter which, if any, of the CTEs is recursive though
Woah, I did not expect that! It also appears to hold for MySQL and SQLite3.
doesn't appear to be any negative performance impact
You cannot measure performance impacts on these queries. You'd need to have at least 1MB of data and do a few re-runs to have a reliable level of certainty. Or an opinion from people who know how RECURSIVE is implemented in an engine.
But if it really does not have a performance impact, it wouldn't be hard to detect when RECURSIVE is needed and go with your first option.
Both!
I'm having to resort to accessing frame whereas with the loop_name syntax we could write this as:
I see, right.
I would concur if you meant only my first example (loop_name
is available after loop transform in the pipeline), but I don't think the behavior in my second example is a good idea. I have a few fiddly reasons all stemming from my interpretation of PRQL name resolution, but before we get into that:
What do you think about adding syntax for lambda functions to deal with this?
from_text format:json '[{"n": 1 }]'
loop (loop_name ->
loop_name
filter loop_name.n<4
select n = n+1
)
Sidenote:
If you look at the example SQL for our example
Let's try to refrain from using SQL implementation as a basis for PRQL language design. It's leads to thinking in terms for SQL, instead of relations.
The performance impact statement was a comment from Mark, one of the DuckDB core devs. It's in the DuckDB discord discussion I linked to above.
Update: BigQuery also allows "other than first recursive CTEs"
@snth What's left to do here? Lambda functions? Nicer SQL output?
I'll get back to this again soon.
@aljazerzen , because loop
defines a new, recursive relation, we need a way to reference that relation because of the recursive part. For the simple examples that are in our current documentation it works without that because unqualified columns reference the current frame. However this breaks down with join
, which is the most common use case for real-life recursive CTEs, because our current join
specification forces you to explicitly name the source relations.
So for example the following simple example doesn't work in the playground.
from employees
filter last_name=="Mitchell"
loop (
join manager=employees [manager.employee_id==reports_to]
select manager.*
)
Given compiler code changes and refactors since we last discuss this, how do you now feel about the loop name=(pipeline)
proposal? Do your concerns from https://github.com/PRQL/prql/issues/407#issuecomment-1464982902 still hold?
It would be great if we could get either this or the lambda syntax into 0.9 still then I can write up some examples with the release announcement.
I do still have reservations toward name=(pipeline)
.
The upside is that in the meantime, the syntax for lamdba functions was added!
from employees
filter last_name=="Mitchell"
loop (prev_step -> (
employee = prev_step
join manager=employees (employee.reports_to == manager.employee_id)
select manager.*
))
I know you are going to ask "why the employee = prev_step
" and my answer is "it's a consequence of current pipeline name inference rules".
So you're saying the following doesn't work:
from employees
filter last_name=="Mitchell"
loop (employee -> (
employee
join manager=employees (employee.reports_to == manager.employee_id)
select manager.*
))
I assume there is nothing special about the name prev_step
, just that you need one reassignment?
The following works on current main:
from employees
filter last_name=="Mitchell"
loop (
join manager=employees (manager.employee_id==_frame.reports_to)
select manager.*
)
and produces
WITH table_2 AS (
WITH RECURSIVE _loop AS (
SELECT
*
FROM
employees
WHERE
last_name = 'Mitchell'
UNION
ALL
SELECT
manager.*
FROM
_loop AS table_0
JOIN employees AS manager ON manager.employee_id = table_0.reports_to
)
SELECT
*
FROM
_loop
)
SELECT
*
FROM
table_2 AS table_1
-- Generated by PRQL compiler version:0.8.1 (https://prql-lang.org)
I'm not sure what's easier to explain, the use of _frame
or prev_step
.
I guess the benefit of _frame
is that you don't have to do any more work and I can just add some more documentation to the release.
(Incidentally, I keep having issues with this in the playground. I'll try again and open an issue if need be.)
So you're saying the following doesn't work:
Yes, exactly.
I assume there is nothing special about the name prev_step, just that you need one reassignment?
prev_step
is a variable that contains a relation, while employee
is a tuple within the relation.
But yes, _frame
is probably easier to explain.
I wouldn't want to document that as a part of the language - the _
screams "internal" or "unused". We could at least change the name to relation
or something.
Playground is on 0.8, it probably does not support most of this.
Just chiming in with the most urgently necessary usecase of this. The sqlite docs has a mandelbrot fractal display and a sudoku solver as two of the primary examples of CTE usage:
https://www.sqlite.org/draft/lang_with.html#outlandish_recursive_query_examples
Of course, having both window functions and recursive CTEs is what makes basic SQL turing complete, since you can implement a cyclic tag system with that. See slide 47 here: https://cdn.oreillystatic.com/en/assets/1/event/27/High%20Performance%20SQL%20with%20PostgreSQL%20Presentation.pdf
Hi @saolof ,
Thank you for the link to those examples. They are great!
This has been implemented as the loop
transform. See the docs here: https://prql-lang.org/book/reference/stdlib/transforms/loop.html
I also have some example demos in this Colab notebook: https://colab.research.google.com/drive/146ohCI_WlnOKpQQ6sPMWu4uKGqZvwaIZ#scrollTo=loctSkQYFA_w
Unfortunately there's a regression bug in latest version which we'll resolve. In the meantime you can pin to a previous version like I do in that notebook.
Edit: For an equally outlandish query, you can also look at my Calculate the digits of Pi with DuckDB and PRQL blogpost. π
Ah neat. Does loop as it exists now support dynamic programming and graph traversals?
A common usecase is when you have a table, and want to join it with an ancestor relation, but only have a parent relation in the DB because the ancestor relation would be prohibitively large. In those cases the dynamic aspect (being able to use union instead of union all) can be rather important to avoid double-counting paths.
Yes, the paragraph linked here https://colab.research.google.com/drive/146ohCI_WlnOKpQQ6sPMWu4uKGqZvwaIZ#scrollTo=loctSkQYFA_w goes through exactly that, variable depth tree traversal.
Good point about UNION vs UNION ALL. It uses UNION ALL by default. I can't recall offhand whether there's a parameter to ask for UNION instead. We should add that if not.
At my shop we do lot of ELT where the T happens in database. A typical use case is the following:
This has the benefit that everything happens in database, and if there is an error in the business logic we only need to refresh the materialized views. We love materialized views. Our E are literally scripts which do a single request and a single INSERT query. And when we need to dump the database, the only content are the JSON files content + definition, so our dumps are very light.
In our context some times we need iteration, for example, we have a feed of events, a single JSON document, we need to process those events accumulating some state, to process the message T[n], we need the state derived from T[1],...,T[n-1]. In functional programming languages this is called a
fold
.In postgres we use
WITH-RECURSIVE
queries, https://www.postgresql.org/docs/14/queries-with.html#QUERIES-WITH-RECURSIVEI can see a world where I would write the body of materialized views in
PRQL
.