PRQL / prql

PRQL is a modern language for transforming data β€” a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.89k stars 217 forks source link

`WITH RECURSIVE` based iteration #407

Open AntonioL opened 2 years ago

AntonioL commented 2 years ago

At my shop we do lot of ELT where the T happens in database. A typical use case is the following:

This has the benefit that everything happens in database, and if there is an error in the business logic we only need to refresh the materialized views. We love materialized views. Our E are literally scripts which do a single request and a single INSERT query. And when we need to dump the database, the only content are the JSON files content + definition, so our dumps are very light.

In our context some times we need iteration, for example, we have a feed of events, a single JSON document, we need to process those events accumulating some state, to process the message T[n], we need the state derived from T[1],...,T[n-1]. In functional programming languages this is called a fold.

In postgres we use WITH-RECURSIVE queries, https://www.postgresql.org/docs/14/queries-with.html#QUERIES-WITH-RECURSIVE

WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;

I can see a world where I would write the body of materialized views in PRQL.

aljazerzen commented 2 years ago

That's an interesting use of a database. The WITH-RECURSIVE could maybe be expressed with our s-string, but in place of pipeline during a table definition:

table t = s"VALUES (1) UNION ALL SELECT n+1 FROM t WHERE n < 100"

which could be packed into a function for easier use.

This would require #376 and a table flag to add the RECURSIVE keyword.

max-sixty commented 2 years ago

This would require #376 and a table flag to add the RECURSIVE keyword.

Possibly we could also have an s-string that doesn't bind to anything; it's just:

s"""
WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
"""

...and then there's always an escape hatch.


I would probably put this feature into the "ensure compatible" bucket, rather than "support natively", but if there were lots of πŸ‘ s on the issue as PRQL becomes more used, then we would work on supporting it in PRQL itself.

AntonioL commented 2 years ago

What about adding the concept of a table-valued function which can be marked as explicitly recursive? In FSharp if a function is recursive is defined with let rec

Example:

let rec fa x = if x = 0 then x else x + fa (x-1)

I guess it can be a rec table. Since today you materialize table with a CTE expression. The only thing left would be to supply the starting items in the iteration.

In the above example it is the first part of the query VALUES(1).

But I want to stress that the above is not really "recursive", it is more iteration, wish it had a different term. However I like the escape hatch approach, this will allow us to defer any decisions on this matter.

snth commented 2 years ago

I picked up another example today from someone interested in PRQL that calls for recursive CTEs:

with recursive dim_balance_sheet(account_key, parent_account_key, account_name) as (
values
    (1221,null,'Balance Sheet'),
    (1272,1221,'Assets'),
    (1300,1272,'Current Assets'),
    (1301,1300,'Cash & Cash Equivalent'),
    (1302,1301,'Bank Accounts'),
    (1307,1300,'Inventories'),
    (1222,1221,'Equity and Liabilities'),
    (1223,1222,'Capital and Reserves'),
    (1236,1222,'Current Liabilities'),
    (1266,1236,'Deferred Tax'),
    (2093,1236,'Deposits'),
    (1380,1236,'Short Term Loans'),
    (1232,1222,'Non-Current Liabilities'),
    (1662,1232,'Long-term Loans'),
    (1224,1662,'Shareholders')),
 bs as (
        select account_key,
               account_name
          from dim_balance_sheet
         where parent_account_key is null
         union 
        select dim_balance_sheet.account_key,
               bs.account_name || ' > ' || dim_balance_sheet.account_name
          from dim_balance_sheet
          join bs
            on bs.account_key = dim_balance_sheet.parent_account_key
       )
select *
  from bs
 order by account_name
snth commented 2 years ago

These Recursive CTE tree traversals are quite common in my experience. Another common example is traversing organograms and employee management structures.

See this example for finding a Manager and Employee Hierarchy using the AdventureWorks database for example: SQL SERVER – Simple Example of Recursive CTE

Interesting that MS SQL Server doesn't need the recursive keyword. I've been using Postgresql for so long that I had forgotten that.

snth commented 2 years ago

Given the use cases for this, I think it would be good if we could support this.

@max-sixty @aljazerzen Does the compiler need to be smart about this or couldn't we simply have an optional rec keyword before the table function that makes it a rec table as suggested above (with the dialect awareness that for MS SQL Server for example rec becomes a no-op)?

I think we're currently also lacking VALUES and UNION ALL but that's kinda orthogonal to this and addressed elsewhere.

max-sixty commented 2 years ago

I think we could add a table recursive:true parameter to table. I admittedly haven't ever used RECURSIVE, but I can definitely understand the appeal.

I probably wouldn't prioritize it that highly without more signal from people who want to use it (good point re not having UNION!), but would be happy to see it as a contribution.

snth commented 2 years ago

Another example is in #716 in what appears to be a graph traversal algorithm.

Another example I have seen in my $dayjob is for security classification trees, so you start with portfolio at the top and then walk down through asset classes. industries, sectors, subsectors, ... . For multi-asset class funds you might want variable depths of the classification tree in different asset classes. It's very similar to the balance sheet example above just with different labels.

I think it's definitely lower priority than UNION (especially since it's dependent on it) but sounds like it shouldn't be that hard to do. I'll see if I can take a stab at it (since I have some time until UNION lands).

cfv1984 commented 1 year ago

Hi! Just chiming in with an example query I use to retrieve threaded topics on a side project of mine.

I don't believe this is currently possible with prql, but would be amazing if it did.

The following is sqlite, but I understand there are equivalents in most modern RDBMS

-- This is the recursive Common Table Expression 
WITH RECURSIVE 'POSTS_CTE' AS (
    SELECT 'p'.'id',
        'p'.'author_id',
        'p'.'parent_id',
        'p'.'content_id',
        (COALESCE(p.parent_id, "x") || ' > ' || p.id) AS path
    FROM 'posts' AS 'p'
    UNION
    SELECT 'p'.'id',
        'p'.'author_id',
        'p'.'parent_id',
        'p'.'content_id',
        (
            pr.path || ' > ' || p.parent_id || ' > ' || pr.id
        ) AS path
    FROM 'POSTS_CTE' AS 'pr'
        INNER JOIN 'posts' AS 'p' ON 'pr'.'id' = 'p'.'parent_id'
)
-- And this is a simple SELECT statement consuming it
SELECT 'pct'.'id' AS 'post_id',
    'pu'.'id' AS 'post_author_id',
    'pct'.'parent_id' AS 'post_parent_id',
    'cu'.'id' AS 'content_owner_id',
    'c'.'id' AS 'content_id',
    'pct'.'path' AS 'post_path',
    'cu'.'username' AS 'content_owner_username',
    'pu'.'username' AS 'post_author_username',
    'c'.'type' AS 'content_type',
    'p'.'title' AS 'post_title',
    'c'.'content' AS 'content_content'
FROM 'POSTS_CTE' AS 'pct'
    INNER JOIN 'content' AS 'c' on 'pct'.'content_id' = 'c'.'id'
    AND  'c'.'type' = 'topic'
    INNER JOIN 'users' AS 'cu' on 'c'.'owner_id' = 'cu'.'id'
    INNER JOIN 'users' AS 'pu' on 'pct'.'author_id' = 'pu'.'id'
    INNER JOIN 'posts' AS 'p' on 'pct'.'id' = 'p'.'id'
WHERE post_path LIKE "x > %"
ORDER BY post_parent_id, LENGTH('post_path') ASC
I use it to produce records of this shape: post_id post_author_id post_parent_id content_owner_id content_id post_path content_owner_username post_author_username content_type post_title content_content
4c5e55b2-59ae-42de-96a0-7ebc72827c7d b1caac6b-8761-411c-93c4-e62d91f61174 NULL b1caac6b-8761-411c-93c4-e62d91f61174 cdadc188-3bea-4799-a4a5-bf31b5ba63bd x > 4c5e55b2-59ae-42de-96a0-7ebc72827c7d moderator moderator topic moderator {"title":"moderator","personal":false}

Which I then use as base structure for a forum type platform, through a simple sqlite view (CREATE VIEW AS that whole thing described above).

As for the potential DX for such a feature, I'm not sure I'm in a position to choose? But I do find the S-string format proposed by max-sixty fairly interesting

aljazerzen commented 1 year ago

Thank you for posting your use-case. This will be useful when designing recursive functions/CTEs/something else that will handle these cases.

max-sixty commented 1 year ago

Good blog post from @frankmcsherry on this a couple of days ago! https://github.com/frankmcsherry/blog/blob/master/posts/2022-12-25.md

snth commented 1 year ago

Hey folks, I will be posting my PRQL syntax proposal soon. In the meantime, if you really want to go in-depth on Recursive CTEs, here's a link to an academic paper that appears to be hot off the press:

Tweet: https://twitter.com/Teggy/status/1612401148459159552 Paper: A Fix for the Fixation on Fixpoints

snth commented 1 year ago

Recursive CTEs for PRQL

Recursive CTEs in SQL

Thanks for the link to @frankmcsherry 's blog post @max-sixty. My mental model of Recursive CTEs much simpler and basically just involves iteration. I'm certainly not an expert on the matter and welcome any corrections. I haven't looked up the details for Recursive CTEs in years but I have written a couple over that timeframe and they always worked the way I expected, and to me it also makes sense why the first example in that article only produced powers of two.

To me the term "Recursive CTE" is actually a bit of a misnomer and it's really a type of iteration.

Let's first look at the basic form of a Recursive CTE in SQL and a basic example and then I will explain my mental model:

WITH [RECURSIVE] expression_name (column_list) AS (
    -- Anchor member 
    initial_query
    UNION ALL
    -- Recursive member that references expression_name.
    recursive_query
)
-- references expression name
SELECT * FROM expression_name

So the simplest example I can think of is pretty much the same as @AntonioL posted in the origin comment:

WITH RECURSIVE t AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n+1 FROM t WHERE n<3
)
SELECT * FROM t;
You can try this out right now in the DuckDB Web Shell. This produces n
1
2
3

Basically you can think of this as:

n = 1
yield n
while True:
    n = n + 1
    yield n
    if not n<3:
        break

The general case is the same but rather than a loop variable n you have a loop set and iteration stops when the loop set is empty.

So in my mental model in a Recursive CTE rows are moved through 3 sets: "new" -> "intermediate" -> "final".

Basically something like the following in Python (thanks @aljazerzen ):

def loop(recursive_call, initial_call):
    final = []
    new = initial_call()
    while not empty(new):
        intermediate = new
        final = final + intermediate
        new = recursive_call(intermediate)
    return final

Derivation of PRQL Recursive CTE Syntax

A straightforward translation of the SQL Recursive CTE syntax to PRQL would be the following:

table recursive relation_name = (
    initial_pipeline
    append (
        recursive_pipeline # that references relation_name
        # i.e. starts with
        # from relation_name
    )
)

from relation_name

However I think we can do better than that. In particular I don't like the following about the SQL inspired form above:

That second point in particular I got from this comment on the recent HackerNews thread Microfeatures I'd like to see in more languages. Based on that comment I looked up Clojure's loop/recur syntax. I found the description in this page Clojure Guide: How to use loop and recur quite informative and that inspired most of what follows.

I think that SQL's Recursive CTEs actually work quite similarly to Clojure's loop/recur construct which as the one HN comment points out gives you a loop construct with just expressions.

Clojure's loop construct has three parts:

In our case, the base case is that when the loop is restarted with an empty set as the rebinding then the iteration stops. Since this is always the same we don't need to specify it and can leave it out of our construct.

We therefore get a function signature for the loop transform that is something like the following:

func loop distinct<bool>:false binding_name recursive_pipeline initial_pipeline -> pipeline

Our general Recursive CTE in PRQL would then look something like:

table relation_name = (
    initial_pipeline
    loop binding_name (
        recursive_pipeline # that references **binding_name**
        # i.e. starts with
        # from binding_name
    )
)

from relation_name

Given that there is now no longer any reference to relation_name in the loop construct, I was thinking that we actually no longer need the table definition in PRQL and could probably write this as:

initial_pipeline
loop binding_name (
    recursive_pipeline # that references **binding_name**
)

A basic recursive CTE example

So for example our basic sequence example could be written as:

select n=1
loop seq (
    from seq
    select n+1
    filter n<3
)

which would get translated into the following SQL:

WITH RECURSIVE table_0 AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n+1 FROM table_0 WHERE n<3
)
SELECT * FROM table_0;

i.e. the compiler replaces the binding_name seq with whatever name gets generated for the CTE, table_0 in this case.

A more realistic PRQL CTE example

Assume you have a table nodes of the form

CREATE TABLE nodes (
    id int,
    name varchar,
    parent_id int NULL
);

To walk the tree from the root node you would use something like the following:

from nodes
filter parent_id==null
select [parent_name=null, child_name=name]
loop parent (
    from parent
    join child=nodes [child.parent_id==parent.id]
    select [parent_name=parent.name, child_name=name]
)

Another example

Let's rewrite @teggy 's example using this new syntax.

The original example is as follows which has a nested WITH which I understand to be there to get around a restriction in PostgreSQL that allows only one reference to the recursive table name inside the recursive query part.

WITH RECURSIVE 
    t(n) AS (
        VALUES (1)
        UNION ALL
        (
            WITH t AS (SELECT * FROM t)
            SELECT t1.n + t2.n AS n
            FROM t AS t1, t AS t2
            WHERE t1.n < 256
        )
    )
SELECT * FROM t;
which produces the following output: n
1
2
4
8
16
32
64
128
256

DuckDB allows a simpler form without the nested WITH clause which is what I will use as my example because I don't know how to represent the nested WITH in PRQL.

WITH RECURSIVE 
    t(n) AS (
        VALUES (1)
        UNION ALL
        SELECT t1.n + t2.n AS n
        FROM t AS t1, t AS t2
        WHERE t1.n < 256
    )
SELECT * FROM t;

The proposed PRQL form would be:

select n=1
loop t (
    from t1=t
    join side:full t2=t
    select n = t1.n + t2.n
    filter t1.n<256
)

I think that makes it clearer why you only get powers of 2, since it is only looping and there is no recursion, i.e. both t1 and t2 refer to the same bound set of rows and since there is always only one row in each iteration they always refer to the same value.

Unfortunately I didn't click through to that Twitter thread until now and wasn't aware that some databases now allow "non-linear", i.e. true(?) recursion. My proposed syntax above probably won't work for that.

I wonder whether it's worth keeping this simpler form for looping constructs (as in Clojure) since that's also what's most broadly supported, and then think of a different syntax/construct for the non-linear case? That would also allow the compiler to error out for dialects that don't support non-linear recursion.

Other Considerations

Comparison with the group transform

I drew some inspiration from the group transform when thinking about this. To wit, the general form of how we write these currently is as follows:

from relation_name
group [cols] (
    group_transformation
)

While this is similar to how SQL handles GROUP BY, I would argue that it actually hides the set of rows that the group_transform is actually acting on. What I mean by that is that at a logical level we can imagine that the rows for each group are bound to a relation, call it group_relation, and the form above is a shorthand for the following:

from relation_name
group [cols] group_relation (
    from group_relation
    group_transformation    # referencing group_binding
)

For example if you look at how Pandas does groupby(...).apply(...), you'll see that it's essentially of the form df.groupby(...).apply(lambda grp_df: ...), i.e. for each group, the function argument supplied to apply is called and passed a DataFrame argument grp_df which contains all the rows in that group.

Most of the time you don't need the explicit reference to group_relation, so it's convenient and ergonomic to leave it out. I have two questions arising out of that:

Conclusion

The proposed PRQL syntax is as follows:

initial_pipeline
loop distinct:{false,true} binding_name (
    recursive_pipeline # that references **binding_name**
)

Further work:

max-sixty commented 1 year ago

Thanks a lot for the impressively thorough design @snth

I very much encourage those who use recursive CTEs to comment β€” either your views on the design, or a use-case which would / wouldn't work well. Thanks in advance

aljazerzen commented 1 year ago

This is a very nice write-up and documentation of language design process.

I really like the suggested design, because it:

Your proposal for unifying with group syntax makes sense:

select n=1
loop (          # this pipeline is missing the leading
    select n+1  # value which means that is evaluates
    filter n<3  # to a function relation -> relation
)

As I see it, loop basically takes two values: initial function and iteration function. It's just that in your proposal, iteration function is expressed with a binding name and a pipeline.

We could make this more general, by introducing syntax for lambda functions:

select n=1
loop (t ->
    from t1=t
    join side:full t2=t
    select n = t1.n + t2.n
    filter t1.n<256
)
snth commented 1 year ago

I read most of the A Fix for the Fixation on Fixpoints by @Teggy and @kryonix last night and think that my Python description of how SQL Recursive CTE's work is essentially correct.

The paper gives a pseudocode equivalent of the SQL Recursive CTE semantics as: image

where

Or rewritten in Pyhon:

def loop(q, q1):
    u = q1()
    w = u
    while True:
        i = q(w)
        if not i:
            break
        u = u + i
        w = i
    return u

and then compare that to my original Python description with the new symbols:

def loop(q, q1):
    u = []
    i = q1()
    while not empty(i):
        w = i
        u = u + w
        i = q(w)
    return u

The rest of the paper describes proposed extensions to the SQL Recursive CTE semantics and syntax to allow expressing some algorithms more ergonomically as well as allowing the database engine to process these more efficiently.

Regarding non-linear recursion, the most relevant part of the paper for me was the following:

image

My takeaway from that is that it doesn't sound like it changes anything syntactically so the proposed PRQL syntax should still be adequate.

Teggy commented 1 year ago

Co-author of A Fix for the Fixation of Fixpoints here.

It is a joy to see that recursive (iterative, really) query constructs are being considered for inclusion in PRQL. That decision is one of major consequences for the underlying language: in the case of SQL, it equipped the language with Turing-completeness (from then on, any computable function could be expressed in SQL β€” whether it should ..., well, that's different question altogether ;-).

Regarding (non-)linear recursion:

Most existing RDBMSs restrict the iterated part of recursive CTE (what Tobias calls q above and the paper refers to as q∞or q loop) to refer to the working table w (or intermediate, in Tobias' writeup) only once. This is the linear recursion restriction. Why that restriction? Nothing would keep SQL or q to refer to w twice or many times syntactically. (So, nothing would need to be changed in the PRQL syntax proposals above.)

The outcome of such a non-linear computation, however, can be counter-intuitive. The root cause is the fact that q only sees the most recent results of the immediately preceding iteration when it refers to table w (this treatment of w has been designed as an optimization, termed semi-naive evaluation). There is a folklore example that demonstrates the unexpected outcome of such a non-linear query which can be found in F. Bancilhon and R. Ramakrishnan, An Amateur’s Introduction to Recursive Query Processing Strategies, ACM SIGMOD Record, 15(2):16–52, June 1986, at the end of Section 3.2.2 (page 27). The ancestor example there is formulated in Datalog β€” if you find that syntax or the example non-accessible, let me know and I'll try to cook up the SQL equivalent. In a nutshell, the ancestor query result will skip over (exponentially growing numbers) of generations of ancestors, rendering the result somewhat useless. That behaviour would be different if w would always contain the rows of all preceding iterations.

What could be the consequences for PRQL?

Very interested to see how recursion/iteration finally find its way into PRQL. Keep up the good work, folks!

snth commented 1 year ago

As I see it, loop basically takes two values: initial function and iteration function. It's just that in your proposal, iteration function is expressed with a binding name and a pipeline.

We could make this more general, by introducing syntax for lambda functions:

Thanks @aljazerzen . That sounds like a beautiful generalisation.

The same would then also apply to the group transform right?

aljazerzen commented 1 year ago

Recursive CTEs have been a requested feature in all posts we had to date -not because people need it, but rather because as @Teggy noted, it makes SQL turing complete. Even if not really useful for everyday tasks, this issue does check a mark when considering PRQL adoption for a project.

It's great get advice of someone who has spent a great deal of time researching databases.

It seems that we should stick to linear recursion (or rather iteration) because we do need to target SQL right now and not many SQL dialects support non-linear recursive queries.

We could provide non-linear recursion via some other function, and clearly state supported dialects in the documentation. This can be done at a later stage.


The same would then also apply to the group transform right?

Yes it would. But it would throw some kind of error or even panic if you tried using input relation more than once:

from tracks
group genre_id (chunk ->
    from chunk
    join chunk
)
snth commented 1 year ago

Very much agree with your comments regarding sticking to linear recursion for now and our proposed syntax would also show that this is iteration and might demystify "Recursive" CTEs for many people.

Recursive CTEs have been a requested feature in all posts we had to date -not because people need it, but rather because as @Teggy noted, it makes SQL turing complete. Even if not really useful for everyday tasks

The Turing Completeness is a great property to have but as @Teggy also noted, whether you should use that and whether SQL is the right tool for the job for general problems is another question.

More importantly though, I think people requested Recursive CTEs because they need them to solve real problems.

So I think it very much is a real need for people and goes beyond a nice to have theoretical property. I therefore fully support providing a solution to the iterative case which people need right now and deferring worrying about the general recursive case which is not widely supported by RDBMs anyway.


That's great regarding the group construct.

As @Teggy noted, we should have a syntactic restriction or error on multiple uses of the input relation in the loop case as well since that is also not supported by most RDBMs for Recursive CTEs.

snth commented 1 year ago

A potentially rich source of Recursive CTEs: https://hub.getdbt.com/jpmmcneill/dbt_graph_theory/latest/

snth commented 1 year ago

Example from: https://twitter.com/matsonj/status/1621031377616646144 (in turn from https://youtu.be/pVklntbGiNo?t=220).

image

First attempt at rewriting this in the proposed PRQL syntax. I think we don't have two of the constructs yet (ARRAY and EXISTS) so I have capitalised those:

from follows
join p1=person [p1.id==folllows.p1id, p1.name=='Bob']
select [startNode=p1id, endNode=p2id, path=ARRAY[p1id, p2id],]
loop (paths ->
    from paths
    join follows [paths.endNode==follows.p1id]
    filter ! EXISTS (
        from previous_paths=paths
        join p2=person [p2.id==follows.p2id]
        filter p2.name=='Bob' or follows.p2id==previous_paths.endNode
        )
    select [startNode=paths.startNode, endNode=p2id, path=array_append path p2id]
    )
join p1=person [p1.id==startNode]
join p2=person [p2.id==endNode]
join city [city.id==p2.livesIn, city.name=='Utrecht']
snth commented 1 year ago

A problem with the query above is that paths shouldn't actually be exposed in the scope of the main pipeline because it's only defined in the lambda argument of loop. That's why in the last 3 joins above I actually surpressed the references to paths from the original query (and because I don't think we actually need) them.

Following the same semantics for defining aliases that we use in from alias=relation and join alias=relation, I suggest a slight tweak to the syntax that we discussed above, i.e. introducing an alias name into the loop syntax as loop alias=(recursive_pipeline). With this the query would be:

from follows
join p1=person [p1.id==folllows.p1id, p1.name=='Bob']
select [startNode=p1id, endNode=p2id, path=ARRAY[p1id, p2id],]
loop paths=(ps ->
    from ps
    join follows [ps.endNode==follows.p1id]
    filter ! EXISTS (
        from previous_paths=ps
        join p2=person [p2.id==follows.p2id]
        filter p2.name=='Bob' or follows.p2id==previous_paths.endNode
        )
    select [startNode=ps.startNode, endNode=p2id, path=array_append path p2id]
    )
join p1=person [p1.id==paths.startNode]
join p2=person [p2.id==paths.endNode]
join city [city.id==p2.livesIn, city.name=='Utrecht']

@aljazerzen what do you think? I think this is actually better to what we had before because it also demonstrates how the paths table/relation is built up from the loop.

aljazerzen commented 1 year ago

I don't see how scope of paths is a problem. The result of loop contains the three columns that you need:

[startNode, endNode, path]

From what I gather, your new proposal would expose the same thing, columns would just be named:

[paths.startNode, paths.endNode, paths.path]

Same effect can be achieved with a select, no?

aljazerzen commented 1 year ago

Generally I like the idea of the new proposal, but I don't think it fits here. loop name=pipeline would assign a name to the pipeline (a function), and not the relation as it does with from and join.

snth commented 1 year ago

The loop example from the book

from_text format:json '[{"n": 1 }]'
loop (
    filter n<4
    select n = n+1
)

produces the following SQL

WITH table_0 AS (
  SELECT
    1 AS n
),
table_4 AS (
  WITH RECURSIVE loop AS (
    SELECT
      n
    FROM
      table_0 AS table_1
    UNION
    ALL
    SELECT
      n + 1
    FROM
      loop AS table_2
    WHERE
      n < 4
  )
  SELECT
    *
  FROM
    loop
)
SELECT
  n
FROM
  table_4 AS table_3

-- Generated by PRQL compiler version:0.6.0 (https://prql-lang.org)

The behaviour of the compiler that I have observed is that the Recursive CTE is always called "loop" and it is then wrapped in another CTE to provide a unique name. This seems inefficient in terms of the length of the SQL produced. Is the wrapping required in order to allow for the RECURSIVE? I noticed that in the DuckDB online shell some examples didn't work without it.

More importantly though it doesn't allow for referring back to the accumulating result set without referring to the _frame which is an internal implementation detail that should be hidden from the user. (I'll try to produce a minimal example of this in a follow up.)

Therefore I propose that we implement a binding name as per my previous comment here: https://github.com/PRQL/prql/issues/407#issuecomment-1414402724 This can be optional (as in from or join) and default to "loop" in that case. Otherwise the name provided should be used.

With that the canonical example could be written as:

from_text format:json '[{"n": 1 }]'
loop loop_name=(
    filter n<4
    select n = n+1
)

and the SQL produced could be changed to:

WITH table_0 AS (
  SELECT
    1 AS n
),
table_4 AS (
  WITH RECURSIVE loop_name AS (
    SELECT
      n
    FROM
      table_0 AS table_1
    UNION
    ALL
    SELECT
      n + 1
    FROM
      loop_name AS table_2
    WHERE
      n < 4
  )
  SELECT
    *
  FROM
    loop_name
)
SELECT
  n
FROM
  table_4 AS table_3

-- Generated by PRQL compiler version:0.6.0 (https://prql-lang.org)
aljazerzen commented 1 year ago

The behaviour of the compiler that I have observed is that the Recursive CTE is always called "loop" and it is then wrapped in another CTE to provide a unique name. This seems inefficient in terms of the length of the SQL produced. Is the wrapping required in order to allow for the RECURSIVE? I noticed that in the DuckDB online shell some examples didn't work without it.

In SQL, RECURSIVE be used only directly after WITH keyword. This means that loop should always translate to the first of the WITH CTEs, which is not guaranteed to happen and actually almost never happens with real queries. So I constructed a CTE that contains a nested WITH clause, which can be recursive, since it is always the first one.

Quite verbose on SQL side, but it does work. We could try to optimize it, but I'm not sure if it is worth the effort.


Therefore I propose that we implement a binding name as per my previous comment here: https://github.com/PRQL/prql/issues/407#issuecomment-1414402724

I still don't quite understand what loop_name= would mean here.

Which one of these two would it allow?

from_text format:json '[{"n": 1 }]'
loop loop_name=(
    filter n<4
    select n = n+1
)
select [loop_name.n]

... or:

from_text format:json '[{"n": 1 }]'
loop loop_name=(
    filter n<4
    join other_table [loop_name.n == other_table.n]
    select n = n+1
)
snth commented 1 year ago

In SQL, RECURSIVE be used only directly after WITH keyword. ... So I constructed a CTE that contains a nested WITH clause, which can be recursive, since it is always the first one.

Thanks, I only learned that today. Clever workaround, very nice! I also learned on the DuckDB Discord today (link) though that while it's true that the RECURSIVE keyword has to immediately follow the WITH keyword if there is a Recursive CTE present, it doesn't actually matter which, if any, of the CTEs is recursive though (at least this holds for DuckDB and Postgres). Moreover there doesn't appear to be any negative performance impact to having the RECURSIVE keyword when it's not required. Therefore it seems to me that this presents us with two options:

Then my modified form of the loop example from the book:

from_text format:json '[{"n": 1 }]'
loop loop_name=(
    filter n<4
    select n = n+1
)

could produce the following SQL:

WITH RECURSIVE table_0 AS (
  SELECT
    1 AS n
),
loop_name AS (
    SELECT
      n
    FROM
      table_0
    UNION
    ALL
    SELECT
      n + 1
    FROM
      loop_name
    WHERE
      n < 4
)
SELECT
    n
FROM
    loop_name

I've tested that SQL in both DuckDB and PostgreSQL and it works in both!

snth commented 1 year ago

I still don't quite understand what loop_name= would mean here.

Which one of these two would it allow?

Both!

If you look at the example SQL for our example

WITH RECURSIVE table_0 AS (
  SELECT
    1 AS n
),
loop_name AS (
    SELECT
      n
    FROM
      table_0
    UNION
    ALL
    SELECT
      n + 1
    FROM
      loop_name
    WHERE
      n < 4
)
SELECT
    n
FROM
    loop_name

you see that loop_name is both the name of the CTE which is accessible downstream from the Recursive CTE as well as the name of the results from the previous iteration in the update part of the CTE, i.e. inside the argument of loop in our syntax.

Therefore both the PRQL loop with loop_name examples that you provided should work.

In some of the examples that I'm busy looking at and writing up, this is required. For example in the moving average calculation

let stream = (
    from data
    select [step=row_number, x=`Adj Close`]
    sort [step]
)

from stream
filter step==1
select [step, x, avg=x]
loop (
    join stream [stream.step==_frame.step+1]
    select [stream.step, stream.x, avg=_frame.avg + (stream.x - _frame.avg) / stream.step]
)
sort [-step]
take 5

I'm having to resort to accessing frame whereas with the loop_name syntax we could write this as:

let stream = (
    from data
    select [step=row_number, x=`Adj Close`]
    sort [step]
)

from stream
filter step==1
select [step, x, avg=x]
loop state=(
    join stream [stream.step==state.step+1]
    select [stream.step, stream.x, avg=state.avg + (stream.x - state.avg) / stream.step]
)
sort [-step]
take 5
aljazerzen commented 1 year ago

if there is a Recursive CTE present, it doesn't actually matter which, if any, of the CTEs is recursive though

Woah, I did not expect that! It also appears to hold for MySQL and SQLite3.

doesn't appear to be any negative performance impact

You cannot measure performance impacts on these queries. You'd need to have at least 1MB of data and do a few re-runs to have a reliable level of certainty. Or an opinion from people who know how RECURSIVE is implemented in an engine.

But if it really does not have a performance impact, it wouldn't be hard to detect when RECURSIVE is needed and go with your first option.

aljazerzen commented 1 year ago

Both!

I'm having to resort to accessing frame whereas with the loop_name syntax we could write this as:

I see, right.

I would concur if you meant only my first example (loop_name is available after loop transform in the pipeline), but I don't think the behavior in my second example is a good idea. I have a few fiddly reasons all stemming from my interpretation of PRQL name resolution, but before we get into that:

What do you think about adding syntax for lambda functions to deal with this?

from_text format:json '[{"n": 1 }]'
loop (loop_name ->
    loop_name
    filter loop_name.n<4
    select n = n+1
)

Sidenote:

If you look at the example SQL for our example

Let's try to refrain from using SQL implementation as a basis for PRQL language design. It's leads to thinking in terms for SQL, instead of relations.

snth commented 1 year ago

The performance impact statement was a comment from Mark, one of the DuckDB core devs. It's in the DuckDB discord discussion I linked to above.

aljazerzen commented 1 year ago

Update: BigQuery also allows "other than first recursive CTEs"

aljazerzen commented 1 year ago

@snth What's left to do here? Lambda functions? Nicer SQL output?

snth commented 1 year ago

I'll get back to this again soon.

snth commented 1 year ago

@aljazerzen , because loop defines a new, recursive relation, we need a way to reference that relation because of the recursive part. For the simple examples that are in our current documentation it works without that because unqualified columns reference the current frame. However this breaks down with join, which is the most common use case for real-life recursive CTEs, because our current join specification forces you to explicitly name the source relations.

So for example the following simple example doesn't work in the playground.

from employees
filter last_name=="Mitchell"
loop (
  join manager=employees [manager.employee_id==reports_to]
  select manager.*
)

Given compiler code changes and refactors since we last discuss this, how do you now feel about the loop name=(pipeline) proposal? Do your concerns from https://github.com/PRQL/prql/issues/407#issuecomment-1464982902 still hold?

It would be great if we could get either this or the lambda syntax into 0.9 still then I can write up some examples with the release announcement.

aljazerzen commented 1 year ago

I do still have reservations toward name=(pipeline).

The upside is that in the meantime, the syntax for lamdba functions was added!

from employees
filter last_name=="Mitchell"
loop (prev_step -> (
  employee = prev_step
  join manager=employees (employee.reports_to == manager.employee_id)
  select manager.*
))

I know you are going to ask "why the employee = prev_step" and my answer is "it's a consequence of current pipeline name inference rules".

snth commented 1 year ago

So you're saying the following doesn't work:

from employees
filter last_name=="Mitchell"
loop (employee -> (
  employee
  join manager=employees (employee.reports_to == manager.employee_id)
  select manager.*
))

I assume there is nothing special about the name prev_step, just that you need one reassignment?


The following works on current main:

from employees
filter last_name=="Mitchell"
loop (
  join manager=employees (manager.employee_id==_frame.reports_to)
  select manager.*
)

and produces

WITH table_2 AS (
  WITH RECURSIVE _loop AS (
    SELECT
      *
    FROM
      employees
    WHERE
      last_name = 'Mitchell'
    UNION
    ALL
    SELECT
      manager.*
    FROM
      _loop AS table_0
      JOIN employees AS manager ON manager.employee_id = table_0.reports_to
  )
  SELECT
    *
  FROM
    _loop
)
SELECT
  *
FROM
  table_2 AS table_1

-- Generated by PRQL compiler version:0.8.1 (https://prql-lang.org)

I'm not sure what's easier to explain, the use of _frame or prev_step.

I guess the benefit of _frame is that you don't have to do any more work and I can just add some more documentation to the release.


(Incidentally, I keep having issues with this in the playground. I'll try again and open an issue if need be.)

aljazerzen commented 1 year ago

So you're saying the following doesn't work:

Yes, exactly.

I assume there is nothing special about the name prev_step, just that you need one reassignment?

prev_step is a variable that contains a relation, while employee is a tuple within the relation.

But yes, _frame is probably easier to explain.

I wouldn't want to document that as a part of the language - the _ screams "internal" or "unused". We could at least change the name to relation or something.


Playground is on 0.8, it probably does not support most of this.

saolof commented 1 month ago

Just chiming in with the most urgently necessary usecase of this. The sqlite docs has a mandelbrot fractal display and a sudoku solver as two of the primary examples of CTE usage:

https://www.sqlite.org/draft/lang_with.html#outlandish_recursive_query_examples

Of course, having both window functions and recursive CTEs is what makes basic SQL turing complete, since you can implement a cyclic tag system with that. See slide 47 here: https://cdn.oreillystatic.com/en/assets/1/event/27/High%20Performance%20SQL%20with%20PostgreSQL%20Presentation.pdf

snth commented 1 month ago

Hi @saolof ,

Thank you for the link to those examples. They are great!

This has been implemented as the loop transform. See the docs here: https://prql-lang.org/book/reference/stdlib/transforms/loop.html

I also have some example demos in this Colab notebook: https://colab.research.google.com/drive/146ohCI_WlnOKpQQ6sPMWu4uKGqZvwaIZ#scrollTo=loctSkQYFA_w

Unfortunately there's a regression bug in latest version which we'll resolve. In the meantime you can pin to a previous version like I do in that notebook.

Edit: For an equally outlandish query, you can also look at my Calculate the digits of Pi with DuckDB and PRQL blogpost. πŸ˜€

saolof commented 1 month ago

Ah neat. Does loop as it exists now support dynamic programming and graph traversals?

A common usecase is when you have a table, and want to join it with an ancestor relation, but only have a parent relation in the DB because the ancestor relation would be prohibitively large. In those cases the dynamic aspect (being able to use union instead of union all) can be rather important to avoid double-counting paths.

snth commented 1 month ago

Yes, the paragraph linked here https://colab.research.google.com/drive/146ohCI_WlnOKpQQ6sPMWu4uKGqZvwaIZ#scrollTo=loctSkQYFA_w goes through exactly that, variable depth tree traversal.

Good point about UNION vs UNION ALL. It uses UNION ALL by default. I can't recall offhand whether there's a parameter to ask for UNION instead. We should add that if not.