PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.66k stars 209 forks source link

Add sampling extension support to PRQL #4425

Open hsight opened 2 months ago

hsight commented 2 months ago

What's up?

It would be great to handle table sampling in PRQL.

Examples:

kgutwin commented 2 months ago

I really like this idea, it could come in handy!

I wondered if it would be possible to approximate this using a PRQL function; that might allow you to try it out before it's implemented natively within PRQL. I put this function together which works! (sorta... see below)

let sample = func
  nrows <float>
  tbl <relation>
  -> <relation> (
  from s"SELECT * FROM {tbl} USING SAMPLE {nrows}"
)

from invoices
sample 20
select { customer_id }

This outputs the following SQL:

WITH table_0 AS (
  SELECT
    *
  FROM
    invoices USING SAMPLE 20
)
SELECT
  customer_id
FROM
  table_0

-- Generated by PRQL compiler version:0.11.3 (https://prql-lang.org)

However, I start to get strange errors if I modify the pipeline preceding the sample call. If I move the select { customer_id } before the sample 20, I get the error:

from invoices
select { customer_id }
sample 20

Error: 
   ╭─[:9:1]
   │
 9 │ select { customer_id }
   │ ───────────┬──────────  
   │            ╰──────────── unexpected ``(Select ...)``
   │ 
   │ Help: this is probably a 'bad type' error (we are working on that)
───╯

Perhaps I'm using the S-string wrong somehow... does anyone have suggestions?

max-sixty commented 2 months ago

@kgutwin great use of s-strings!

The one change to fix that error is:

let sample = func
-  nrows <float>
+  nrows <int>
  tbl <relation>
  -> <relation> (
  from s"SELECT * FROM {tbl} USING SAMPLE {nrows}"
)

I agree sample is useful. There are a lot of options for sample in DuckDB, which makes it a bit harder to add (though still tractable...). Possibly we could add it to the docs as a nice example of s-strings initially.

Would be very open to a PR for either!

aljazerzen commented 2 months ago

We need to do a survey of supported features in these databases and come up with an interface that will allow (eventually) supporting all of those features.