eitsupi opened this issue 1 year ago
Excellent write up @eitsupi !
~~I'd split these sorts of functions — we can call them Abstraction Over Columns — into two broad categories:~~
- ~~Abstracting over the column names — e.g. `SELECT price_* from foo`~~
- ~~Abstracting over column values — e.g. `WHERE COLUMNS(*) IS NOT NULL`~~

(Edit: I don't think these are that different on reflection)

~~I think the requirements are similar — either PRQL needs to know the column names itself, or the DB needs to support the abstraction. Without a DB abstraction, the SQL for Names is fairly concise; the SQL for Values is quite verbose, since it requires listing a set of conditions.~~
Overall, I'm open to these. My take is that they're more complicated than the standard PRQL because of those constraints. When we have DB cohesion, they'll be more powerful, since providing the input columns won't be necessary. My focus for the project would still be to make the basic language more robust (we're still finding occasional bugs, many of which remain!), but there's no reason we can't do these in parallel.
What are others' thoughts?
I quite like the way DuckDB does this with `COLUMNS(*)`. It seems quite concise and practical. The main downside to me is that it's a bit magical, and I would like whatever we come up with to be more explicit.
Not having thought about it too much yet, it seems to me that this veers into the territory of macros. We will probably need those anyway — I've been thinking about them in the context of functions generalising over idents (maybe that can be done differently, but that's how I've been thinking about it so far).
I'd split these sorts of functions — we can call them Abstraction Over Columns — into two broad categories:
- Abstracting over the column names — e.g. `SELECT price_* from foo`
- Abstracting over column values — e.g. `WHERE COLUMNS(*) IS NOT NULL`
Are these really different? My understanding from what I've read about `COLUMNS` in the DuckDB docs is that it really behaves like a macro expansion. So assuming we have a table `foo` with many columns, including three named `price_a`, `price_b` and `price_c`, then `SELECT MIN(COLUMNS(price_*)) FROM foo` becomes `SELECT MIN(price_a), MIN(price_b), MIN(price_c) FROM foo`, and `WHERE COLUMNS(price_*) IS NOT NULL` becomes `WHERE price_a IS NOT NULL AND price_b IS NOT NULL AND price_c IS NOT NULL`.
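To make that expansion concrete, here is a toy Python sketch of how such a macro-style rewrite could work. The helper names, templates, and glob matching are illustrative assumptions, not DuckDB's actual mechanism:

```python
# Toy model of DuckDB-style COLUMNS(...) expansion (illustrative only).
from fnmatch import fnmatch

def matching_columns(pattern, schema):
    """Columns in `schema` whose names match a glob like 'price_*'."""
    return [c for c in schema if fnmatch(c, pattern)]

def expand_select(template, pattern, schema):
    """e.g. MIN(COLUMNS(price_*)) -> MIN(price_a), MIN(price_b), MIN(price_c)"""
    return ", ".join(template.format(c) for c in matching_columns(pattern, schema))

def expand_where(template, pattern, schema):
    """e.g. COLUMNS(price_*) IS NOT NULL -> price_a IS NOT NULL AND ..."""
    return " AND ".join(template.format(c) for c in matching_columns(pattern, schema))

schema = ["id", "price_a", "price_b", "price_c"]
print(expand_select("MIN({})", "price_*", schema))
# MIN(price_a), MIN(price_b), MIN(price_c)
print(expand_where("{} IS NOT NULL", "price_*", schema))
# price_a IS NOT NULL AND price_b IS NOT NULL AND price_c IS NOT NULL
```

Note how the combinator (`, ` vs `AND`) has to be chosen per clause — which is exactly the "magical" part questioned below.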
The part of this that seems a bit magical to me: how does it know that the expansion stops at the `MIN` in the `SELECT`? And in the `WHERE` clause example, why are the conditions combined with `AND` and not `OR`?
I would propose implementing this with a macro named `columns!`, using a `!` to denote the macro as in Rust. Given our lack of parentheses, it's not clear to me how the compiler knows which parts to replicate. So the first example would be something like
`select [min columns! price_*]`
Is that clear enough, or does it need parentheses?

`select [min (columns! price_*)]`
What if you have multiple functions?
`select [columns! price_* | abs | min, columns! price_* | abs | max]`
Could this be written better?
Looking at the second case:

`filter (columns! price_*) != null`
I'm not sure about any of this. Just exploring some ideas really. Thoughts?
> I think the requirements are similar — either PRQL needs to know the column names itself, or the DB needs to support the abstraction.
Agree with this. Without knowledge of the schema, PRQL could only translate to the DuckDB syntax (or an equivalent) where that's available. Once we are able to operate on a cached schema definition, we could do the column-name expansions at compilation time.
> Are these really different?
No, on reflection, not really! I've updated my comment. Thanks.
What's the difference between a macro and a function? A function operates on a column, and a macro generates column names?
I would vote quite strongly against introducing a totally new language concept like macros given the current state of the language & project — adding this will add to the maintenance and complexity burden. (I'm a big +1 on things like the module system & type system; I'm not saying "no new features" — but we should be balancing the cost of new features against our confidence that they're going to make the language more useful.)
In Rust, the difference is that macros operate on the syntactic level, before any type checking or other semantic analysis. But this is already very close to our functions.
We may be able to get this working without new syntax:
- `[price_*]` is expanded into `[price_a, price_b, price_c]`,
- `map`: `aggregate ([price_a, price_b, price_c] | map min)`,
- `any`: `filter (any [false, true, false])`, so `filter ([price_*] | map (x -> x != null) | any)`.

This does require more parentheses and knowledge of functional programming, but it is not magic.
For beginners, we could have a snippet to copy-paste / a library / a package that would contain useful stuff like:

`func any_is_not_null cols -> (cols | map (x -> x != null) | any)`
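For comparison, the `map`/`any` pipeline above has a direct analogue in ordinary functional code. A hedged Python sketch, with rows modelled as dicts (the helper name mirrors the PRQL snippet; everything else is illustrative):

```python
# Rough Python analogue of:
#   filter ([price_*] | map (x -> x != null) | any)
def any_is_not_null(row, cols):
    """True if at least one of the named columns in this row is non-null."""
    return any(row[c] is not None for c in cols)

rows = [
    {"price_a": 1,    "price_b": None},
    {"price_a": None, "price_b": None},
]
kept = [r for r in rows if any_is_not_null(r, ["price_a", "price_b"])]
print(len(kept))  # 1  (only the first row survives)
```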
FYI, I was reading the source code of Polars and noticed that there are `starts_with` and `ends_with` functions in Polars SQL:
https://github.com/pola-rs/polars/blob/8c09c85c782dfde18e6c2cb0ce0644d433dd887c/polars/polars-sql/src/functions.rs#L104-L115

`SELECT STARTS_WITH(column_1, 'a') from df;`
`SELECT column_2 from df WHERE ENDS_WITH(column_1, 'a');`
I'd guess that `STARTS_WITH` checks whether the string in `column_1` starts with `'a'`, and does not select all columns whose names start with `column_a`.
Am I missing something, why is this relevant?
> Am I missing something, why is this relevant?
My apologies, my misunderstanding!
Column selection with tidyselect is a very powerful feature of dplyr. https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html
Similar features have recently been introduced in ibis. https://github.com/ibis-project/ibis/pull/5307
There is also a dbt package that mimics this functionality. https://github.com/emilyriederer/dbtplyr
I believe that many functions of tidyselect can be realized only when the schema is known, but I know that DuckDB has `COLUMNS()`, so some functions can be realized without it (duckdb/duckdb#6621). So I am wondering if it is possible to partially implement this in PRQL, like `select ![foo]` -> `SELECT * EXCLUDE foo`.

The following examples use dplyr and DuckDB on R. DuckDB is the unreleased edge version (0.8.0), installed from R-universe.
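One way to see what a `select ![foo]`-style exclusion could compile to, once the schema is known, is a simple filter over the column list. A Python sketch (the helper name is made up for illustration):

```python
def select_excluding(schema, excluded):
    """Expand a `select ![...]`-style exclusion into an explicit column
    list — what a compiler could emit when it knows the schema."""
    excluded = set(excluded)
    return [c for c in schema if c not in excluded]

print(select_excluding(["foo", "bar", "baz"], ["foo"]))
# ['bar', 'baz']
```

Without the schema, this is only expressible where the backend offers something like DuckDB's `EXCLUDE`.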
### Filter
The most common use would be to remove rows that contain null in any of the selected columns.
The tidyverse also has a dedicated function called `tidyr::drop_na`, but as a result of the introduction of `dplyr::across` in dplyr 1.0, it can be executed by dplyr alone (`if_all` and `if_any` are functions derived from `across`). Using dbplyr, we can see that the dplyr query is converted to SQL like the following.
The next version of DuckDB can execute the same operation by using `COLUMNS()`. It seems equivalent to dplyr's `filter(if_all())`.
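The `filter(if_all())` semantics — keep a row only when every selected column is non-null — can be modelled in a few lines of Python. This is a sketch of the behaviour, not dplyr's implementation:

```python
def drop_na(rows, cols):
    """Keep rows where every named column is non-null, modelling
    dplyr's filter(if_all(cols, ~ !is.na(.x))) / tidyr::drop_na."""
    return [r for r in rows if all(r[c] is not None for c in cols)]

rows = [
    {"a": 1, "b": 2},
    {"a": 1, "b": None},
]
print(len(drop_na(rows, ["a", "b"])))  # 1
```

Contrast with `if_any`, which would keep a row when at least one selected column is non-null — the `AND`-vs-`OR` choice discussed earlier in the thread.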
### Select
Columns can be selected using regular expressions.
In DuckDB, this could be done as follows.
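As a rough model of regex-based column selection — analogous to tidyselect's `matches()` and DuckDB's `COLUMNS('regex')` — here is a Python sketch (names illustrative):

```python
import re

def select_matching(schema, pattern):
    """Pick columns whose names match a regular expression."""
    rx = re.compile(pattern)
    return [c for c in schema if rx.search(c)]

print(select_matching(["id", "price_a", "price_b"], r"^price_"))
# ['price_a', 'price_b']
```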
### Update columns
We can update the selected columns.
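Updating a selection of columns is essentially a map over the matching fields. A hedged Python sketch of `mutate(across(...))`-style behaviour (function names are made up for illustration):

```python
def mutate_across(row, cols, fn):
    """Apply fn to each selected column, leaving the others untouched,
    like dplyr's mutate(across(cols, fn))."""
    cols = set(cols)
    return {c: (fn(v) if c in cols else v) for c, v in row.items()}

row = {"id": 1, "price_a": 2.0, "price_b": 3.0}
print(mutate_across(row, ["price_a", "price_b"], lambda x: x * 100))
# {'id': 1, 'price_a': 200.0, 'price_b': 300.0}
```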
Unfortunately, it is difficult to manipulate column names in DuckDB with `COLUMNS(*)`.

### Select columns using lambda function
Note that tidyselect can also be used to select columns using lambda functions, but this is not supported by dbplyr.
cf. https://news.ycombinator.com/item?id=30067462