PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0

`tidyselect` like column selection and update #2386

Open eitsupi opened 1 year ago

eitsupi commented 1 year ago

Column selection with tidyselect is a very powerful feature of dplyr. https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html

Similar features have recently been introduced in ibis. https://github.com/ibis-project/ibis/pull/5307

There is also a dbt package that mimics this functionality. https://github.com/emilyriederer/dbtplyr

I believe that many tidyselect functions can be realized only when the schema is known, but DuckDB has COLUMNS() and can realize some of them without schema knowledge on the client side (duckdb/duckdb#6621). So I am wondering if it is possible to partially implement this in PRQL, e.g. `select ![foo]` -> `SELECT * EXCLUDE foo`.
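To make the idea concrete, here is a minimal Python sketch of the two ways such a negated selection could be lowered: delegating to DuckDB's EXCLUDE, or expanding against a known schema. The `compile_exclude` helper is hypothetical, purely for illustration, and is not PRQL internals.

```python
# Sketch of lowering a hypothetical `select ![foo]`, assuming the target is
# DuckDB (which supports EXCLUDE) or that the compiler has a cached schema.
# `compile_exclude` is an illustrative name, not an existing API.

def compile_exclude(excluded, schema=None):
    """Compile a negated column selection into SQL."""
    if schema is None:
        # No schema knowledge: delegate the abstraction to DuckDB.
        return f"SELECT * EXCLUDE ({', '.join(excluded)}) FROM df"
    # Schema known: expand to an explicit column list at compile time.
    kept = [c for c in schema if c not in set(excluded)]
    return f"SELECT {', '.join(kept)} FROM df"

print(compile_exclude(["foo"]))                         # DB-side abstraction
print(compile_exclude(["foo"], schema=["foo", "bar", "baz"]))  # expansion
```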

The following examples use dplyr and DuckDB in R. DuckDB is the unreleased development version (0.8.0), installed from R-universe.

df <- readr::read_csv(
  "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/v0.1.0/inst/extdata/penguins.csv",
  show_col_types = FALSE
)

df
#> # A tibble: 344 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
#>    <chr>   <chr>              <dbl>         <dbl>      <dbl>   <dbl> <chr> <dbl>
#>  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
#>  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
#>  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
#>  4 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
#>  5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
#>  6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
#>  7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
#>  8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
#>  9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
#> 10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
#> # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#> #   ²body_mass_g

con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_register(con, "df", df)

Filter

The most common use would be to remove rows that contain null in any column.

The tidyverse also has a dedicated function, tidyr::drop_na, but since the introduction of dplyr::across in dplyr 1.0, the same operation can be executed with dplyr alone (if_all and if_any are functions derived from across).

df |>
  dplyr::filter(!dplyr::if_any(tidyselect::everything(), is.na))
#> # A tibble: 333 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
#>    <chr>   <chr>              <dbl>         <dbl>      <dbl>   <dbl> <chr> <dbl>
#>  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
#>  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
#>  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
#>  4 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
#>  5 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
#>  6 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
#>  7 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
#>  8 Adelie  Torgersen           41.1          17.6        182    3200 fema…  2007
#>  9 Adelie  Torgersen           38.6          21.2        191    3800 male   2007
#> 10 Adelie  Torgersen           34.6          21.1        198    4400 male   2007
#> # … with 323 more rows, and abbreviated variable names ¹flipper_length_mm,
#> #   ²body_mass_g

df |>
  dplyr::filter(dplyr::if_all(tidyselect::everything(), \(x) !is.na(x)))
#> # A tibble: 333 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
#>    <chr>   <chr>              <dbl>         <dbl>      <dbl>   <dbl> <chr> <dbl>
#>  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
#>  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
#>  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
#>  4 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
#>  5 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
#>  6 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
#>  7 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
#>  8 Adelie  Torgersen           41.1          17.6        182    3200 fema…  2007
#>  9 Adelie  Torgersen           38.6          21.2        191    3800 male   2007
#> 10 Adelie  Torgersen           34.6          21.1        198    4400 male   2007
#> # … with 323 more rows, and abbreviated variable names ¹flipper_length_mm,
#> #   ²body_mass_g

Using dbplyr, we can see that the dplyr query is converted to SQL like the following.

dplyr::tbl(con, "df") |>
  dplyr::filter(!dplyr::if_any(tidyselect::everything(), is.na)) |>
  dplyr::show_query()
#> <SQL>
#> SELECT *
#> FROM df
#> WHERE (NOT((species IS NULL) OR (island IS NULL) OR (bill_length_mm IS NULL) OR (bill_depth_mm IS NULL) OR (flipper_length_mm IS NULL) OR (body_mass_g IS NULL) OR (sex IS NULL) OR ("year" IS NULL)))
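This kind of expansion only needs the column list. Below is a minimal Python sketch of how a compiler that knows the schema could generate the WHERE clause above; the `expand_if_any` helper and its format-string convention are assumptions for illustration, not dbplyr internals.

```python
# Sketch of the compile-time expansion dbplyr performs above, assuming the
# compiler knows the full column list. `expand_if_any` is a hypothetical
# helper name, not a dbplyr or PRQL API.

def expand_if_any(columns, predicate_sql):
    """Expand if_any(everything(), pred) into an OR chain over known columns."""
    return " OR ".join(f"({predicate_sql.format(col=col)})" for col in columns)

columns = ["species", "island", "bill_length_mm", "bill_depth_mm",
           "flipper_length_mm", "body_mass_g", "sex", '"year"']
condition = expand_if_any(columns, "{col} IS NULL")
query = f"SELECT *\nFROM df\nWHERE (NOT({condition}))"
print(query)
```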

The next version of DuckDB can execute the same operation by using COLUMNS(). It seems equivalent to dplyr's filter(if_all()).

duckdb::sql("FROM df WHERE COLUMNS(*) IS NOT NULL", con) |>
  dplyr::as_tibble()
#> # A tibble: 333 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
#>    <chr>   <chr>              <dbl>         <dbl>      <dbl>   <dbl> <chr> <dbl>
#>  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
#>  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
#>  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
#>  4 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
#>  5 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
#>  6 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
#>  7 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
#>  8 Adelie  Torgersen           41.1          17.6        182    3200 fema…  2007
#>  9 Adelie  Torgersen           38.6          21.2        191    3800 male   2007
#> 10 Adelie  Torgersen           34.6          21.1        198    4400 male   2007
#> # … with 323 more rows, and abbreviated variable names ¹flipper_length_mm,
#> #   ²body_mass_g

Select

Columns can be selected using regular expressions.

df |>
  dplyr::select(tidyselect::matches("_mm$"))
#> # A tibble: 344 × 3
#>    bill_length_mm bill_depth_mm flipper_length_mm
#>             <dbl>         <dbl>             <dbl>
#>  1           39.1          18.7               181
#>  2           39.5          17.4               186
#>  3           40.3          18                 195
#>  4           NA            NA                  NA
#>  5           36.7          19.3               193
#>  6           39.3          20.6               190
#>  7           38.9          17.8               181
#>  8           39.2          19.6               195
#>  9           34.1          18.1               193
#> 10           42            20.2               190
#> # … with 334 more rows
dplyr::tbl(con, "df") |>
  dplyr::select(tidyselect::matches("_mm$")) |>
  dplyr::show_query()
#> <SQL>
#> SELECT bill_length_mm, bill_depth_mm, flipper_length_mm
#> FROM df

In DuckDB, this could be done as follows.

duckdb::sql("FROM df SELECT COLUMNS(x -> x SIMILAR TO '.*_mm$')", con) |> # `SIMILAR TO` doesn't match partial strings
  dplyr::as_tibble()
#> # A tibble: 344 × 3
#>    bill_length_mm bill_depth_mm flipper_length_mm
#>             <dbl>         <dbl>             <dbl>
#>  1           39.1          18.7               181
#>  2           39.5          17.4               186
#>  3           40.3          18                 195
#>  4           NA            NA                  NA
#>  5           36.7          19.3               193
#>  6           39.3          20.6               190
#>  7           38.9          17.8               181
#>  8           39.2          19.6               195
#>  9           34.1          18.1               193
#> 10           42            20.2               190
#> # … with 334 more rows
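Without DB-side support, the same selection could instead be expanded at compile time when the schema is available. A rough Python sketch (the `select_matches` helper is hypothetical, not a tidyselect or PRQL API):

```python
import re

# Sketch: with the schema in hand, a compiler can expand a regex selector
# like matches("_mm$") into an explicit SELECT list at compile time, with no
# DB-side abstraction needed. `select_matches` is an illustrative name only.

def select_matches(columns, pattern):
    """Return the columns whose names match the regex (partial match)."""
    rx = re.compile(pattern)
    return [col for col in columns if rx.search(col)]

schema = ["species", "island", "bill_length_mm", "bill_depth_mm",
          "flipper_length_mm", "body_mass_g", "sex", "year"]
selected = select_matches(schema, "_mm$")
print("SELECT " + ", ".join(selected) + "\nFROM df")
```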

Update columns

We can update the selected columns.

df |>
  dplyr::mutate(dplyr::across(tidyselect::everything(), as.character))
#> # A tibble: 344 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex   year
#>    <chr>   <chr>     <chr>          <chr>         <chr>      <chr>   <chr> <chr>
#>  1 Adelie  Torgersen 39.1           18.7          181        3750    male  2007
#>  2 Adelie  Torgersen 39.5           17.4          186        3800    fema… 2007
#>  3 Adelie  Torgersen 40.3           18            195        3250    fema… 2007
#>  4 Adelie  Torgersen <NA>           <NA>          <NA>       <NA>    <NA>  2007
#>  5 Adelie  Torgersen 36.7           19.3          193        3450    fema… 2007
#>  6 Adelie  Torgersen 39.3           20.6          190        3650    male  2007
#>  7 Adelie  Torgersen 38.9           17.8          181        3625    fema… 2007
#>  8 Adelie  Torgersen 39.2           19.6          195        4675    male  2007
#>  9 Adelie  Torgersen 34.1           18.1          193        3475    <NA>  2007
#> 10 Adelie  Torgersen 42             20.2          190        4250    <NA>  2007
#> # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#> #   ²body_mass_g
dplyr::tbl(con, "df")  |>
  dplyr::mutate(dplyr::across(tidyselect::everything(), as.character)) |>
  dplyr::show_query()
#> <SQL>
#> SELECT
#>   CAST(species AS TEXT) AS species,
#>   CAST(island AS TEXT) AS island,
#>   CAST(bill_length_mm AS TEXT) AS bill_length_mm,
#>   CAST(bill_depth_mm AS TEXT) AS bill_depth_mm,
#>   CAST(flipper_length_mm AS TEXT) AS flipper_length_mm,
#>   CAST(body_mass_g AS TEXT) AS body_mass_g,
#>   CAST(sex AS TEXT) AS sex,
#>   CAST("year" AS TEXT) AS "year"
#> FROM df
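The query above is mechanical once the column names are known. A small Python sketch of generating it from a per-column cast template (`cast_all` is a hypothetical helper for illustration):

```python
# Sketch of generating the dbplyr query above mechanically, given the column
# names. `cast_all` is an illustrative name, not an existing API.

def cast_all(columns, sql_type="TEXT"):
    """Apply a cast per column, keeping each original name via an alias."""
    return [f"CAST({col} AS {sql_type}) AS {col}" for col in columns]

columns = ["species", "island", "sex", '"year"']
print("SELECT\n  " + ",\n  ".join(cast_all(columns)) + "\nFROM df")
```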

Unfortunately, it is difficult to control the resulting column names in DuckDB with COLUMNS(*).

duckdb::sql("FROM df SELECT COLUMNS(*)::VARCHAR", con) |>
  dplyr::as_tibble()
#> # A tibble: 344 × 8
#>    CAST(df.species AS …¹ CAST(…² CAST(…³ CAST(…⁴ CAST(…⁵ CAST(…⁶ CAST(…⁷ CAST(…⁸
#>    <chr>                 <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>
#>  1 Adelie                Torger… 39.1    18.7    181.0   3750.0  male    2007.0
#>  2 Adelie                Torger… 39.5    17.4    186.0   3800.0  female  2007.0
#>  3 Adelie                Torger… 40.3    18.0    195.0   3250.0  female  2007.0
#>  4 Adelie                Torger… <NA>    <NA>    <NA>    <NA>    <NA>    2007.0
#>  5 Adelie                Torger… 36.7    19.3    193.0   3450.0  female  2007.0
#>  6 Adelie                Torger… 39.3    20.6    190.0   3650.0  male    2007.0
#>  7 Adelie                Torger… 38.9    17.8    181.0   3625.0  female  2007.0
#>  8 Adelie                Torger… 39.2    19.6    195.0   4675.0  male    2007.0
#>  9 Adelie                Torger… 34.1    18.1    193.0   3475.0  <NA>    2007.0
#> 10 Adelie                Torger… 42.0    20.2    190.0   4250.0  <NA>    2007.0
#> # … with 334 more rows, and abbreviated variable names
#> #   ¹`CAST(df.species AS VARCHAR)`, ²`CAST(df.island AS VARCHAR)`,
#> #   ³`CAST(df.bill_length_mm AS VARCHAR)`,
#> #   ⁴`CAST(df.bill_depth_mm AS VARCHAR)`,
#> #   ⁵`CAST(df.flipper_length_mm AS VARCHAR)`,
#> #   ⁶`CAST(df.body_mass_g AS VARCHAR)`, ⁷`CAST(df.sex AS VARCHAR)`,
#> #   ⁸`CAST(df."year" AS VARCHAR)`

Select columns using lambda function

Note that tidyselect can also be used to select columns using lambda functions, but this is not supported by dbplyr.

c.f. https://news.ycombinator.com/item?id=30067462

hadley on Jan 25, 2022, on: PRQL – A proposal for a better SQL

From dbplyr experience, folks want to be able to do stuff like across(which(is.numeric), mean), which you can't do currently because dplyr doesn't know the column types (although it does maintain a list of the column names).

df |>
  dplyr::mutate(dplyr::across(tidyselect::where(is.numeric), \(x) dplyr::coalesce(x, 0L)))
#> # A tibble: 344 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
#>    <chr>   <chr>              <dbl>         <dbl>      <dbl>   <dbl> <chr> <dbl>
#>  1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
#>  2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
#>  3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
#>  4 Adelie  Torgersen            0             0            0       0 <NA>   2007
#>  5 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
#>  6 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
#>  7 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
#>  8 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
#>  9 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
#> 10 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
#> # … with 334 more rows, and abbreviated variable names ¹flipper_length_mm,
#> #   ²body_mass_g

max-sixty commented 1 year ago

Excellent write up @eitsupi !

~I'd split these sorts of functions — we can call them Abstraction Over Columns — into two broad categories:~

(Edit: I don't think these are that different on reflection)

I think the requirements are similar — either PRQL needs to know the column names itself, or the DB needs to support the abstraction. Without a DB abstraction, the SQL for Names is fairly concise, while the SQL for Values is quite verbose, since it requires listing a set of conditions.

Overall, I'm open to these. My take is that they're more complicated than standard PRQL because of those constraints. When we have DB cohesion, they'll be more powerful, since providing the input columns won't be necessary. My focus for the project would still be to make the basic language more robust (we're still finding occasional bugs, many of which remain open!), but there's no reason we can't do these in parallel.

What are others' thoughts?

snth commented 1 year ago

I quite like the way DuckDB does this with COLUMNS(*). It seems quite concise and practical. The main downside to me is that it's a bit magical and I would like whatever we come up with to be more explicit.

Not having thought about it too much yet, it seems to me that this veers into the territory of macros. We will probably need those anyway; I have been thinking about them in the context of functions generalising over idents (maybe that can be done differently, but that's how I've approached it so far).

I'd split these sorts of functions — we can call them Abstraction Over Columns — into two broad categories:

  • Abstracting over the column names — e.g. SELECT price_* from foo
  • Abstracting over column values — e.g. WHERE COLUMNS(*) IS NOT NULL

Are these really different? My understanding from what I've read about COLUMNS in the DuckDB docs is that this really behaves like a macro expansion. So assuming we have a table foo with many columns, including three named price_a, price_b and price_c, then something like `SELECT MIN(COLUMNS('price_.*')) FROM foo` would expand to `SELECT MIN(price_a), MIN(price_b), MIN(price_c) FROM foo`.

The part of this that seems a bit magical to me is: how does it know that the expansion stops at the MIN in the SELECT, and in the WHERE clause example, why are the conditions combined with AND and not OR?


I would propose implementing this with a macro named columns! using a ! to denote the macro as in Rust. Given our lack of parentheses, it's not clear to me how the compiler knows which parts to replicate. So the first example would be something like

select [min columns! price_*]

Is that clear enough or does it need parentheses?

select [min (columns! price_*)]

What if you have multiple functions?

select [columns! price_* | abs | min, columns! price_* | abs | max]

Could this be written better?

Looking at the second case

filter (columns! price_*) != null

I'm not sure about any of this. Just exploring some ideas really. Thoughts?

snth commented 1 year ago

I think the requirements are similar — either PRQL needs to know the column names itself, or the DB needs to support the abstraction.

Agree with this. Without knowledge of the schema, PRQL could only translate to the DuckDB syntax (or equivalent) where that's available. Once we are able to operate on a cached schema definition then we could do the column name expansions at compilation time.
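The two strategies above can be sketched in a few lines of Python. All names here, and the exact form of DuckDB's COLUMNS() regex argument, are assumptions for illustration: expand against a cached schema when one is available, otherwise emit the DB-side abstraction.

```python
import re

# Sketch of the two compilation strategies discussed in this thread:
# compile-time expansion against a cached schema, with a fallback to a
# DB-side abstraction such as DuckDB's COLUMNS(). Illustrative only.

def compile_selector(pattern, schema=None, dialect="duckdb"):
    if schema is not None:
        # Cached schema available: expand at compile time.
        rx = re.compile(pattern)
        return ", ".join(c for c in schema if rx.search(c))
    if dialect == "duckdb":
        # Delegate the abstraction to the database.
        return f"COLUMNS('{pattern}')"
    raise NotImplementedError("no schema and no DB-side column abstraction")

print(compile_selector("_mm$", schema=["bill_length_mm", "sex", "flipper_length_mm"]))
print(compile_selector("_mm$"))
```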

max-sixty commented 1 year ago

Are these really different?

No, on reflection, not really! I've updated my comment. Thanks.


What's the difference between a macro and a function? A function operates on a column, and a macro generates column names?

I would vote quite strongly against introducing a totally new language concept like macros given the current state of the language & project — adding this will add to the maintenance and complexity burden. (I'm a big +1 on things like the module system & type system, and I'm not saying "no new features" — but we should be balancing the cost of new features against our confidence that they're going to make the language more useful.)

aljazerzen commented 1 year ago

In Rust, the difference is that macros operate on the syntactic level, before any type checking or other semantic analysis. But this is already very close to our functions.

We may be able to get this working without new syntax:

This does require more parentheses and some knowledge of functional programming, but it is not magic.

For beginners, we could have a snippet to copy-paste / library / package that would contain useful stuff like:

func any_is_not_null cols -> (cols | map (x -> x != null) | any)
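A Python analogue of that functional composition, treating a column set as a list of SQL expression strings and folding map/any into a single predicate (helper names are illustrative, not an existing API):

```python
# Python analogue of `func any_is_not_null cols -> (cols | map (x -> x != null) | any)`,
# with columns represented as SQL expression strings. Illustrative only.

def map_cols(cols, f):
    # cols | map f
    return [f(c) for c in cols]

def any_pred(exprs):
    # ... | any : combine the per-column predicates with OR
    return "(" + " OR ".join(exprs) + ")"

def any_is_not_null(cols):
    # cols | map (x -> x != null) | any
    return any_pred(map_cols(cols, lambda x: f"{x} IS NOT NULL"))

print(any_is_not_null(["sex", "body_mass_g"]))
```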

eitsupi commented 1 year ago

FYI, I was reading the source code of Polars and noticed that there are STARTS_WITH and ENDS_WITH functions in Polars SQL. https://github.com/pola-rs/polars/blob/8c09c85c782dfde18e6c2cb0ce0644d433dd887c/polars/polars-sql/src/functions.rs#L104-L115

SELECT STARTS_WITH(column_1, 'a') from df;
SELECT column_2 from df WHERE ENDS_WITH(column_1, 'a');

aljazerzen commented 1 year ago

I'd guess that STARTS_WITH checks whether the string in column_1 starts with 'a'; it does not select all columns whose names start with a given prefix.

Am I missing something, why is this relevant?

eitsupi commented 1 year ago

Am I missing something, why is this relevant?

My apologies, my misunderstanding!