Allow pure R expression

sorhawell commented 1 year ago

Hey @etiennebacher I really like tidypolars and how it integrates with polars! Very smart.

I was thinking it could be possible to also allow pure R syntax, at the cost of some performance. Sometimes a user cannot figure out how to do something in polars, and performance does not matter for that step.

Here it is likely slower than dplyr, as the columns used must first be converted from polars to R vectors, and the output then converted back to polars. It would be possible to use the Arrow C data interface + R ALTREP to avoid the polars->R conversion ... maybe also the R->polars conversion. Then monkey_mutate would be just as fast as dplyr.

``` r
library(polars)
library(tidypolars)

pl_monkey_mutate <- function(data, ...) {

  tidypolars:::check_polars_data(data)

  dots <- tidypolars:::get_dots(...)

  exprs <- lapply(seq_along(dots), \(x) {
    # need to output several things:
    # - the name of the new (or modified var)
    # - the name of the variables used in the computation
    # - the name of the polars functions used in the computation

    x_expr <- dots[[x]]
    var_name <- names(dots)[x]
    deparsed <- deparse(x_expr)

    vars_used <- unlist(lapply(x_expr, as.character))
    vars_used <- unique(vars_used[which(vars_used %in% pl_colnames(data))])

    pl_funs <- regmatches(deparsed, gregexpr("pl\\_\\w+", deparsed))

    list(var_name = var_name, vars_used = vars_used, pl_funs = pl_funs,
         call = deparsed)
  })

  # expr with struct of all used columns
  expr_struct_vars_used <-
    lapply(exprs, \(x) x$vars_used) |>
    unlist() |>
    intersect(data$columns) |>
    lapply(pl$col) |>
    pl$struct()

  # instantiate the columns in R and insert them in an environment that sits
  # between the caller's frame and the evaluated expression
  data_context <- as.environment(
    data$select(expr_struct_vars_used)$to_list()[[1]]
  )
  data_context$.exprs <- exprs
  parent.env(data_context) <- parent.frame()
  data$with_columns(
    with(
      data_context,
      lapply(.exprs, \(x) pl$lit(eval(parse(text = x$call)))$alias(x$var_name))
    )
  )

}

# random variables in the caller's frame
foo <- "bar"
baz <- 3

# build columns using only R code
df <- pl$DataFrame(iris) |>
  pl_monkey_mutate(
    Species2 = paste(Species, foo, Sepal.Length),
    Sepal.Long = Sepal.Width * baz
  ) |>
  pl_select(Species2, Sepal.Long)

print(df)
#> shape: (150, 2)
#> ┌───────────────────┬────────────┐
#> │ Species2          ┆ Sepal.Long │
#> │ ---               ┆ ---        │
#> │ str               ┆ f64        │
#> ╞═══════════════════╪════════════╡
#> │ setosa bar 5.1    ┆ 10.5       │
#> │ setosa bar 4.9    ┆ 9.0        │
#> │ setosa bar 4.7    ┆ 9.6        │
#> │ setosa bar 4.6    ┆ 9.3        │
#> │ …                 ┆ …          │
#> │ virginica bar 6.3 ┆ 7.5        │
#> │ virginica bar 6.5 ┆ 9.0        │
#> │ virginica bar 6.2 ┆ 10.2       │
#> │ virginica bar 5.9 ┆ 9.0        │
#> └───────────────────┴────────────┘
```

Created on 2023-06-09 with reprex v2.0.2
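
The scoping trick in pl_monkey_mutate() (materialise the needed columns in an environment placed between the caller's frame and the evaluated expression) can be illustrated without polars. The following is a minimal base-R sketch, assuming a plain data.frame in place of the polars DataFrame; `monkey_mutate_df()` is a hypothetical name used only for illustration and is not part of polars or tidypolars.

``` r
# Hypothetical base-R analogue of pl_monkey_mutate(); a data.frame stands in
# for the polars DataFrame, so no polars <-> R conversion happens here.
monkey_mutate_df <- function(data, ...) {
  # capture the expressions unevaluated, keeping their names
  dots <- eval(substitute(alist(...)))

  # columns live in an environment whose parent is the caller's frame,
  # so column names are found first and caller variables second
  caller <- parent.frame()
  ctx <- list2env(as.list(data), parent = caller)

  for (nm in names(dots)) {
    data[[nm]] <- eval(dots[[nm]], envir = ctx)
  }
  data
}

foo <- "bar"
baz <- 3

out <- monkey_mutate_df(
  head(iris, 3),
  Species2 = paste(Species, foo, Sepal.Length),
  Sepal.Long = Sepal.Width * baz
)
print(out[, c("Species2", "Sepal.Long")])
```

In this sketch each expression only sees the original columns, not columns created by an earlier expression in the same call, which mirrors how the version above evaluates everything against one snapshot of the data.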

sorhawell commented 1 year ago

This variant does not use <Expr>$map, which would allow running an R expression in parallel with any polars expressions. I should do that.

Edit: OK, that would require a new, tailored <Expr>$map function in polars that caches converted columns. It would then be better to solve this together with zero-copy conversion.

etiennebacher commented 1 year ago

Thank you!

> It would be possible to use arrow c_datainterface + R altrep to avoid the polars->R conversion ... maybe also the R->polars conversion

That's beyond my knowledge, so I'll leave that to you if you want to explore (but developing r-polars is already a lot of work, so no pressure).

> this variant does not use $map which would allow running one R expr in parallel with any polars exprs simultaneously. I should do that.

Is there a reason to prefer $map over $apply?

Mimicking mutate() and summarize() without losing too much efficiency is probably gonna be tricky for me, as I'm not comfortable enough with all the internals around parallelization and other things. For now I'll try to cover most of the other functions that apply to the full data and not to specific columns (although mutate() is probably the most important function, so it should be tackled at some point).

sorhawell commented 1 year ago

> Is there a reason to prefer $map over $apply?

99% of the time one should pick $map in select contexts and $apply in GroupBy contexts. $apply in a select context is like a scalar lapply() over each value, with double the overhead of lapply(). $map in a GroupBy context ignores the grouping, so $apply should be used there.

Of the four methods, I would rename the two right ones to map() and the two wrong ones to map_dont_ever_use_me() :)
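
The difference between the four combinations can be pictured with a base-R analogy. This is only a sketch of the calling patterns described above, assuming plain vectors in place of polars columns; it makes no polars calls, and the variable names are illustrative.

``` r
# Base-R analogy only: x plays the role of a column, g of a grouping column.
x <- c(1, 2, 4)
g <- c("a", "a", "b")
double_it <- function(v) v * 2

# "$map in select": the function is called once with the whole column
map_like <- double_it(x)

# "$apply in select": the function is called once per value, like lapply()
apply_like <- unlist(lapply(x, double_it))

# "$apply in GroupBy": the function is called once per group
grouped_sum <- vapply(split(x, g), sum, numeric(1))

# "$map in GroupBy": the grouping is ignored and the whole column is passed
ungrouped_sum <- sum(x)

print(identical(map_like, apply_like))
print(grouped_sum)
print(ungrouped_sum)
```

The first two give the same values and differ only in the number of function calls, which is where the per-value overhead comes from. The last two give different results, which is why ignoring the grouping is a correctness problem and not just a speed problem.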

etiennebacher commented 1 year ago

Rust-polars recently added a feature that detects if apply() was used and proposes a replacement with Polars expressions (if possible). It would be cool to do the same with user-made functions but it sounds quite challenging (basically I'd need to parse custom R functions, which can be very long, and rewrite them correctly with Polars).

Probably a more realistic way is to encourage people to write their custom functions directly in Polars expressions. Then I could check whether the function returns a Polars expression and warn the user if it doesn't.

That said, it's gonna introduce some ambiguity because:

- I encourage users to keep the classic R syntax for the expressions in mutate/summarize
- I tell them that if they want to use custom functions in mutate/summarize, then they should write them with Polars syntax (which will be very new to people)

Example for writing functions with Polars expressions:

``` r
foo <- function(x, y) {
  tmp <- polars::pl$mean(x)
  tmp2 <- polars::pl$mean(y)
  tmp + tmp2
}

foo("a", "b")
#> polars Expr: [(col("a").mean()) + (col("b").mean())]

class(foo("a", "b"))
#> [1] "Expr"

polars::pl$DataFrame(mtcars)$groupby("am")$agg(
  foo("drat", "mpg")$alias("test")
)
#> shape: (2, 2)
#> ┌─────┬───────────┐
#> │ am  ┆ test      │
#> │ --- ┆ ---       │
#> │ f64 ┆ f64       │
#> ╞═════╪═══════════╡
#> │ 1.0 ┆ 28.442308 │
#> │ 0.0 ┆ 20.433684 │
#> └─────┴───────────┘
```
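
The idea of checking whether a custom function returns a Polars expression, and warning the user if it doesn't, could be sketched as follows. This is a minimal illustration, not tidypolars code: `warn_if_not_expr()` is a hypothetical helper, the class name "Expr" is taken from the output above and may differ in other polars versions, and a dummy object stands in for a real expression so that the sketch runs without polars installed.

``` r
# Hypothetical helper: warns when `result` is not a Polars expression.
# Assumes Polars expressions carry the S3 class "Expr", as printed above.
warn_if_not_expr <- function(result, fn_name) {
  is_expr <- inherits(result, "Expr")
  if (!is_expr) {
    warning(
      sprintf("`%s()` did not return a Polars expression.", fn_name),
      call. = FALSE
    )
  }
  invisible(is_expr)
}

# Stand-in for a real Polars expression (assumption: only the class matters)
fake_expr <- structure(list(), class = "Expr")

ok <- warn_if_not_expr(fake_expr, "foo")
bad <- suppressWarnings(warn_if_not_expr(mean(1:3), "foo_dplyr"))

print(ok)
print(bad)
```

A check like this can only run after the function has been called, so it catches a pure-R function by its return value and does not inspect the function body.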

etiennebacher commented 1 year ago

Now possible: use custom functions that return a Polars expression:

``` r
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

foo <- function(x, y, z) {
  tmp <- x$mean() + y$mean()
  tmp / z$sum()
}

foo_dplyr <- function(x, y, z) {
  tmp <- mean(x) + mean(y)
  tmp / sum(z)
}

large_iris <- data.table::rbindlist(rep(list(iris), 100000))
large_iris_pl <- as_polars(large_iris)

bench::mark(
  dplyr = large_iris |>
    group_by(Species) |>
    mutate(foo = foo_dplyr(Sepal.Length, Sepal.Width, Petal.Length)),
  tidypolars = large_iris_pl |>
    group_by(Species) |>
    mutate(foo = foo(Sepal.Length, Sepal.Width, Petal.Length)),
  iterations = 10,
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr         877ms    877ms      1.14  689.72MB     14.8
#> 2 tidypolars    152ms    174ms      5.72    2.65MB      0
```

etiennebacher / tidypolars

Allow pure R expression #4