Open sorhawell opened 1 year ago
This variant does not use <Expr>$map, which would allow running one R expr in parallel with any polars exprs simultaneously. I should do that.
Edit: OK, that would require a new tailored <Expr>$map function in polars that caches converted columns. It would then be better to solve this together with zero-copy conversion.
Thank you!
It would be possible to use arrow c_datainterface + R altrep to avoid the polars->R conversion ... maybe also the R->polars conversion
That's beyond my knowledge, I'll leave that to you if you want to explore (but developing r-polars is already a lot of work so no pressure)
this variant does not use $map, which would allow running one R expr in parallel with any polars exprs simultaneously. I should do that.
Is there a reason to prefer $map over $apply?
Mimicking mutate() and summarize() without losing too much in efficiency is probably gonna be tricky for me as I'm not comfortable enough with all the internals about parallelization and other things. For now I'll try to cover most of the other functions that apply on the full data and not on specific columns (although mutate() is probably the most important function so it should be tackled at some point)
Is there a reason to prefer $map over $apply?
99% of the time one should pick $map in select contexts and $apply in GroupBy contexts. $apply in a select context works like a scalar lapply over each value, with double the overhead of lapply. $map in a GroupBy context ignores the grouping, so $apply should be used there.
Of the four methods, I would rename the two right ones to map() and the two wrong ones to map_dont_ever_use_me() :)
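To make the $map / $apply distinction above concrete, here is a minimal base-R analogy. It makes no polars calls: the names center, col, and groups are made up for illustration, and it only assumes that a map-style call hands the function a whole column at once while an apply-style call in a select context invokes it once per value.

```r
# Base-R analogy only; no polars functions are called here.
calls <- 0L
center <- function(x) {
  calls <<- calls + 1L  # count how often the R function is invoked
  x - 10
}
col <- c(8, 10, 12)

# map-like: a single call receives the whole column as one vector
calls <- 0L
map_like <- center(col)
map_calls <- calls

# apply-like in a select context: one call per value, as with lapply()
calls <- 0L
apply_like <- unlist(lapply(col, center))
apply_calls <- calls

# apply-like in a GroupBy context: one call per group
groups <- c("a", "a", "b")
by_group <- vapply(split(col, groups), mean, numeric(1))
```

Both routes give the same values, but the per-value route calls the R function once for every element, which is where the extra overhead comes from.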
Rust-polars recently added a feature that detects if apply() was used and proposes a replacement with Polars expressions (if possible). It would be cool to do the same with user-made functions but it sounds quite challenging (basically I'd need to parse custom R functions, which can be very long, and rewrite them correctly with Polars).
Probably a more realistic way is to encourage people to write their custom functions directly in Polars expressions. Then I could check whether the function returns a Polars expression and warn the user if it doesn't.
That said, it's gonna introduce some ambiguity because:
Example for writing functions with Polars expressions:
foo <- function(x, y) {
  tmp <- polars::pl$mean(x)
  tmp2 <- polars::pl$mean(y)
  tmp + tmp2
}
foo("a", "b")
#> polars Expr: [(col("a").mean()) + (col("b").mean())]
class(foo("a", "b"))
#> [1] "Expr"
polars::pl$DataFrame(mtcars)$groupby("am")$agg(
  foo("drat", "mpg")$alias("test")
)
#> shape: (2, 2)
#> ┌─────┬───────────┐
#> │ am  ┆ test      │
#> │ --- ┆ ---       │
#> │ f64 ┆ f64       │
#> ╞═════╪═══════════╡
#> │ 1.0 ┆ 28.442308 │
#> │ 0.0 ┆ 20.433684 │
#> └─────┴───────────┘
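The check mentioned earlier (testing whether a custom function returns a Polars expression and warning if it doesn't) could look roughly like the sketch below. check_returns_expr() is a hypothetical helper, not an existing tidypolars or polars function, and it assumes the expression class is named "Expr" as in the class() output above. A stand-in object is used so the sketch runs without polars installed.

```r
# Hypothetical helper; not part of tidypolars or polars.
# Assumes Polars expressions carry the S3 class "Expr".
check_returns_expr <- function(result, fn_name = "custom function") {
  is_expr <- inherits(result, "Expr")
  if (!is_expr) {
    warning(
      fn_name, " did not return a Polars expression.",
      call. = FALSE
    )
  }
  invisible(is_expr)
}

# Stand-in for a real expression, used only for illustration
fake_expr <- structure(list(), class = "Expr")

ok <- check_returns_expr(fake_expr, "foo")                       # no warning
not_ok <- suppressWarnings(check_returns_expr(42, "foo_dplyr"))  # would warn
```

This only inspects the returned value, so it cannot tell why a function failed to return an expression.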
Now possible: use custom functions that return a Polars expression:
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)
foo <- function(x, y, z) {
  tmp <- x$mean() + y$mean()
  tmp / z$sum()
}
foo_dplyr <- function(x, y, z) {
  tmp <- mean(x) + mean(y)
  tmp / sum(z)
}
large_iris <- data.table::rbindlist(rep(list(iris), 100000))
large_iris_pl <- as_polars(large_iris)
bench::mark(
  dplyr = large_iris |>
    group_by(Species) |>
    mutate(foo = foo_dplyr(Sepal.Length, Sepal.Width, Petal.Length)),
  tidypolars = large_iris_pl |>
    group_by(Species) |>
    mutate(foo = foo(Sepal.Length, Sepal.Width, Petal.Length)),
  iterations = 10,
  check = FALSE
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr         877ms    877ms      1.14  689.72MB     14.8
#> 2 tidypolars    152ms    174ms      5.72    2.65MB      0
Hey @etiennebacher I really like tidypolars and how it integrates with polars! Very smart.
I was thinking it could be possible to also allow pure R syntax, with some performance loss. Sometimes a user cannot figure out how to do something in polars and the performance does not matter for that step.
Here it is likely slower than dplyr, as the columns used must first be converted from polars to R vectors, and the output then converted back to polars. It would be possible to use arrow c_datainterface + R altrep to avoid the polars->R conversion ... maybe also the R->polars conversion. Then monkey_mutate would be just as fast as dplyr.
Created on 2023-06-09 with reprex v2.0.2