etiennebacher / tidypolars

Get the power of polars with the syntax of the tidyverse
https://tidypolars.etiennebacher.com

Make `tidyselect` faster #61

Closed by etiennebacher 12 months ago

etiennebacher commented 12 months ago

Instead of collecting a 1-row slice, use the schema to create an empty DataFrame with the same columns and types, and pass that to tidyselect. This improves performance whenever several functions are chained, not just for select().
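
The idea can be sketched outside of polars. This is a minimal illustration, not the code in this PR: `schema` is a hypothetical named character vector standing in for the real polars schema, and `empty_from_schema()` is a made-up helper. It only shows that tidyselect can resolve a selection against a zero-row data frame, so no rows need to be collected.

```r
library(tidyselect)

# Hypothetical stand-in for a polars schema: column names mapped to R types.
# In tidypolars the schema would come from the DataFrame/LazyFrame itself.
schema <- c(
  Sepal.Length = "double",
  Sepal.Width  = "double",
  Petal.Length = "double",
  Petal.Width  = "double",
  Species      = "character"
)

# Hypothetical helper: build a zero-row data frame with the same column
# names and types as the schema. No data is read to do this.
empty_from_schema <- function(schema) {
  cols <- lapply(schema, function(type) vector(mode = type, length = 0))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

empty_df <- empty_from_schema(schema)

# tidyselect only needs column names and types, so the empty frame is enough
# to resolve the selection into column positions.
selected <- eval_select(
  rlang::expr(starts_with(c("Sep", "Pet"))),
  data = empty_df
)

names(selected)
```

Since the empty frame is built from metadata only, its cost should not depend on the number of rows in the underlying data, which is consistent with the gain showing up in every verb that goes through tidyselect.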

Benchmark

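# Setup is omitted in the original; this assumes polars, tidypolars, and dplyr are attached.
# Stack 50,000 copies of iris (7.5 million rows) and convert to a lazy polars object.
large_iris <- data.table::rbindlist(rep(list(iris), 50000))
test <- as_polars(large_iris, lazy = TRUE)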

bench::mark(
  starts_with = test |>
    select(starts_with(c("Sep", "Pet"))) |>
    mutate(
      petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
    ) |> 
    filter(between(Sepal.Length, 4.5, 5.5)) |> 
    collect(),
  iterations = 20,
  check = FALSE
) |>
  # Keep the expression plus columns 3 to 5: median, itr/sec, mem_alloc.
  dplyr::select(expression, 3:5)

Before:

# A tibble: 1 × 4
  expression    median `itr/sec` mem_alloc
  <bch:expr>  <bch:tm>     <dbl> <bch:byt>
1 starts_with    495ms      1.59    1.05MB

After:

# A tibble: 1 × 4
  expression    median `itr/sec` mem_alloc
  <bch:expr>  <bch:tm>     <dbl> <bch:byt>
1 starts_with    171ms      5.68     954KB