Closed etiennebacher closed 12 months ago
Instead of collecting a 1-row slice, use the schema to create an empty DataFrame with the same columns and types, and pass that to tidyselect. This improves performance when several functions are chained, not just for select().
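A minimal sketch of the idea, not the code from this PR: it assumes the schema is available as a named character vector of R types (a hypothetical stand-in for the real LazyFrame schema) and uses a hypothetical helper, `empty_from_schema()`, to build the 0-row data frame. tidyselect only needs column names and types, so no rows have to be collected.

```r
library(tidyselect)

# Hypothetical schema: column name -> R type.
# In the package this would be derived from the LazyFrame schema.
schema <- c(
  Sepal.Length = "numeric",
  Sepal.Width  = "numeric",
  Petal.Length = "numeric",
  Petal.Width  = "numeric",
  Species      = "character"
)

# Hypothetical helper: build a 0-row data.frame with one
# zero-length column of the right type per schema entry.
empty_from_schema <- function(schema) {
  cols <- lapply(schema, function(type) vector(type, length = 0))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

empty <- empty_from_schema(schema)

# Resolve the selection against the empty data frame.
# eval_select() returns a named integer vector of column positions.
selected <- eval_select(
  rlang::expr(tidyselect::starts_with(c("Sep", "Pet"))),
  data = empty
)
selected_names <- names(selected)
```

The resolved names in `selected_names` could then be handed to the lazy select step without touching the data itself.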
Benchmark
    large_iris <- data.table::rbindlist(rep(list(iris), 50000))
    test <- as_polars(large_iris, lazy = TRUE)

    bench::mark(
      starts_with = test |>
        select(starts_with(c("Sep", "Pet"))) |>
        mutate(
          petal_type = ifelse((Petal.Length / Petal.Width) > 3, "long", "large")
        ) |>
        filter(between(Sepal.Length, 4.5, 5.5)) |>
        collect(),
      iterations = 20,
      check = FALSE
    ) |>
      dplyr::select(expression, 3:5)
Before:
    # A tibble: 1 × 4
      expression    median `itr/sec` mem_alloc
      <bch:expr>  <bch:tm>     <dbl> <bch:byt>
    1 starts_with    495ms      1.59    1.05MB
After:
    # A tibble: 1 × 4
      expression    median `itr/sec` mem_alloc
      <bch:expr>  <bch:tm>     <dbl> <bch:byt>
    1 starts_with    171ms      5.68     954KB