Open etiennebacher opened 10 months ago
If you come across this issue and want to help, take a look at https://tidypolars.etiennebacher.com/contributing#how-to-add-support-for-an-r-function-in-tidypolars
I second this!
As someone new to polars, I've found {tidypolars} very helpful and a great tool. I came here to ask for more lubridate support and saw the current issue and wanted to give a +1.
I was trying to use floor_date and make_date with no luck. Also, there isn't a lot of documentation on polars for time series in R, so I did my own workaround (for now). Not sure if anyone has any recommendations, but based on the r-polars vignette and the Python user guide, I found this workflow the best for now.
# method 1, slow concatenation with paste:
# (assumes polars and tidypolars are loaded, and that data_pl is a polars
# DataFrame with a datetime column started_at)
tictoc::tic()
data_pl |>
  mutate(
    hour = hour(started_at),
    month = month(started_at),
    year = year(started_at),
    mday = mday(started_at)
  ) |>
  mutate( # already too slow w/o even converting str to datetime
    str_dt = paste0(year, "-", month, "-", mday, " ", hour)
  )
tictoc::toc() # 6 secs
# method 2, multi-step method mutate & with_column:
tictoc::tic()
chpt <- data_pl |>
  mutate( # each piece
    hour = hour(started_at),
    month = month(started_at),
    year = year(started_at),
    mday = mday(started_at),
    dash = "-"
  )
chpt <- chpt$with_columns(
  # faster concatenate, needs a spacer [dash]:
  pl$concat_str("year", "dash", "month", "dash", "mday")$alias("x")
)
# convert to date/datetime
chpt$with_columns(
  pl$col("x")$str$to_date("%Y-%m-%d")
)
tictoc::toc() # .4 secs
Note this is on ~1 million rows of citibike data.
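For comparison, here is what the two lubridate functions mentioned above do on a plain R vector, which is the behavior the workaround is trying to reproduce. This is only a sketch: the timestamp is a made-up value (not from the citibike data) and it runs on a regular POSIXct, not on a polars DataFrame.

```r
library(lubridate)

# Hypothetical sample value, not taken from the citibike data
started_at <- as.POSIXct("2009-08-03 12:01:59", tz = "UTC")

# Rebuild a Date from its components, with no string concatenation
d <- make_date(year(started_at), month(started_at), mday(started_at))

# Round down to the start of the hour and to the start of the month
h <- floor_date(started_at, unit = "hour")
m <- floor_date(started_at, unit = "month")
```

Here d is the Date 2009-08-03, h is 2009-08-03 12:00:00 and m is 2009-08-01.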
Hi, thanks for your interest in tidypolars. I think the best matches for lubridate::make_date() and make_datetime() are pl.date() and pl.datetime() in python polars, but apparently they're not in r-polars yet.
As I said above, I don't have much time to dedicate to compatibility with lubridate as I rarely use datetime variables, but I'd be happy to review a PR, even an incomplete one. Let me know if you want to try making one and if you need some help. The link above should give most necessary info to get started, but feel free to ask here if I forgot something.
For reference, here's a small reprex for your example, with a shorter syntax for method 2:
library(tidypolars)
library(polars)
library(dplyr, warn.conflicts = FALSE)

foo <- pl$DataFrame(x = rep("2009-08-03 12:01:59", 1e6))$
  select(pl$col("x")$str$to_datetime())

foo2 <- foo |>
  mutate(
    hour = hour(x),
    month = month(x),
    year = year(x),
    mday = mday(x)
  )

system.time({
  foo2 |>
    mutate(
      str_dt = paste0(year, "-", month, "-", mday)
    )
})
#>    user  system elapsed
#>    1.18    0.05    1.25

system.time({
  foo2$with_columns(
    # eventually this should be replaced by `str_dt = pl$date("year", "month", "mday")`
    str_dt = pl$concat_str("year", pl$lit("-"), "month", pl$lit("-"), "mday")$str$to_date("%Y-%m-%d")
  )
})
#>    user  system elapsed
#>    0.09    0.01    0.11
> I think the best matches for lubridate::make_date() and make_datetime() are pl.date() and pl.datetime() in python polars, but apparently they're not in r-polars yet.
They are now available in the development version of r-polars and will be included in polars 0.16.0. Here's an example with 50M obs:
library(polars)

test <- pl$DataFrame(
  y = sample(2000:2019, 5*1e7, TRUE),
  m = sample(1:12, 5*1e7, TRUE),
  d = sample(1:31, 5*1e7, TRUE)
)

system.time({
  test$with_columns(
    date = pl$concat_str("y", pl$lit("-"), "m", pl$lit("-"), "d")$str$to_date("%Y-%m-%d", strict = FALSE)
  )$print()
})
#> shape: (50_000_000, 4)
#> ┌──────┬─────┬─────┬────────────┐
#> │ y    ┆ m   ┆ d   ┆ date       │
#> │ ---  ┆ --- ┆ --- ┆ ---        │
#> │ i32  ┆ i32 ┆ i32 ┆ date       │
#> ╞══════╪═════╪═════╪════════════╡
#> │ 2011 ┆ 10  ┆ 22  ┆ 2011-10-22 │
#> │ 2016 ┆ 6   ┆ 16  ┆ 2016-06-16 │
#> │ 2007 ┆ 4   ┆ 21  ┆ 2007-04-21 │
#> │ 2012 ┆ 2   ┆ 9   ┆ 2012-02-09 │
#> │ 2014 ┆ 11  ┆ 25  ┆ 2014-11-25 │
#> │ …    ┆ …   ┆ …   ┆ …          │
#> │ 2002 ┆ 3   ┆ 26  ┆ 2002-03-26 │
#> │ 2001 ┆ 1   ┆ 21  ┆ 2001-01-21 │
#> │ 2011 ┆ 12  ┆ 18  ┆ 2011-12-18 │
#> │ 2009 ┆ 9   ┆ 18  ┆ 2009-09-18 │
#> │ 2012 ┆ 5   ┆ 19  ┆ 2012-05-19 │
#> └──────┴─────┴─────┴────────────┘
#>    user  system elapsed
#>    4.76    0.82    5.66
### NEW
system.time({
  test$with_columns(date = pl$date("y", "m", "d"))$print()
})
#> shape: (50_000_000, 4)
#> ┌──────┬─────┬─────┬────────────┐
#> │ y    ┆ m   ┆ d   ┆ date       │
#> │ ---  ┆ --- ┆ --- ┆ ---        │
#> │ i32  ┆ i32 ┆ i32 ┆ date       │
#> ╞══════╪═════╪═════╪════════════╡
#> │ 2011 ┆ 10  ┆ 22  ┆ 2011-10-22 │
#> │ 2016 ┆ 6   ┆ 16  ┆ 2016-06-16 │
#> │ 2007 ┆ 4   ┆ 21  ┆ 2007-04-21 │
#> │ 2012 ┆ 2   ┆ 9   ┆ 2012-02-09 │
#> │ 2014 ┆ 11  ┆ 25  ┆ 2014-11-25 │
#> │ …    ┆ …   ┆ …   ┆ …          │
#> │ 2002 ┆ 3   ┆ 26  ┆ 2002-03-26 │
#> │ 2001 ┆ 1   ┆ 21  ┆ 2001-01-21 │
#> │ 2011 ┆ 12  ┆ 18  ┆ 2011-12-18 │
#> │ 2009 ┆ 9   ┆ 18  ┆ 2009-09-18 │
#> │ 2012 ┆ 5   ┆ 19  ┆ 2012-05-19 │
#> └──────┴─────┴─────┴────────────┘
#>    user  system elapsed
#>    2.64    0.41    3.06
system.time({
  test$with_columns(date = pl$datetime("y", "m", "d"))$print()
})
#> shape: (50_000_000, 4)
#> ┌──────┬─────┬─────┬─────────────────────┐
#> │ y    ┆ m   ┆ d   ┆ date                │
#> │ ---  ┆ --- ┆ --- ┆ ---                 │
#> │ i32  ┆ i32 ┆ i32 ┆ datetime[μs]        │
#> ╞══════╪═════╪═════╪═════════════════════╡
#> │ 2011 ┆ 10  ┆ 22  ┆ 2011-10-22 00:00:00 │
#> │ 2016 ┆ 6   ┆ 16  ┆ 2016-06-16 00:00:00 │
#> │ 2007 ┆ 4   ┆ 21  ┆ 2007-04-21 00:00:00 │
#> │ 2012 ┆ 2   ┆ 9   ┆ 2012-02-09 00:00:00 │
#> │ 2014 ┆ 11  ┆ 25  ┆ 2014-11-25 00:00:00 │
#> │ …    ┆ …   ┆ …   ┆ …                   │
#> │ 2002 ┆ 3   ┆ 26  ┆ 2002-03-26 00:00:00 │
#> │ 2001 ┆ 1   ┆ 21  ┆ 2001-01-21 00:00:00 │
#> │ 2011 ┆ 12  ┆ 18  ┆ 2011-12-18 00:00:00 │
#> │ 2009 ┆ 9   ┆ 18  ┆ 2009-09-18 00:00:00 │
#> │ 2012 ┆ 5   ┆ 19  ┆ 2012-05-19 00:00:00 │
#> └──────┴─────┴─────┴─────────────────────┘
#>    user  system elapsed
#>    2.25    0.53    2.78
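A note on the strict = FALSE in the concat_str() version: since d is sampled from 1:31, some rows are presumably impossible dates (e.g. 30 February), and with strict = FALSE a string that fails to parse becomes null instead of raising an error. A rough base R analogue of that behavior, with made-up values and no polars involved:

```r
# Hypothetical components; the second row is not a real calendar date
y <- c(2011L, 2011L)
m <- c(10L, 2L)
d <- c(22L, 30L)

# Same idea as concat_str() followed by to_date(): paste, then parse
parsed <- as.Date(paste(y, m, d, sep = "-"), format = "%Y-%m-%d")

# The invalid combination becomes NA instead of raising an error
is.na(parsed)
```

This only illustrates the string-parsing route; it makes no claim about how pl$date() or pl$datetime() treat such rows.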
first of all, thank you for your modification of method 2 with lit()! Using a column called "dash" felt hackish but was the only way I could figure this out. I knew I was missing something related to the polars interface.
second, I had not thought about checking the development version. The new support for $date and $datetime is mainly what I am after! This is great to hear.
lastly, I still give this ticket a +1 for more support for lubridate, but don't think I'm ready for a PR on it. The help you gave me is exactly what I am after for now
> second, I had not thought about checking the development version.
Even if you did, I only added it in polars because you participated here 😉
> lastly, I still give this ticket a +1 for more support for lubridate, but don't think I'm ready for a PR on it. The help you gave me is exactly what I am after for now
Once polars 0.16.0 is out, I'll make a PR to add support for make_date() and make_datetime(). I'll try to make it as clear as possible so that other people can imitate it to add support for other functions.
@frankiethull I have added support for make_date() in #108. As you can see here, only 3 lines were needed to add support, and the rest is only testing (there's one small change in the internals but that's not something you'd have to implement yourself). Of course, that doesn't mean it's always so easy, but most of the time it shouldn't be too long.
If you want to try to implement some lubridate function, I'd be happy to provide some guidance (but you can already take a look at the required steps here: https://tidypolars.etiennebacher.com/contributing#how-to-add-support-for-an-r-function-in-tidypolars).
polars has tons of datetime functions (not all are supported in the R implementation for now) but I don't use lubridate enough to thoroughly test them (I don't have real workflows where I can test that they work as expected). Some help on this would be greatly appreciated. The way to add support for new functions is a bit convoluted, I should make that easier, happy to help if someone wants to take a shot.