etiennebacher / tidypolars

Get the power of polars with the syntax of the tidyverse
https://tidypolars.etiennebacher.com

More support for `lubridate` #53

Open etiennebacher opened 10 months ago

etiennebacher commented 10 months ago

polars has tons of datetime functions (not all are supported in the R implementation for now), but I don't use lubridate enough to thoroughly test them (I don't have real workflows where I can check that they work as expected).

Some help on this would be greatly appreciated. The way to add support for new functions is a bit convoluted, and I should make that easier. Happy to help if someone wants to take a shot.

etiennebacher commented 9 months ago

If you come across this issue and want to help, take a look at https://tidypolars.etiennebacher.com/contributing#how-to-add-support-for-an-r-function-in-tidypolars

frankiethull commented 6 months ago

I second this!

As someone new to polars, I've found {tidypolars} very helpful and a great tool. I came here to ask for more lubridate support and saw the current issue and wanted to give a +1.

I was trying to use floor_date and make_date with no luck. Also, there isn't a lot of documentation on polars for time series in R, so I did my own workaround (for now). Not sure if anyone has any recommendations, but based on the r-polars vignette and the python user guide, I found this workflow the best for now.
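
For readers less familiar with the two lubridate functions mentioned above, here is a minimal sketch of what they return on plain R vectors (no polars involved). The `started_at` values are made up for illustration and are not taken from the citibike data.

```r
library(lubridate)

# Hypothetical sample timestamps, for illustration only
started_at <- as.POSIXct(
  c("2023-06-15 08:42:10", "2023-11-02 17:05:59"),
  tz = "UTC"
)

# make_date() assembles a Date from numeric year/month/day components
d <- make_date(
  year  = year(started_at),
  month = month(started_at),
  day   = mday(started_at)
)

# floor_date() rounds each timestamp down to the start of the given unit
h <- floor_date(started_at, unit = "hour")

print(d)
print(h)
```

These are the results a polars-based workaround would need to reproduce: a date column built from its parts, and a datetime truncated to the hour.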

initial test trying to make a date using paste0, too slow

# method 1, slow concatenation with paste:
tictoc::tic()
data_pl |>
  mutate(
    hour  = hour(started_at),
    month = month(started_at),
    year  = year(started_at),
    mday  = mday(started_at)
  ) |> 
  mutate( # already too slow w/o even converting str to datetime
    str_dt = paste0(year, "-", month, "-", mday, " ", hour)
  )
tictoc::toc() #6 secs

making a date with concat_str and to_date is way faster

# method 2, multi-step method mutate & with_column:
tictoc::tic()
chpt <- data_pl |>
  mutate( # each piece
    hour  = hour(started_at),
    month = month(started_at),
    year  = year(started_at),
    mday  = mday(started_at),
    dash = "-"
  ) 
chpt <- chpt$with_columns(
   # faster concatenate, needs a spacer [dash]:
    pl$concat_str("year", "dash", "month", "dash", "mday")$alias("x")
  )
# convert to date/datetime
chpt$with_columns(
  pl$col("x")$str$to_date("%Y-%m-%d")
)
tictoc::toc() #.4 secs

Note: this is on ~1 million rows of citibike data.

etiennebacher commented 6 months ago

Hi, thanks for your interest in tidypolars. I think the best matches for lubridate::make_date() and make_datetime() are pl.date() and pl.datetime() in python polars, but apparently they're not in r-polars yet.

As I said above, I don't have much time to dedicate to compatibility with lubridate as I rarely use datetime variables, but I'd be happy to review a PR, even an incomplete one. Let me know if you want to try making one and if you need some help. The link above should give most necessary info to get started, but feel free to ask here if I forgot something.


For reference, here's a small reprex for your example, with a shorter syntax for method 2:

library(tidypolars)
library(polars)
library(dplyr, warn.conflicts = FALSE)

foo <- pl$DataFrame(x = rep("2009-08-03 12:01:59", 1e6))$select(pl$col("x")$str$to_datetime())

foo2 <- foo |>
  mutate(
    hour  = hour(x),
    month = month(x),
    year  = year(x),
    mday  = mday(x)
  )

system.time({
  foo2 |> 
    mutate( 
      str_dt = paste0(year, "-", month, "-", mday)
    )
})
#>    user  system elapsed 
#>    1.18    0.05    1.25

system.time({
  foo2$with_columns(
    # eventually this should be replaced by `str_dt = pl$date("year", "month", "mday")`
    str_dt = pl$concat_str("year", pl$lit("-"), "month", pl$lit("-"), "mday")$str$to_date("%Y-%m-%d")
  )
})
#>    user  system elapsed 
#>    0.09    0.01    0.11

etiennebacher commented 6 months ago

> I think the best matches for lubridate::make_date() and make_datetime() are pl.date() and pl.datetime() in python polars, but apparently they're not in r-polars yet.

They are now available in the development version of r-polars and will be included in polars 0.16.0. Here's an example with 50M obs:

library(polars)

test <- pl$DataFrame(
  y = sample(2000:2019, 5*1e7, TRUE),
  m = sample(1:12, 5*1e7, TRUE),
  d = sample(1:31, 5*1e7, TRUE)
)

system.time({
  test$with_columns(
    date = pl$concat_str("y", pl$lit("-"), "m", pl$lit("-"), "d")$str$to_date("%Y-%m-%d", strict = FALSE)
  )$print()
})
#> shape: (50_000_000, 4)
#> ┌──────┬─────┬─────┬────────────┐
#> │ y    ┆ m   ┆ d   ┆ date       │
#> │ ---  ┆ --- ┆ --- ┆ ---        │
#> │ i32  ┆ i32 ┆ i32 ┆ date       │
#> ╞══════╪═════╪═════╪════════════╡
#> │ 2011 ┆ 10  ┆ 22  ┆ 2011-10-22 │
#> │ 2016 ┆ 6   ┆ 16  ┆ 2016-06-16 │
#> │ 2007 ┆ 4   ┆ 21  ┆ 2007-04-21 │
#> │ 2012 ┆ 2   ┆ 9   ┆ 2012-02-09 │
#> │ 2014 ┆ 11  ┆ 25  ┆ 2014-11-25 │
#> │ …    ┆ …   ┆ …   ┆ …          │
#> │ 2002 ┆ 3   ┆ 26  ┆ 2002-03-26 │
#> │ 2001 ┆ 1   ┆ 21  ┆ 2001-01-21 │
#> │ 2011 ┆ 12  ┆ 18  ┆ 2011-12-18 │
#> │ 2009 ┆ 9   ┆ 18  ┆ 2009-09-18 │
#> │ 2012 ┆ 5   ┆ 19  ┆ 2012-05-19 │
#> └──────┴─────┴─────┴────────────┘
#>    user  system elapsed 
#>    4.76    0.82    5.66

### NEW

system.time({
  test$with_columns(date = pl$date("y", "m", "d"))$print()
})
#> shape: (50_000_000, 4)
#> ┌──────┬─────┬─────┬────────────┐
#> │ y    ┆ m   ┆ d   ┆ date       │
#> │ ---  ┆ --- ┆ --- ┆ ---        │
#> │ i32  ┆ i32 ┆ i32 ┆ date       │
#> ╞══════╪═════╪═════╪════════════╡
#> │ 2011 ┆ 10  ┆ 22  ┆ 2011-10-22 │
#> │ 2016 ┆ 6   ┆ 16  ┆ 2016-06-16 │
#> │ 2007 ┆ 4   ┆ 21  ┆ 2007-04-21 │
#> │ 2012 ┆ 2   ┆ 9   ┆ 2012-02-09 │
#> │ 2014 ┆ 11  ┆ 25  ┆ 2014-11-25 │
#> │ …    ┆ …   ┆ …   ┆ …          │
#> │ 2002 ┆ 3   ┆ 26  ┆ 2002-03-26 │
#> │ 2001 ┆ 1   ┆ 21  ┆ 2001-01-21 │
#> │ 2011 ┆ 12  ┆ 18  ┆ 2011-12-18 │
#> │ 2009 ┆ 9   ┆ 18  ┆ 2009-09-18 │
#> │ 2012 ┆ 5   ┆ 19  ┆ 2012-05-19 │
#> └──────┴─────┴─────┴────────────┘
#>    user  system elapsed 
#>    2.64    0.41    3.06

system.time({
  test$with_columns(date = pl$datetime("y", "m", "d"))$print()
})
#> shape: (50_000_000, 4)
#> ┌──────┬─────┬─────┬─────────────────────┐
#> │ y    ┆ m   ┆ d   ┆ date                │
#> │ ---  ┆ --- ┆ --- ┆ ---                 │
#> │ i32  ┆ i32 ┆ i32 ┆ datetime[μs]        │
#> ╞══════╪═════╪═════╪═════════════════════╡
#> │ 2011 ┆ 10  ┆ 22  ┆ 2011-10-22 00:00:00 │
#> │ 2016 ┆ 6   ┆ 16  ┆ 2016-06-16 00:00:00 │
#> │ 2007 ┆ 4   ┆ 21  ┆ 2007-04-21 00:00:00 │
#> │ 2012 ┆ 2   ┆ 9   ┆ 2012-02-09 00:00:00 │
#> │ 2014 ┆ 11  ┆ 25  ┆ 2014-11-25 00:00:00 │
#> │ …    ┆ …   ┆ …   ┆ …                   │
#> │ 2002 ┆ 3   ┆ 26  ┆ 2002-03-26 00:00:00 │
#> │ 2001 ┆ 1   ┆ 21  ┆ 2001-01-21 00:00:00 │
#> │ 2011 ┆ 12  ┆ 18  ┆ 2011-12-18 00:00:00 │
#> │ 2009 ┆ 9   ┆ 18  ┆ 2009-09-18 00:00:00 │
#> │ 2012 ┆ 5   ┆ 19  ┆ 2012-05-19 00:00:00 │
#> └──────┴─────┴─────┴─────────────────────┘
#>    user  system elapsed 
#>    2.25    0.53    2.78
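
A side note on the benchmark data above: `d` is sampled from 1:31 independently of `m`, so some rows describe dates that do not exist (for example February 30). That is presumably why the string-parsing version passes `strict = FALSE`. The sketch below uses only base R, not polars, to count how many month/day combinations are impossible in a non-leap year; the year 2001 is an arbitrary choice for illustration.

```r
# All month/day combinations that sample(1:12) and sample(1:31) can produce,
# for one arbitrary non-leap year
grid <- expand.grid(y = 2001L, m = 1:12, d = 1:31)

# as.Date() returns NA for strings that are not valid calendar dates
parsed <- as.Date(
  sprintf("%d-%d-%d", grid$y, grid$m, grid$d),
  format = "%Y-%m-%d"
)

n_invalid <- sum(is.na(parsed))
n_total   <- nrow(grid)

# Impossible combinations out of all combinations
print(c(invalid = n_invalid, total = n_total))
```

So a small share of the 50M generated rows cannot be turned into a real date, whichever method is used.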

frankiethull commented 6 months ago

first of all, thank you for your modification of method 2 with lit()! Using a column called "dash" felt hackish but was the only way I could figure this out. I knew I was missing something related to the polars interface.

second, I had not thought about checking the development version. The new support for $date and $datetime is mainly what I am after! This is great to hear.

lastly, I still give this ticket a +1 for more support for lubridate, but don't think I'm ready for a PR on it. The help you gave me is exactly what I am after for now

etiennebacher commented 6 months ago

> second, I had not thought about checking the development version.

Even if you did, I only added it in polars because you participated here 😉

> lastly, I still give this ticket a +1 for more support for lubridate, but don't think I'm ready for a PR on it. The help you gave me is exactly what I am after for now

Once polars 0.16.0 is out, I'll make a PR to add support for make_date() and make_datetime(). I'll try to make it as clear as possible so that other people can imitate it to add support for other functions.

etiennebacher commented 5 months ago

@frankiethull I have added support for make_date() in #108. As you can see here, only 3 lines were needed to add support, and the rest is only testing (there's one small change in the internals, but that's not something you'd have to implement yourself). Of course, that doesn't mean it's always so easy, but most of the time it shouldn't take too long.

If you want to try to implement some lubridate function, I'd be happy to provide some guidance (but you can already take a look at the required steps here: https://tidypolars.etiennebacher.com/contributing#how-to-add-support-for-an-r-function-in-tidypolars).
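
For anyone landing here later, user-facing code with make_date() would look roughly like the sketch below. It assumes a tidypolars version that includes the change from #108 and an r-polars version that provides `pl$date()` (0.16.0 or later, per the comments above). The three-row data frame is made up for illustration, and the exact printed output may differ between versions.

```r
library(polars)
library(tidypolars)
library(dplyr, warn.conflicts = FALSE)

# Small made-up example with the same column layout as the benchmark above
test <- pl$DataFrame(
  y = c(2011L, 2016L, 2007L),
  m = c(10L, 6L, 4L),
  d = c(22L, 16L, 21L)
)

# make_date() is written as in lubridate; tidypolars translates it to polars
out <- test |>
  mutate(date = make_date(y, m, d))

print(out)
```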