etiennebacher / tidypolars

Get the power of polars with the syntax of the tidyverse
https://tidypolars.etiennebacher.com

Improve the dispatch from R function to polars translation #113

Closed etiennebacher closed 4 months ago

etiennebacher commented 4 months ago

Currently tidypolars translates functions by prefixing the function name with pl_, which has 2 limitations:

  1. it doesn't support package namespaces (e.g. dplyr::) in expressions
  2. if two packages have the same function (e.g. dplyr::lag() and stats::lag()) then I have to favor one or handle arguments in a convoluted way

One solution could be to do what arrow does (based on a quick glance at their internals) and populate an environment, which they call .cache.

Edit: actually an environment may not be required, I just need to extract the info on which namespace a function comes from and then have functions like pl_dplyr_lag and pl_stats_lag:

getNamespaceName(environment(lag))
#>    name 
#> "stats"
library(dplyr, warn.conflicts = FALSE)
getNamespaceName(environment(lag))
#>    name 
#> "dplyr"
getNamespaceName(environment(stats::lag))
#>    name 
#> "stats"
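
A minimal sketch of how that lookup could drive the dispatch, assuming translations are named pl_<package>_<function>. The helper translation_name() is hypothetical and not part of tidypolars; it handles both explicit pkg::fun() calls and bare names resolved from the calling environment:

```r
# Hypothetical helper (not tidypolars' API): build the name of the
# translation function, e.g. "pl_stats_lag", for a quoted call.
translation_name <- function(call, env = parent.frame()) {
  fn <- call[[1]]
  if (is.call(fn) && identical(fn[[1]], quote(`::`))) {
    # Explicit namespace, e.g. dplyr::lag(x): read it from the call itself
    pkg <- as.character(fn[[2]])
    name <- as.character(fn[[3]])
  } else {
    # Bare name, e.g. lag(x): find the function visible from `env`
    name <- as.character(fn)
    f <- get(name, envir = env, mode = "function")
    fn_env <- environment(f)
    if (is.null(fn_env)) {
      # Primitives such as sum() have no environment; they live in base
      pkg <- "base"
    } else {
      ns <- topenv(fn_env)
      # User-defined functions do not come from a namespace
      if (!isNamespace(ns)) return(NA_character_)
      pkg <- unname(getNamespaceName(ns))
    }
  }
  paste("pl", pkg, name, sep = "_")
}

translation_name(quote(stats::lag(x)))
#> [1] "pl_stats_lag"
translation_name(quote(dplyr::lag(x)))
#> [1] "pl_dplyr_lag"
```

For a bare lag(x), the result depends on which packages are attached at the time of the call, which is exactly the masking behaviour shown above.
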

eitsupi commented 4 months ago

> 2. if two packages have the same function (e.g dplyr::lag() and stats::lag()) then I have to favor one or handle arguments in convoluted way

IIUC, arrow_dplyr_query does not recognize which package a function came from. Each function is simply registered under two names, for example foo::bar and bar. See apache/arrow#13160
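
To illustrate that description, here is a minimal sketch of a registry that stores one translation under both names. This is not arrow's actual code: registry and register_translation() are hypothetical names used only for this example.

```r
# Hypothetical registry: an environment mapping names to translations
registry <- new.env(parent = emptyenv())

# Register a translation under both "pkg::fun" and the bare "fun"
register_translation <- function(qualified_name, fun) {
  bare_name <- sub("^.*::", "", qualified_name)
  assign(qualified_name, fun, envir = registry)
  assign(bare_name, fun, envir = registry)
  invisible(fun)
}

register_translation("foo::bar", function(x) paste("translated", x))

# Both names point to the same translation, so a lookup by bare name
# cannot tell which package the user's bar() actually comes from
identical(registry[["bar"]], registry[["foo::bar"]])
#> [1] TRUE
```

With this design, a second package that also exports bar() would overwrite the bare entry, so the last registration wins regardless of what is attached in the user's session.
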

etiennebacher commented 4 months ago

Apparently arrow doesn't detect some cases when a function is masked by a package:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)

df <- data.frame(x = as.Date("2020-01-01"))
mt <- arrow_table(df, as_data_frame = FALSE)

# Rightfully errors since data.table::quarter() only has arg "x"
df |> 
  mutate(dt = quarter(x, fiscal_start = 2))
#> Error in `mutate()`:
#> ℹ In argument: `dt = quarter(x, fiscal_start = 2)`.
#> Caused by error in `quarter()`:
#> ! unused argument (fiscal_start = 2)
mt |> 
  mutate(dt = quarter(x, fiscal_start = 2)) |> 
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Function 'quarter' accepts 1 arguments but 2 passed

library(lubridate, warn.conflicts = FALSE)

# ideally shouldn't error because lubridate::quarter() now masks data.table::quarter()
df |> 
  mutate(lub = quarter(x, fiscal_start = 2)) 
#>            x lub
#> 1 2020-01-01   4
mt |> 
  mutate(lub = quarter(x, fiscal_start = 2)) |> 
  collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Function 'quarter' accepts 1 arguments but 2 passed

eitsupi commented 4 months ago

From reading the source code, it appears that in arrow_dplyr_query the quarter function is mapped directly to libarrow's quarter compute kernel to begin with, and that kernel takes only one argument.

https://github.com/apache/arrow/blob/53859262ea988f31ce33a469305251064b5a53b8/r/R/dplyr-funcs-simple.R#L19-L78