Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com

Need examples in vignette on how the new programming interface should be used within a function. #5991

Open Fred-Wu opened 7 months ago

Fred-Wu commented 7 months ago

The current "Programming on data.table" vignette does not show how the new programming interface should be used inside a function that takes multiple data variables as arguments.

Judging by the examples in a recent article by John MacKintosh, eval / substitute still has to be used several times, which is not much different from before, even though eval is now described as a retired interface.

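library(data.table)
library(dplyr)  # only needed for the starwars dataset

starwars_dt <- setDT(copy(starwars))

my_summarise_dt <- function(.dt, ...) {

  # capture the grouping variables passed in ... as a list of unevaluated symbols
  vars <-  eval(substitute(alist(...)),
                envir = parent.frame())

  # vars is substituted into by through env
  .dt[, lapply(.SD, mean, na.rm = TRUE),
      .SDcols =c("mass", "height"),
      by = vars,
      env = list(vars = substitute(vars))][]
}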

Is this the intended usage of the new programming interface?

Could more examples be provided in the vignette, especially on how it should be used within a function?

jangorecki commented 7 months ago

Did you look at the manual that is linked from the vignette? An example has been there from the beginning: see ?substitute2. The example uses base R.

I feel this vignette is already too long; it covers cases ranging from very simple to more difficult ones. I don't think we should try to put everything into a vignette. The manual is a better place for that.

Your example lacks a call to the function you defined. Also, I would advise using Stack Overflow for usage questions, since other people may be looking for the same answer in the future.
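
As a rough illustration of what ?substitute2 covers, here is a minimal sketch. It assumes data.table 1.15.0 or later, where substitute2 is exported; the placeholder names fun and col are invented for this example. Character strings supplied in env are turned into symbols before substitution.

```r
library(data.table)  # assumption: version >= 1.15.0

# `fun` and `col` are hypothetical placeholder names used only in this sketch.
# Character values in env are converted to symbols and substituted into the call.
expr <- substitute2(fun(col), env = list(fun = "min", col = "mpg"))

print(expr)                       # the substituted, still unevaluated call
min_mpg <- eval(expr, mtcars)     # evaluate it against a base R data.frame
print(min_mpg)
```

The env argument of [.data.table is documented as being passed to substitute2, so the same conversion rules should apply there.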

Fred-Wu commented 7 months ago

Most past Stack Overflow questions about programming on data.table ask how to wrap it in a function and pass variables as function arguments. This can already be achieved with substitute/eval.

However, looking at the "new programming interface", I am confused about what is new in it and how it differs from the old way.

The programming interface in dplyr makes it easier to pass a variable name as a function argument, as below.

var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
mtcars %>% 
  group_by(cyl) %>% 
  var_summary(mpg)

I was then wondering whether data.table's new interface makes this easier as well.

In the past, I could just use deparse/substitute/eval to pass variables from the data as function arguments:

overlay <- function(lng, lat, data, basemap) {

    lng_name <- deparse(substitute(lng))
    lat_name <- deparse(substitute(lat))
    data_name <- deparse(substitute(data))

    geo <- substitute(list(lng, lat))
    n_site <- dim(data)[1]
    boundary_name <- grep("CODE|MAIN", toupper(names(basemap)), value = TRUE)
    site_over_map <- data[, eval(geo)] %>%
        .[,  eval(boundary_name) := as.character(NA)]

    coord <- st_as_sf(site_over_map[, 1:2], coords = 1:2, crs = st_crs(basemap))
    overlay_ <- st_intersects(coord, basemap)
    empty2NA_ <- sapply(overlay_, function(x) ifelse(length(x) == 0, NA, x))
    code <- sapply(empty2NA_, function(x) {
        ifelse(is.na(x), NA, basemap[[boundary_name]][x])
    })
    site_over_map[, 3] <- code

    return(site_over_map)
}

However, from others' explorations, the new interface does not seem too different from what could be done previously.

Therefore, I was asking how the new programming interface should be used when data.table is used inside a function whose arguments are variable names, and whether such examples could be added to the vignette so that people know how to use it correctly.

The example in the original post passes several variable names as a function argument to be used in by, which is also quite common in my use case. However, it still requires using eval and substitute twice.
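
For comparison, the function from the original post can probably be written with a single substitute() call and no eval(). The sketch below assumes data.table 1.15.0 or later (for the env argument). The function name mean_by and the value_cols argument are hypothetical, and mtcars stands in for starwars so the example needs no extra packages.

```r
library(data.table)  # assumption: version >= 1.15.0, which provides `env`

# Hypothetical rewrite of my_summarise_dt; `value_cols` is an added argument.
mean_by <- function(dt, value_cols, ...) {
  # One substitute() call captures the dots as the call list(<arg1>, <arg2>, ...)
  by_vars <- substitute(list(...))
  dt[, lapply(.SD, mean, na.rm = TRUE),
     .SDcols = value_cols,
     by = by_vars,                    # placeholder, replaced through env
     env = list(by_vars = by_vars)]   # a call is substituted as-is
}

dt <- as.data.table(mtcars)
res <- mean_by(dt, c("mpg", "hp"), cyl, am)  # bare column names, as in the post
print(res)
```

The one remaining substitute() is needed because the caller passes bare symbols; env itself only receives an ordinary value.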

jangorecki commented 7 months ago

If your function takes symbols that then need to be passed downstream as symbols, you have to use substitute. This is not related to data.table but to the standard evaluation rules of base R.

Using standard evaluation (SE) rather than non-standard evaluation (NSE) in the new env argument was my design decision. Therefore you cannot expect the env argument to capture the symbols that supply its values. For the same reason .() will not work as an alias for list() in the env argument, as that would require NSE as well.

Based on my experience, I believe using NSE for a meta-programming interface is a bad design decision. If there is a consensus on adding NSE to env (the .() request has come up already), I am fine with it, but I won't be the code owner of this part anymore.

Nevertheless, that does not mean there is no easier way to solve your problem. There may be; I haven't really looked into it. It may be easier to solve it in base R first and then apply the solution to your data.table-based function.
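
One way to read the standard-evaluation design is that a function can take column names as plain strings and hand them to env, with no substitute() at all. The sketch below assumes data.table 1.15.0 or later; the function name var_summary_chr and the placeholders .var and .by_cols are hypothetical.

```r
library(data.table)  # assumption: version >= 1.15.0, which provides `env`

# Hypothetical SE variant: column names arrive as character values.
var_summary_chr <- function(data, var, by) {
  stopifnot(is.character(var), length(var) == 1L, is.character(by))
  data[, .(n = .N, min = min(.var), max = max(.var)),
       by = .by_cols,
       env = list(
         .var = var,                # "mpg" becomes the symbol mpg
         .by_cols = as.list(by)     # list("cyl", "am") becomes list(cyl, am)
       )]
}

res <- var_summary_chr(as.data.table(mtcars), var = "mpg", by = c("cyl", "am"))
print(res)
```

Because every argument is an ordinary value, such a function can be called from other functions or loops without any quoting step.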

jangorecki commented 7 months ago

This is one basic use case that exposes a weakness of your example (even though it behaves consistently):

data.frame(cyl=1, var=2, mpg=3) %>% 
  group_by(cyl) %>% 
  var_summary(mpg)

Using env we avoid this confusion:

var_summary <- function(data, var) {
  data[, .(n = .N, min=min(.var), max=max(.var)), by=cyl, env=list(.var=substitute(var))]
}

Even if we don't prefix var with a dot to make it .var,

data[, .(n = .N, min=min(var), max=max(var)), by=cyl, env=list(var=substitute(var))]

it will still be consistent, because env defines the evaluation environment and therefore masks anything inside it. Because of that, we also know that in the case of

data.frame(cyl=1, var=2, mpg=3)

we will not end up using the var column. By contrast, with the {{ expression nesting we end up with code that is counter-intuitive, where R developers (not to be confused with tidyverse developers) may expect completely different output.
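
The masking claim for env can be checked with a small runnable sketch. It assumes data.table 1.15.0 or later and uses the unprefixed placeholder var on purpose, with a table that also has a column named var.

```r
library(data.table)  # assumption: version >= 1.15.0, which provides `env`

var_summary <- function(data, var) {
  # substitute(var) yields the symbol the caller passed (here: mpg);
  # env then replaces every `var` in j with that symbol.
  data[, .(n = .N, min = min(var), max = max(var)),
       by = cyl,
       env = list(var = substitute(var))]
}

# The table deliberately contains a column called `var`.
dt <- data.table(cyl = 1, var = 2, mpg = 3)
res <- var_summary(dt, mpg)
print(res)  # min and max come from mpg (3), not from the column var (2)
```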