Open Fred-Wu opened 7 months ago
Did you look at the manual that is linked from the vignette? An example has been there from the beginning: see ?substitute2. The example uses base R.
I feel this vignette is already too long; it covers cases ranging from very simple to more difficult ones. I don't think we should try to put everything into a vignette. The manual is a better place for that.
Your example never uses the function you defined. I would also advise using Stack Overflow for usage questions, since other people may be looking for an answer to the same question in the future.
Most past questions on Stack Overflow about programming on data.table ask how to wrap it in a function and pass variables as function arguments. This can already be achieved by using substitute/eval.
However, looking at the "new programming interface", I am confused about what is new in it and how it differs from the old way.
The programming interface from dplyr makes it easier to pass a variable name as a function argument, as below.
var_summary <- function(data, var) {
  data %>%
    summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
mtcars %>%
  group_by(cyl) %>%
  var_summary(mpg)
I was then wondering whether data.table's new interface makes this easier as well.
In the past, I could just use deparse/substitute/eval to pass variables from the data as function arguments:
overlay <- function(lng, lat, data, basemap) {
  lng_name <- deparse(substitute(lng))
  lat_name <- deparse(substitute(lat))
  data_name <- deparse(substitute(data))
  geo <- substitute(list(lng, lat))
  n_site <- dim(data)[1]
  boundary_name <- grep("CODE|MAIN", toupper(names(basemap)), value = TRUE)
  site_over_map <- data[, eval(geo)] %>%
    .[, eval(boundary_name) := as.character(NA)]
  coord <- st_as_sf(site_over_map[, 1:2], coords = 1:2, crs = st_crs(basemap))
  overlay_ <- st_intersects(coord, basemap)
  empty2NA_ <- sapply(overlay_, function(x) ifelse(length(x) == 0, NA, x))
  code <- sapply(empty2NA_, function(x) {
    ifelse(is.na(x), NA, basemap[[boundary_name]][x])
  })
  site_over_map[, 3] <- code
  return(site_over_map)
}
However, judging from others' explorations, the new interface is not too different from what could be done previously.
Therefore, I was asking how the new programming interface should be used when data.table is used inside a function whose arguments are variable names, and whether such examples could be added to the vignette so that people know how to use it correctly.
The example in the original post passes several variable names as a function argument to be used in the by argument, which is also quite common in my use case. However, it still requires using eval and substitute twice.
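For the several-names-in-by case, here is a minimal sketch. It assumes data.table 1.15.0 or later (where the env argument is available) and assumes the function receives column names as character strings rather than bare symbols. The helper name var_summary_by and the placeholders .var and grp_cols are made up for illustration and are not taken from the vignette.

```r
library(data.table)  # assumption: data.table >= 1.15.0, where `env` is available

# Hypothetical helper: `var` is one column name (string),
# `by_names` is a character vector of grouping column names.
var_summary_by <- function(data, var, by_names) {
  data[, .(n = .N, min = min(.var), max = max(.var)),
       by = grp_cols,
       # strings in env become symbols; a list of strings becomes a list(...) call
       env = list(.var = var, grp_cols = as.list(by_names))]
}

dt <- as.data.table(mtcars)
res <- var_summary_by(dt, "mpg", c("cyl", "gear"))
print(res)
```

With string arguments there is no eval or substitute in the function body at all. If bare symbols must be accepted instead, substitute is still needed to capture them.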
If your function takes symbols which then need to be passed downstream as symbols, then you have to use substitute. This is not related to data.table but to the Standard Evaluation rules in base R.
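A small base R sketch of that point, with no data.table involved. The function names col_range and col_range_wrapper are hypothetical and exist only to show why a second substitute is needed once a symbol has to travel through more than one function.

```r
# Innermost function: captures the symbol it was given and evaluates it in `data`.
col_range <- function(data, var) {
  expr <- substitute(range(var))        # e.g. range(mpg)
  eval(expr, data, parent.frame())
}

# Wrapper: it has to use substitute() again. Calling col_range(data, var)
# directly would make the inner substitute() return the symbol `var`,
# not the symbol the user typed.
col_range_wrapper <- function(data, var) {
  eval(substitute(col_range(data, var)))  # builds and runs col_range(mtcars, mpg)
}

res <- col_range_wrapper(mtcars, mpg)
print(res)
```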
Using Standard Evaluation rather than NSE in the new env argument was a design decision of mine. Therefore you cannot expect the env argument to capture the symbols that provide its values. For the same reason, .() will not work as an alias for list() in the env argument, as that would require NSE as well.
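As a rough illustration of what Standard Evaluation means here, the sketch below calls substitute2 directly. It assumes data.table 1.15.0 or later, which exports substitute2. The placeholder v and the column name mpg are chosen for illustration only.

```r
library(data.table)  # assumption: data.table >= 1.15.0, which exports substitute2()

# Values in `env` are ordinary R objects, evaluated before substitution.
# A character string is converted to a symbol ...
e1 <- substitute2(sqrt(v), env = list(v = "mpg"))

# ... and an already-quoted symbol is used as is.
e2 <- substitute2(sqrt(v), env = list(v = quote(mpg)))

print(e1)
print(e2)

# Writing list(v = mpg) instead would make R evaluate `mpg` right away,
# which fails unless an object of that name exists in scope.
```

This is why a function that wants to accept bare column names has to call substitute itself before handing the result to env.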
Based on my experience, I believe that using NSE for a meta-programming interface is a bad design decision.
If there is a consensus on adding NSE to env (because .() has come up already), I am fine with it, but I won't be the code owner of this part anymore.
Nevertheless, that does not mean there is no easier way to solve your problem. There may be one; I haven't really looked into it. It may be easier to solve it in base R first and then apply the solution to your data.table-based function.
This is one basic use case that exposes a weakness of your example (even though it behaves consistently):
data.frame(cyl=1, var=2, mpg=3) %>%
group_by(cyl) %>%
var_summary(mpg)
Using env, we avoid this confusion:
var_summary <- function(data, var) {
data[, .(n = .N, min=min(.var), max=max(.var)), by=cyl, env=list(.var=substitute(var))]
}
Even if we don't prefix var with a dot (.var),
data[, .(n = .N, min=min(var), max=max(var)), by=cyl, env=list(var=substitute(var))]
it will still be consistent, because env defines the evaluation environment and therefore masks anything inside. Because of that, we also know that in the case of
data.frame(cyl=1, var=2, mpg=3)
we will not end up using the var column. In contrast, with the {{ expression nesting we end up with counter-intuitive code, for which R developers (not to be confused with tidyverse developers) may expect completely different output.
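To check the masking claim, the env-based var_summary can be run on a table that has a column literally named var. This is a sketch assuming data.table 1.15.0 or later. The table is built with data.table() rather than data.frame() so that [ dispatches to data.table.

```r
library(data.table)  # assumption: data.table >= 1.15.0, where `env` is available

var_summary <- function(data, var) {
  data[, .(n = .N, min = min(.var), max = max(.var)),
       by = cyl,
       env = list(.var = substitute(var))]  # capture the symbol the caller typed
}

dt <- data.table(cyl = 1, var = 2, mpg = 3)
res <- var_summary(dt, mpg)
print(res)
# min and max are computed from the mpg column (value 3),
# not from the column named var (value 2)
```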
The current "Programming on data.table" vignette does not provide any examples on how the new programming interface should be used within a function when passing multiple data variables to a function is required.
Looking at the examples in a recent article by John MacKintosh, it seems that eval/substitute still needs to be used multiple times, which is not so different from the earlier version, while eval is now suggested as a retired interface. Is this the intended usage of the new programming interface?
Could more examples be provided in the vignette, especially on how it should be used within a function?