Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.6k stars 982 forks source link

should support DT[FLAG, ] when FLAG is a logical vector of DT #4255

Closed shrektan closed 4 years ago

shrektan commented 4 years ago

If DT contains a logical vector (call it the FLAG column), we have to use DT[FLAG==TRUE] while DT[FLAG] will raise errors. It's kind of confusing for the newbies.

Any reason that we can't support this?

library(data.table)
tbl <- data.table(FLAG = c(TRUE, FALSE), VALUE = c(1, 0))
tbl[FLAG == TRUE]
#>    FLAG VALUE
#> 1: TRUE     1
tbl[FLAG == FALSE]
#>     FLAG VALUE
#> 1: FALSE     0
tbl[FLAG]
#> Error in `[.data.table`(tbl, FLAG): FLAG is not found in calling scope but it is a column of type logical. If you wish to select rows where that column contains TRUE, or perhaps that column contains row numbers of itself to select, try DT[(col)], DT[DT$col], or DT[col==TRUE] is particularly clear and is optimized. When the first argument inside DT[...] is a single symbol (e.g. DT[var]), data.table looks for var in calling scope.
tbl[!FLAG]
#> Error in `[.data.table`(tbl, !FLAG): FLAG is not found in calling scope but it is a column of type logical. If you wish to select rows where that column contains TRUE, or perhaps that column contains row numbers of itself to select, try DT[(col)], DT[DT$col], or DT[col==TRUE] is particularly clear and is optimized. When the first argument inside DT[...] is a single symbol (e.g. DT[var]), data.table looks for var in calling scope.
flag <- tbl$FLAG
tbl[flag]
#>    FLAG VALUE
#> 1: TRUE     1
tbl[!flag]
#>     FLAG VALUE
#> 1: FALSE     0

Created on 2020-02-23 by the reprex package (v0.3.0)

Session info ``` r devtools::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.2 (2019-12-12) #> os macOS Catalina 10.15.3 #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Asia/Shanghai #> date 2020-02-23 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0) #> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.0) #> callr 3.4.1 2020-01-24 [1] CRAN (R 3.6.0) #> cli 2.0.1 2020-01-08 [1] CRAN (R 3.6.0) #> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0) #> data.table * 1.12.8 2019-12-09 [1] CRAN (R 3.6.0) #> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0) #> devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.0) #> digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.0) #> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.0) #> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0) #> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.0) #> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0) #> glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0) #> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0) #> htmltools 0.4.0.9002 2020-01-27 [1] Github (rstudio/htmltools@e07546c) #> knitr 1.27.2 2020-01-31 [1] local #> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0) #> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0) #> pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.0) #> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0) #> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.0) #> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.0) #> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0) #> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0) #> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.0) #> remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0) #> rlang 0.4.4 2020-01-28 [1] CRAN (R 3.6.2) #> rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.0) #> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0) #> stringi 1.4.5 2020-01-11 [1] CRAN (R 3.6.0) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0) #> testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.0) #> usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.0) #> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0) #> xfun 0.12 2020-01-13 [1] CRAN (R 3.6.0) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2) #> #> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library ```
MichaelChirico commented 4 years ago

DT[(FLAG)] works though; allowing DT[FLAG] opens the Pandora's box for ambiguity -- is FLAG in DT or is it in the parent frame? what if FLAG is in both DT & parent.frame?

renkun-ken commented 4 years ago

I use DT[(FLAG)] in this case.

I think the data.table behavior/semantics should not be determined dynamically by whether the column exists in the data.table or it would be hard for script reader to guess the intention of the code because the reader needs to know if this column exists. If the semantics is quite static, it would be easier to manage and work with.

shrektan commented 4 years ago

Sorry, I don't get your guys' point. Even within (), the variable is determined dynamically. Isn't it? That's said, the variable inside () doesn't guarantee it's a column inside of DT.

library(data.table)
tbl <- data.table(VALUE = c(1, 0))
flag <- c(FALSE, TRUE)
tbl[(flag)]
#>    VALUE
#> 1:     0
tbl[, flag := !flag]
tbl[(flag)]
#>    VALUE flag
#> 1:     1 TRUE

Created on 2020-02-23 by the reprex package (v0.3.0)

renkun-ken commented 4 years ago

Sorry for not making my point clear.

My point is that we already use a lot of NSE in data table syntax by customizing the interpretation of expressions in [.data.table which is dynamic but the semantics or the meaning of the code we are trying to define is not necessarily (and in my point of view, should not) dynamically determined (not depending on if a column exists or something we only know at runtime).

Suppose we are only looking at the following code:

tbl[flag]

If the meaning or the semantics of the code depends on if flag is a column in tbl, then one has to know if tbl has flag at runtime at this point to know what the author is trying to do.

By contrast, if tbl[flag] is always evaluated in the same way without depending on whether flag is a column of tbl, then the reader of this code knows what the author is trying to do without having to know any runtime information, if the reader knows the evaluation rule. At this level, @shrektan is trying to say that the evaluation rule here might not be natural enough.

Currently, the evaluation rule of tbl[flag] is simple: if flag is a symbol then dt assumes it is an external selector and always tries to find it in the calling scope. Although the evaluation of flag follows dynamic scoping, but the semantics or how it is evaluated is predetermined and is not determined at runtime, not depending on if a column exists.

This makes the code much easier to maintain and read in complex and collaborative data project.

MichaelChirico commented 4 years ago

Agree with @renkun-ken -- in principle, we could examine whether flag is a logical column of tbl during evaluation of [, and then assume the user is trying to subset with this column;

The current behavior allows (with just two extra characters) the user to signal the intent that flag is a column unambiguously.

The former comes at the cost of (1) increasing the complexity of our codebase (2) introducing some assumptions which are hard to get around [i.e., suppose flag shouldn't be read as being a column of tbl, how can the user assert this so that we don't treat it like that?]

MichaelChirico commented 4 years ago

About whether ( forces flag to be a column of tbl -- it doesn't , but it clearly establishes the scoping rules --

shrektan commented 4 years ago

Thanks, I see your points. Since I always use UPPERCASE as the column and smallercase as the variable name to avoid the ambiguity, I ignored one possibility that is the i in DT[i] could be more things like a character vector or another data.table.

By always evaluating the i above in the external frame indeed is better to understand in such cases (as well as the maintainability).

So I'm closing this issue.

@renkun-ken @MichaelChirico Thanks for both of your answers.

ColeMiller1 commented 4 years ago

@MichaelChirico [ already checks to see if flag is a logical. That's why there's a helpful error telling users that they can use (FLAG) | dt$FLAG | FLAG == TRUE. But it is surprising behavior to a user- why does dt[FLAG & FLAG] work?

We could easily accommodate the request with adding an if else statement to the symbol evaluation branch. I compiled the below changes - only test 1773.01 is affected. edit: added else statement at the end to make it work for real on tbl[FLAG].

##########see line 366 in data.table.R
      # isub is a single symbol name such as B in DT[B]  
      i = try(eval(isub, parent.frame(), parent.frame()), silent=TRUE)
      if (inherits(i,"try-error")) {
        # must be "not found" since isub is a mere symbol
        col = try(eval(isub, x), silent=TRUE)  # is it a column name?
  #########wrap with new if statement#################
 if(typeof(col) != "logical") { #####new line 1

        msg = if (inherits(col,"try-error")) " and it is not a column name either."
        else paste0(" but it is a column of type ", typeof(col),". If you wish to select rows where that column contains TRUE",
                    ", or perhaps that column contains row numbers of itself to select, try DT[(col)], DT[DT$col], or DT[col==TRUE] is particularly clear and is optimized.")
        stop(as.character(isub), " is not found in calling scope", msg,
             " When the first argument inside DT[...] is a single symbol (e.g. DT[var]), data.table looks for var in calling scope.")
}  else { #####new
          i = col   ####new
        } ####new

Regarding scoping, the description is inconsistent with j. edit - had these reversed initially.

library(data.table)
tbl <- data.table(FLAG = c(TRUE, FALSE), VALUE = c(1, 0))
col = "surprise"
tbl[, col := 1L][]
#>      FLAG VALUE   col
#>    <lgcl> <num> <int>
#> 1:   TRUE     1     1
#> 2:  FALSE     0     1
tbl[, (col) := 1L][]
#>      FLAG VALUE   col surprise
#>    <lgcl> <num> <int>    <int>
#> 1:   TRUE     1     1        1
#> 2:  FALSE     0     1        1

tbl <- data.table(FLAG = c(TRUE, FALSE), VALUE = c(1, 0))
tbl[, (new_col) := 1L]
#> Error in eval(lhs, parent.frame(), parent.frame()): object 'new_col' not found