Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.6k stars 981 forks source link

:= adds factor-levels silently #3922

Open jaapwalhout opened 5 years ago

jaapwalhout commented 5 years ago

Consider the following example data:

df <- data.frame(x = rep(c("A","B"), 2), y = 1:4)
dt <- as.data.table(df)

Now when I want to change the 3rd value of x in df to C, I will get an error with the dataframe:

> df$x[df$y == 3] <- "C"
Warning message:
In `[<-.factor`(`*tmp*`, df$y == 3, value = c(1L, 2L, NA, 2L)) :
  invalid factor level, NA generated

However when I want to this with the datable using := to update, there is no error-message (or a warning) and C is added as a factor level.

> str(dt$x)
 Factor w/ 2 levels "A","B": 1 2 1 2
> dt[y == 3, x := "C"][]
   x y
1: A 1
2: B 2
3: C 3
4: B 4
> str(dt$x)
 Factor w/ 3 levels "A","B","C": 1 2 3 2

To me this is unexpected behavior. I expected an error like when using a dataframe. When this is intended behavior, a warning would be appropriate.


Possibly related to #2403


Session info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] nl_NL.UTF-8/nl_NL.UTF-8/nl_NL.UTF-8/C/nl_NL.UTF-8/nl_NL.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bench_1.0.4           microbenchmark_1.4-7  data.table_1.12.2     ggplot2_3.2.1         anytime_0.3.6         splitstackshape_1.4.8 zoo_1.8-6            
 [8] lubridate_1.7.4       purrr_0.3.2           readr_1.3.1           tibble_2.1.3          tidyr_1.0.0           dplyr_0.8.3          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2       pillar_1.4.2     compiler_3.6.1   tools_3.6.1      zeallot_0.1.0    lifecycle_0.1.0  gtable_0.3.0     lattice_0.20-38  pkgconfig_2.0.3  rlang_0.4.0     
[11] rstudioapi_0.10  withr_2.1.2      stringr_1.4.0    vctrs_0.2.0      hms_0.5.1        grid_3.6.1       tidyselect_0.2.5 glue_1.3.1       R6_2.4.0         magrittr_1.5    
[21] backports_1.1.4  scales_1.0.0     assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0    crayon_1.3.4
MichaelChirico commented 5 years ago

could you try on current master? there was just a big change related to assign that affected factor levels as well that may have fixed this? my laptop is running some heavy stuff at the moment or I'd gladly check

Henrik-P commented 5 years ago

At least the behaviour seems consistent with the help text for :=:

When LHS is a factor column and RHS is a character vector with items missing from the factor levels, the new level(s) are automatically added (by reference, efficiently), unlike base methods.

jaapwalhout commented 5 years ago

@MichaelChirico Would like to, but currently not in the situation I can.

Thx @Henrik-P, I should have checked.

As it is thus intended behavior, it would be appropriate (imho) to give a warning instead of silently updating the factor levels. Now it can catch people by surprise and can therefore have unintended consequences (as it did for me).

jangorecki commented 4 years ago

IMO raising warning is not really good solution. This is a feature over data.frame way, well documented for long time. So if suddenly we will start to raise warning about it I don't think it will be nice for users, that rely on it. And I can imagine that more people rely on that convenient feature. Best way might be not to add a wrapper that checks all(values %chin% levels(col)) before attempting to assign.