Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Aliasing issue with `:=` affecting a different column #5400

Open aquasync opened 2 years ago

aquasync commented 2 years ago

The code below triggers the bug for me (note that the assignment to col2 changes the value of col1!):

library(data.table)

coalesce = function(x, ...) {
  for (y in list(...)) {
    idx = is.na(x)
    x[idx] = if (length(y) != 1) y[idx] else y
  }
  x
}

dt = data.table(id=1:64, col1=0, col2=0)
print(dt[1, .(col1, col2)])
#    col1 col2
# 1:    0    0
dt[, col1 := coalesce(col2, 111)]
dt[, col2 := 999]
print(dt[1, .(col1, col2)])
#    col1 col2
# 1:  999  999

And my sessionInfo() output:

R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.2

loaded via a namespace (and not attached):
[1] compiler_4.0.5 tools_4.0.5   

Basically, it looks like col1 and col2 end up pointing at the same vector, so := modifies them both. I'm guessing they are shared but the reference counts are off, so := thinks it is safe to modify in place. It's not 100% clear to me whether the underlying bug is in base R or in data.table.
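For context, the behaviour one would normally expect is R's copy-on-modify: two names may share one vector, but the first write through either name triggers a copy. A minimal base-R sketch (the names `a` and `b` are illustrative and not part of the report):

```r
# Base R copy-on-modify: 'b' initially shares storage with 'a'.
a <- rep(0, 64)
b <- a

# The first write through 'b' makes R copy the vector,
# so 'a' is left untouched.
b[1] <- 999

print(a[1])
# [1] 0
print(b[1])
# [1] 999
```

`:=` deliberately modifies columns in place rather than copying, so it relies on no other column sharing the same underlying storage.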

When trying to put together a minimal repro, I noticed a few different changes that make this bug disappear:

tlapak commented 2 years ago

Thanks for reporting. You've found an interesting way to get R to return the result in an ALTREP wrapper. This is an optimisation where R avoids allocating memory (e.g. when you type 1:64, R doesn't actually allocate 64 integers immediately). In this case, R produces a wrapper that just points to the other column. As you have noticed, touching the column in any way expands the wrapper and the effect disappears. Because data.table circumvents many R mechanisms to achieve its efficiency, it goes to some lengths to catch these cases, but it currently misses this one.
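Until this case is caught, one possible workaround is to force a deep copy of the result before assigning it, so that col1 gets its own storage. The sketch below reuses the example from the report and wraps the right-hand side in data.table's `copy()`. That this avoids the aliasing is an assumption based on the diagnosis above; it is not something confirmed in this thread.

```r
library(data.table)

# Same helper as in the original report.
coalesce = function(x, ...) {
  for (y in list(...)) {
    idx = is.na(x)
    x[idx] = if (length(y) != 1) y[idx] else y
  }
  x
}

dt = data.table(id=1:64, col1=0, col2=0)

# Assumption: copy() deep-copies whatever coalesce() returns, so col1
# no longer refers to the vector that backs col2.
dt[, col1 := copy(coalesce(col2, 111))]
dt[, col2 := 999]

# col1 is expected to stay 0 while col2 becomes 999.
print(dt[1, .(col1, col2)])
```

The extra copy costs one allocation per assignment, which may matter for very large columns.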

aquasync commented 2 years ago

Wow thanks @tlapak, incredibly quick diagnosis. Would have never expected ALTREP to be the issue either.