Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 985 forks source link

Assigning some values to randomly selected rows leads to incorrect results #5703

Open huzefaKhalil opened 1 year ago

huzefaKhalil commented 1 year ago

I needed to change the value in cells of rows selected at random but doing so gives incorrect results. There is no error, just the result is incorrect, hence it was particularly hard to track this down when dealing with thousands of row entries.

library(data.table)

options(datatable.verbose = TRUE)

set.seed(123)

dt <- data.table(
  ID = 1000 + c(1:10),
  Event = sample(c(T, F), 10, replace = T)
)

# choose an ID at random to change to FALSE
nToChange <- 1
dt[ID %in% sample(dt[Event == TRUE]$ID, nToChange)]$Event <- FALSE
dt

      ID Event
 1: 1001  TRUE
 2: 1002  TRUE
 3: 1003  TRUE
 4: 1004 FALSE
 5: 1005  TRUE
 6: 1006 FALSE
 7: 1007 FALSE
 8: 1008 FALSE
 9: 1009  TRUE
10: 1005 FALSE

As you can see above, the ID for 1010 has been changed to 1005!

Selecting the rows to change outside the [] gives correct results.


set.seed(123)

dt <- data.table(
  ID = 1000 + c(1:10),
  Event = sample(c(T, F), 10, replace = T)
)

# choose an ID at random to change to FALSE
nToChange <- 1
idsToChange <- sample(dt[Event == TRUE]$ID, nToChange)
dt[ID %in% idsToChange]$Event <- FALSE
dt

      ID Event
 1: 1001  TRUE
 2: 1002  TRUE
 3: 1003  TRUE
 4: 1004 FALSE
 5: 1005 FALSE
 6: 1006 FALSE
 7: 1007 FALSE
 8: 1008 FALSE
 9: 1009  TRUE
10: 1010  TRUE

The verbose output from the first data.table assignment is given below.

Creating new index 'Event'
Creating index Event done in ... forder.c received 10 rows and 2 columns
forder took 0 sec
0.040s elapsed (0.039s cpu) 
Optimized subsetting with index 'Event'
forder.c received 1 rows and 1 columns
forder took 0 sec
x is already ordered by these columns, no need to call reorder
i.Event has same type (logical) as x.Event. No coercion needed.
on= matches existing index, using index
Starting bmerge ...
bmerge done in 0.000s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Creating new index 'ID'
Creating index ID done in ... forder.c received 10 rows and 2 columns
forder took 0 sec
0.033s elapsed (0.032s cpu) 
Optimized subsetting with index 'ID'
forder.c received 1 rows and 1 columns
forder took 0 sec
x is already ordered by these columns, no need to call reorder
i.ID has same type (double) as x.ID. No coercion needed.
on= matches existing index, using index
Starting bmerge ...
bmerge done in 0.000s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Assigning to all 1 rows
RHS_list_of_columns == false
RHS for item 1 has been duplicated because NAMED==5 MAYBE_SHARED==1, but then is being plonked. length(values)==1; length(cols)==1)
Optimized subsetting with index 'Event'
forder.c received 1 rows and 1 columns
forder took 0 sec
x is already ordered by these columns, no need to call reorder
i.Event has same type (logical) as x.Event. No coercion needed.
on= matches existing index, using index
Starting bmerge ...
bmerge done in 0.000s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Assigning to 1 row subset of 10 rows
RHS_list_of_columns == true
Dropping index 'Event' due to an update on a key column
Dropping index 'ID' due to an update on a key column

My sessionInfo() is below.


R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Detroit
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.8

loaded via a namespace (and not attached):
[1] compiler_4.3.1    tools_4.3.1       rstudioapi_0.15.0
jangorecki commented 1 year ago

Are you able to reproduce the problem if you assign using :=? Probably index is invalid after <-

huzefaKhalil commented 1 year ago

:= works fine.

dt[select, "col"] <- FALSE works fine as well.

Only dt[select]$col <- FALSE gives an incorrect result without an error.

AngelFelizR commented 1 year ago

dt[select]$col <- FALSE

It is a bad practice and we need a error to stop it if possible, otherwise we just need to document this behavior in the don'ts wiki

ben-schwen commented 1 year ago

dt[select]$col <- FALSE

It is a bad practice and we need a error to stop it if possible, otherwise we just need to document this behavior in the don'ts wiki

Nah, it should just work since DT is an extension of DF

AngelFelizR commented 1 year ago

I think the data.frame's sintax would be

dt$col[select] <- FALSE
ben-schwen commented 10 months ago

Dived deeper into this problem. The real problem is the subsequent call on [<-.data.table which evaluates isub = ID %in% sample(dt[Event == TRUE]$ID, nToChange). However, before getting called in "[<-", ID %in% sample(dt[Event == TRUE]$ID, nToChange) will already have been evaluated once for the value of "[<-" and therefore the next evaluation will take the next random sample.

tldr; we do something similar to eval(expression(sample(10))) twice which gives (now not surprisingly) two different random samples. I do not see how we could fix or improve this besides triggering a warning when we spot a sample in i, but ofc this applies to every non-deterministic function evaluation.

We also have the long standing comment since 2016 # TO DO: warning("Please use DT[i,j:=value] syntax instead of DT[i,j]<-value, for efficiency. See ?':='")