Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

unexpected effects of setindex vs. setkey #5321

Open sukort opened 2 years ago

sukort commented 2 years ago

DT <- data.table(a = c(TRUE, FALSE), b = 1:2)

setindex(DT, a)
DT[ .(TRUE), ] # ERROR

setkey(DT, a)
DT[ .(TRUE), ] # 1: TRUE

Same with column b. Neither the documentation nor the vignettes give any hint about this difference in behavior. Does setindex() work only with character-valued columns? Please add a note about this to the documentation of your great package.

sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 11 (bullseye)

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
 [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
 [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.0.4

shrektan commented 2 years ago

This is expected.

The i argument is .(TRUE) in your example.

.(TRUE) is equivalent to list(TRUE) and is converted to a data.table internally, so your example is equivalent to DT[data.table(TRUE), ]. In that case, either an extra on argument must be supplied or the data.table must be keyed.

You can find the documentation via ?data.table::data.table.
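As a minimal sketch of the point above, reusing the DT from the original report: with only a secondary index set, the join column has to be named through on. The names res_on and res_err are illustrative only.

```r
library(data.table)

DT <- data.table(a = c(TRUE, FALSE), b = 1:2)
setindex(DT, a)

# A secondary index does not make DT keyed, so the join column is named via `on`.
res_on <- DT[.(TRUE), on = "a"]

# Without `on` (and without a key) the same call fails; the error is captured
# here so that the script keeps running.
res_err <- tryCatch(DT[.(TRUE)], error = function(e) e)
inherits(res_err, "error")
```

After setkey(DT, a) the call DT[.(TRUE)] works without on, because the key tells data.table which column to join on.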

[Image: Screenshot 2022-01-31 23 19 13]

sukort commented 2 years ago

Thank you very much for your very quick answer! I see the point now. I don't know why I mixed these things up!

MichaelChirico commented 2 years ago

Please share the error message you're getting, and let us know if anything about the message was unclear and could be improved.

sukort commented 2 years ago

Thank you for your request. After reading the vignette "Secondary indices and auto indexing", I got the impression that the main difference between setindex and setkey is that setindex skips the reordering step, and that usage would otherwise be quite similar. So I was thinking of a discrepancy between the error message and the vignette. E.g.:

Secondary indices are similar to keys in data.table, except for two major differences:

  • It doesn’t physically reorder the entire data.table in RAM. Instead, it only computes the order for the set of columns provided and stores that order vector in an additional attribute called index.
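A small sketch of what the quoted passage describes, using a made-up three-row table (the table and variable names are illustrative assumptions): setindex() leaves the row order alone and only records an index, while setkey() sorts the rows.

```r
library(data.table)

DT <- data.table(a = c(3L, 1L, 2L), b = c("x", "y", "z"))

setindex(DT, a)
order_after_index <- copy(DT$a)  # copy() so later by-reference changes cannot touch it
index_names <- indices(DT)       # names of the stored secondary indices

setkey(DT, a)
order_after_key <- copy(DT$a)    # rows are now physically sorted by a

order_after_index
order_after_key
```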

After reading it again and again, I see more clearly that setindex only provides an index and that the on argument is needed to make use of it. Maybe the hint from ?data.table::data.table (see shrektan's comment above) could be mentioned to make the vignette a bit clearer?

It also confused me a bit that keys and indices can exist at the same time and that the on argument can also be used with keys. So what happens in the background? Is there a hierarchical procedure that first looks for a key, then for an existing secondary index, and creates a temporary one if neither is found for the given column name(s)?
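One way to explore this question without relying on the internals is to inspect key() and indices() and to pass verbose = TRUE to the subset. This is only a sketch on a made-up table; the verbose messages differ between data.table versions, so no output is shown here.

```r
library(data.table)

DT <- data.table(a = c(2L, 1L, 2L), b = c("y", "x", "z"))

setkey(DT, a)    # set the key first, since setkey() clears existing indices
setindex(DT, b)  # then add a secondary index on another column

key(DT)      # the key column(s)
indices(DT)  # the secondary index name(s)

# verbose = TRUE makes data.table print notes on how each subset is carried out
by_key   <- DT[.(2L), on = "a", verbose = TRUE]
by_index <- DT[.("x"), on = "b", verbose = TRUE]
```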

Btw: To better understand the difference between keys and indices, I ran some small simulations. These experiments (maybe not representative) showed a small time benefit of setkey over setindex when the key or the index is already set. Likewise, when the index is already set, setindex shows a clear time benefit over relying only on temporary secondary indices.

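library(data.table)

n <- 8L
m <- 20L
xxx <- data.table(a = sample(1:10, 10^n, TRUE))

# 1. Key: set it once, then subset m times
profvis::profvis({
    setkey(xxx, a)
    replicate(m, {
        xxx[.(1L), on = "a"]
    })
})
setkey(xxx, NULL)  # remove the key (the rows stay in sorted order)

# 2. Secondary index: set it once, then subset m times
profvis::profvis({
    setindex(xxx, a)
    replicate(m, {
        xxx[.(1L), on = "a"]
    })
})
setindex(xxx, NULL)  # remove the index

# 3. Neither key nor index: only on = is used for each subset
profvis::profvis({
    replicate(m, {
        xxx[.(1L), on = "a"]
    })
})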

In my opinion, this somewhat contradicts the hint in the vignette:

Since the time to compute the secondary index is quite small, we don’t have to use setindex(), unless, once again, the task involves repeated subsetting on the same column.

Maybe the vignette could provide three small examples for a direct comparison of setkey, setindex and a temporary secondary index?
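A rough sketch of what such a side-by-side comparison might look like. The table size, the helper names (base, keyed, indexed, plain) and the use of system.time() are assumptions for illustration, not taken from the vignette, and single timings on a small table are noisy.

```r
library(data.table)

set.seed(1)
n <- 1e5L  # hypothetical size; increase it to see clearer timing differences
base <- data.table(a = sample(1:10, n, TRUE), v = seq_len(n))

# 1. Key: rows are physically sorted by a
keyed <- copy(base)
setkey(keyed, a)
t_key <- system.time(res_key <- keyed[.(1L), on = "a"])

# 2. Secondary index: row order untouched, order vector stored as an attribute
indexed <- copy(base)
setindex(indexed, a)
t_index <- system.time(res_index <- indexed[.(1L), on = "a"])

# 3. Neither key nor index: only `on` is supplied for the subset
plain <- copy(base)
t_plain <- system.time(res_plain <- plain[.(1L), on = "a"])

# Elapsed seconds for the three variants
rbind(key = t_key, index = t_index, on_only = t_plain)[, "elapsed", drop = FALSE]
```

All three variants select the same rows; only the preparation step (setkey, setindex, or nothing) differs.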