Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.52k stars 967 forks source link

When using ".SDcols" and "by" in one call, and the function call produces NA for one group, but not the other, you get an error #5341

Closed CGlemser closed 9 months ago

CGlemser commented 2 years ago

I have run into the following situation (simplified for reproducibility):

Output of sessionInfo() R version 4.1.1 (2021-08-10) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale: [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 [4] LC_NUMERIC=C LC_TIME=German_Germany.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] compiler_4.1.1 cli_3.1.0 tools_4.1.1 data.table_1.14.2 rlang_1.0.

jangorecki commented 2 years ago

Maybe you can just return double type NA from your function instead of logical NA? Otherwise your function is not type stable.

CGlemser commented 2 years ago

Yes, I am aware of workarounds that will make it work and I have implemented it in my use case - I was just wondering if this is not something that could (and maybe should) be done by data.table implicitly, in particular cause we see this type of dynamic type conversion pretty much everywhere else in R. One example is if I for example use sumNA <- function(x) ifelse(all(is.na(x)), as.double(NA), sum(x, na.rm = TRUE)) but one column is type integer dt <- data.table(col1 = as.integer(c(rep(NA, 2), 1, 2)), col2 = rep(c(1,2), each = 2), groupby = c(1,1,2,2)) this again trips up data.table when calling dt[, lapply(.SD, sumNA), .SDcols = c("col1", "col2"), by = "groupby"] whereas I would usually expect R to dynamically change the typing of integer to double if needed, the same way e.g. class(as.integer(1) + as.double(1)) won't throw an error but simply return "numeric".

I am aware that this is not a huge issue and with some programming knowledge it should be easy to fix, but I find the very strong typing slightly peculiar and it might throw off users with less coding experience.

avimallu commented 2 years ago

From the discussions on Github, automatic optimization of ifelse has not been made to fifelse. Some more comments that might help shed some light:

  1. fifelse is not a complete replacement for ifelse. See discussion here.
  2. There is a PR pending that allows for this, but is not yet implemented for the reason in (1).
  3. If you want automatic typecasting to the relevant NA value, it should already be there in fifelse, implemented via this PR. It is not yet implemented for fcase, however.

My terminology might be inaccurate for some of the described operations, but I hope this is helpful.

jangorecki commented 2 years ago

Instead of coercing to double you can always use NAreal. This will not make it work well for mixed types on input, as example you asked. Best practice way to address this case is to have generic method and two methods, for double and for integers. Handling that in fifelse is another option, handling that in [ groupby call is probably not feasible because that means that for results of every group we have to check if types matches and then adjust them accordingly, which for low cardinality grouping will not be a big deal, but when there are many small groups will be definitely slowing down the aggregation.

CGlemser commented 2 years ago

Alright, thank you both for your input! Then I will simply keep type stability better in mind in future functions and switch to fifelse where possible :)

ben-schwen commented 2 years ago

Easiest option is probably to make a cast, e. g. as.numeric before returning in your function.

While as Jan pointed, it would be possible to do type checks and bump types if necessary, I think it would not really help users in the long run. Being aware of the different types of NA is something that useRs should learn imo.

jangorecki commented 9 months ago

Closing. As explained above, checking class of returned object for each group separately is too big overhead. Functions used in groupby should be type stable.