Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

by = .EACHI and summing over the empty set returning NA #4582

Open sindribaldur opened 4 years ago

sindribaldur commented 4 years ago

Is the following the correct behaviour?

library(data.table)
irisDT <- data.table(iris)
our_species <- c("setosa", "versicolor", "virginica", "pseudacorus")
irisDT[, Species := factor(Species, levels = our_species)]
setkey(irisDT, Species)
irisDT[J(our_species), .(.N, tsl = sum(Sepal.Length)), by = .EACHI]
#        Species  N   tsl
# 1:      setosa 50 250.3
# 2:  versicolor 50 296.8
# 3:   virginica 50 329.4
# 4: pseudacorus  0    NA

# Expected to get 0 not NA
sum(numeric(0)) # [1] 0; the documentation says "the sum of an empty set is zero, by definition"

# Current solution
irisDT[J(our_species), .(.N, tsl = sum(Sepal.Length)), by = .EACHI
       ][, tsl := fifelse(N>0L, tsl, 0)][]
## EDIT (cleaner solution)
irisDT[J(our_species), .(.N, tsl = sum(Sepal.Length, na.rm = TRUE)), by = .EACHI]
## EDIT ENDS

# Extra surprise
irisDT[1][J(our_species), .(.N, tsl = length(Sepal.Length)), by = .EACHI]
# It seems that we are dealing with an NA_real_ of length 1
irisDT[1][J(our_species), .(.N, tsl = is.na(Sepal.Length)), by = .EACHI]

Sorry if this has been discussed/documented elsewhere.
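
The difference between the two results above can be reproduced in base R alone. This is a minimal sketch, assuming (as the outputs in this comment suggest, not as something confirmed from the data.table source) that an unmatched group hands `j` a single `NA_real_` rather than a zero-length vector. The names `empty` and `padded` are illustrative only.

```r
# Base R only; data.table is not needed for this illustration.
empty  <- numeric(0)  # what a truly empty group would look like
padded <- NA_real_    # assumption: what j appears to receive for an unmatched group

sum(empty)                 # sum of an empty set
sum(padded)                # a single missing value propagates
sum(padded, na.rm = TRUE)  # the NA is dropped first, leaving an empty set
length(padded)             # consistent with a length-1 NA_real_
```

If that assumption holds, `na.rm = TRUE` returns 0 because it removes the padding value, not because the group was empty to begin with. It would also hide genuinely missing measurements in matched groups.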

jangorecki commented 4 years ago

I agree that the behaviour you are requesting makes sense, but I think it should be reviewed more carefully before we decide on a change. For completeness, here is the result when the group-during-join is broken down into two operations:

irisDT[J(our_species)][, .(.N, tsl = sum(Sepal.Length)), by = Species]
#       Species     N   tsl
#        <char> <int> <num>
#1:      setosa    50 250.3
#2:  versicolor    50 296.8
#3:   virginica    50 329.4
#4: pseudacorus     1    NA
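
In the two-step form the unmatched species shows up as one padded row, which is why N is 1 rather than 0. The following is a sketch only, rebuilding the objects from the first comment so that it runs on its own; `joined` and `res` are illustrative names. It counts and sums only the non-missing values in each group.

```r
library(data.table)

# Rebuild the objects from the first comment.
irisDT <- data.table(iris)
our_species <- c("setosa", "versicolor", "virginica", "pseudacorus")
irisDT[, Species := factor(Species, levels = our_species)]
setkey(irisDT, Species)

# Step 1: the join pads the unmatched species with a single all-NA row.
joined <- irisDT[J(our_species)]

# Step 2: count and sum only the non-missing values in each group.
res <- joined[, .(N   = sum(!is.na(Sepal.Length)),
                  tsl = sum(Sepal.Length, na.rm = TRUE)),
              by = Species]
res
```

Note that this treats a genuinely missing Sepal.Length the same way as an unmatched species, which may or may not be what is wanted.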

myoung3 commented 4 years ago

Related: #857 and the idea of allowing nomatch to specify that NAs be filled with zero in merges.
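
For context, a sketch of what `nomatch` does today, as far as I know: it only decides whether unmatched rows of `i` are kept (padded with NA) or dropped. A fill value such as zero is the proposal being referred to, not an existing option, so it is not shown here. The names `kept` and `dropped` are illustrative.

```r
library(data.table)

# Rebuild the objects from the first comment.
irisDT <- data.table(iris)
our_species <- c("setosa", "versicolor", "virginica", "pseudacorus")
irisDT[, Species := factor(Species, levels = our_species)]
setkey(irisDT, Species)

# Default (nomatch = NA): the unmatched species is kept and tsl is NA.
kept <- irisDT[J(our_species), .(.N, tsl = sum(Sepal.Length)), by = .EACHI]

# nomatch = NULL (older spelling: nomatch = 0L): unmatched rows of i are dropped.
dropped <- irisDT[J(our_species), .(.N, tsl = sum(Sepal.Length)),
                  by = .EACHI, nomatch = NULL]
```

Neither setting yields a row for the unmatched species with a sum of 0, which is the gap this issue describes.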