SebKrantz / collapse

Advanced and Fast Data Transformation in R
https://sebkrantz.github.io/collapse/
Other
644 stars 33 forks source link

Inconsistency between `fmean` and `fsum` #628

Open NicChr opened 2 weeks ago

NicChr commented 2 weeks ago

Hello, firstly great package again as always! Just wanted to highlight a few inconsistencies and possibly an erroneous result.

I noticed that when calling fmean() on an integer vector with NA values and groups, the result doesn't return NA as one might expect. I suspect it has something to do with the integer overflow issue mentioned in the documentation, but I'd at least expect a similar erroneous result with fsum which I'm not seeing there.

I also noticed that the length of the result returned by fsum() doesn't match that of fmean() in the case of integer vectors.

In any case I've attached some simple examples to demonstrate the behaviour I'm seeing.

library(collapse)
#> collapse 2.0.16, see ?`collapse-package` or ?`collapse-documentation`
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#> 
#>     D

fsum(c(1L, NA), na.rm = F) #Ok
#> [1] NA
fmean(c(1L, NA), na.rm = F) #Ok
#> [1] NA

fsum(c(1L, NA), na.rm = F, g = c(1L, 1L)) #Ok
#>  1 
#> NA
fmean(c(1L, NA), na.rm = F, g = c(1L, 1L)) #Not ok
#>           1 
#> -1073741824

fsum(integer()) #Ok
#> integer(0)
fmean(integer()) #Not ok
#> [1] NA

fsum(numeric()) #Ok
#> numeric(0)
fmean(numeric()) #Ok
#> numeric(0)

fsum(integer(), na.rm = F, g = integer()) #Ok
#> named integer(0)
fmean(integer(), na.rm = F, g = integer()) #Not ok
#> <NA> 
#>   NA

fsum(numeric(), na.rm = F, g = integer()) #Ok
#> named numeric(0)
fmean(numeric(), na.rm = F, g = integer()) #Ok
#> named numeric(0)

Created on 2024-08-26 with reprex v2.1.0

SebKrantz commented 2 weeks ago

Thanks! So actually all more involved numerical statistics (fmean, fmedian, fvar, fsd, fquantile) give NA when passed integer(), which is consistent with base R. What is inconsistent is that fsum(),fmin, and fmax() return integer(), instead of 0, Inf, -Inf. I can think about this for a while. I guess I didn't go with base R here because I did not like these defaults.

And yeah, fmean() for integers with groups currently does not do any checks if na.rm = FALSE. which gives a negative number here because NA_integer_ is the smallest integer in C. I'll see if I can fix that one.