Rdatatable / data.table


Spurious warning when grouping and aggregating with min() or max() for a query that returns an empty table #5626

dvg-p4 closed this issue 1 year ago

dvg-p4 commented 1 year ago

Note to future searchers:

This behavior is the result of data.table running the aggregation function at least once even on an empty table, which is intentional and good (see the discussion below). The warning message comes from base R, which raises it when min is run on a zero-length vector (as it must be to consistently generate correctly-typed empty columns). It can be suppressed with suppressWarnings([dt call]) or options(warn = -1) if needed.

Description

If a data.table query uses by, and an aggregate expression that uses min or max, and returns zero rows, there will be a warning message printed along the lines of "no non-missing arguments to min; returning Inf". This is spurious, IMO, since Inf is not actually being returned--the query is returning an empty table, as expected.

The actual return values are consistent with behavior for non-empty results, but the warning is annoying--I have code that runs in a giant loop (calculating values for ~thousands of columns), and about a hundred of those have an empty aggregate table in an intermediate part of the calculation, so I get spammed with ~ a hundred warnings when my code runs successfully.

Minimal reproducible example

More realistic case

> library(data.table)
> mydt <- data.table(foo = c(1,1,2,2,2,3), bar = c(0,1,0,1,2,0), baz = c(4,2,2,5,3,8))
> mydt
   foo bar baz
1:   1   0   4
2:   1   1   2
3:   2   0   2
4:   2   1   5
5:   2   2   3
6:   3   0   8
> mydt[bar > 0, min(baz), by = foo]
   foo V1
1:   1  2
2:   2  3

As expected, the result only has entries for values of foo that have at least one row with bar > 0.

> mydt[bar > 3, min(baz), by = foo]
Empty data.table (0 rows and 2 cols): foo,V1
Warning message:
In min(baz) : no non-missing arguments to min; returning Inf

Consistent with the above behavior, an empty table (no rows match the criterion, so there are no values of foo to aggregate over). However, there is a warning message displayed.

Simpler, less-realistic example

> empty_dt <- data.table(foo = numeric(), bar = numeric())
> empty_dt[, min(bar)]
[1] Inf
Warning message:
In min(bar) : no non-missing arguments to min; returning Inf

Warning message is definitely appropriate here--Inf was actually returned.

> empty_dt[, min(bar), by = foo]
Empty data.table (0 rows and 2 cols): foo,V1
Warning message:
In min(bar) : no non-missing arguments to min; returning Inf

However, it is spurious here--the degenerate case of an empty table is returned, which does not include any Inf values.

Output of sessionInfo()

On my laptop

R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.8

loaded via a namespace (and not attached):
[1] compiler_4.2.2 tools_4.2.2

On our linux box

R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.8

loaded via a namespace (and not attached):
[1] compiler_3.6.0 tools_3.6.0

avimallu commented 1 year ago

AFAIK, that's an output from base R:

> min(c(NA, NA, NA), na.rm=TRUE)
[1] Inf
Warning message:
In min(c(NA, NA, NA), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf

I'm not sure how data.table could handle it. Arguably, if there are no rows, no operation should be run at all; but then columns that a query deliberately creates (say, for a later rbindlist operation) would not exist in the result, which makes this seem inescapable. For the short term, you should be able to suppress those warnings with

> suppressWarnings(min(c(NA, NA, NA), na.rm=TRUE))
[1] Inf

or other methods described here.
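If silencing every warning feels too broad, a narrower option is to muffle only this one message. The following is a minimal sketch in base R; the helper name muffle_empty_minmax is hypothetical, and the pattern match assumes an English locale, since R may translate warning text.

```r
# Hypothetical helper: evaluate `expr`, silencing only the
# "no non-missing arguments to min/max" warning and letting all
# other warnings through. Assumes English warning messages.
muffle_empty_minmax <- function(expr) {
  withCallingHandlers(
    expr,  # the promise is forced here, inside the handler's scope
    warning = function(w) {
      if (grepl("no non-missing arguments to (min|max)", conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

quiet <- muffle_empty_minmax(min(numeric(0)))   # no warning is shown
other <- tryCatch(
  muffle_empty_minmax(as.integer("x")),         # unrelated warning still surfaces
  warning = function(w) "still warned"
)
```

The same wrapper could be placed around a whole data.table query, so that unrelated warnings from the query remain visible.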

kennedymwavu commented 1 year ago

Hi @dvg-p4, I understand that the warning message may seem spurious, but I believe it is actually expected behavior from min() when there are no non-missing arguments.

As documented in ?min, min() returns Inf when applied to an empty set of numeric values to ensure transitivity, such as in the case of min(x1, min(x2)) == min(x1, x2).

In short, it is not an issue with {data.table}.

Here are some examples for clarification:

min()
#> [1] Inf
#> Warning message:
#> In min() : no non-missing arguments to min; returning Inf
min(numeric(0))
#> [1] Inf
#> Warning message:
#> In min(numeric(0)) : no non-missing arguments to min; returning Inf
min(NA, na.rm = TRUE)
#> [1] Inf
#> Warning message:
#> In min(NA, na.rm = TRUE) : no non-missing arguments to min; returning Inf
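
The transitivity point from ?min can be checked directly. This is only an illustrative sketch with made-up vectors x1 and x2; it shows why Inf is the natural result for an empty input: it acts as the identity element for min, so nesting calls does not change the answer.

```r
x1 <- c(4, 2, 5)
x2 <- numeric(0)                       # empty input

inner <- suppressWarnings(min(x2))     # Inf, with the warning silenced
nested <- min(x1, inner)               # min over x1 and the "empty" minimum
flat <- min(x1, x2)                    # min over all values at once, no warning

identical(nested, flat)                # TRUE
```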

If you really need to suppress the warnings during the "giant loop" you can use this suggestion:

old_opts <- options(warn = -1) # ignore warnings; options() returns the previous settings
min() # your code here
options(old_opts) # restore the previous 'warn' setting

I hope this explanation clarifies the behavior you are seeing.
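
One caveat with toggling options(warn = ...) by hand is that an error in the middle of the loop would leave warnings switched off. A minimal sketch of a safer variant follows; the function name run_without_warnings is hypothetical, and it relies only on base R's on.exit().

```r
# Hypothetical helper: run f() with warnings disabled, then restore the
# caller's 'warn' setting even if f() throws an error.
run_without_warnings <- function(f) {
  old <- options(warn = -1)            # options() returns the previous values
  on.exit(options(old), add = TRUE)    # restore on normal exit or error
  f()
}

before <- getOption("warn")
result <- run_without_warnings(function() min(numeric(0)))
after <- getOption("warn")             # same as `before`
```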

dvg-p4 commented 1 year ago

Thanks for the explanations! Looking into the source code, I think I understand a bit better what's going on: data.table intentionally runs the function once even on an empty table/subset, in order to create the correct output column structure: https://github.com/Rdatatable/data.table/blob/bbe41642a23d34b1cc491e3ff64d124c0b3ea3bd/src/dogroups.c#L173

So, if I'm getting the gist of this, always running the function at least once (which will necessarily produce a warning message like this for aggregation functions that warn on an empty input) is a feature, not a bug. It allows correctly-typed empty columns to be returned by something like this:

> mydt[bar > 3, .(avg = mean(foo), min = min(foo), n = length(foo), char = paste(foo, collapse = ",")), by = bar] |> str()
Classes ‘data.table’ and 'data.frame':  0 obs. of  5 variables:
 $ bar : num 
 $ avg : num 
 $ min : num 
 $ n   : int 
 $ char: chr 
 - attr(*, ".internal.selfref")=<externalptr> 
Warning message:
In min(foo) : no non-missing arguments to min; returning Inf

This seems reasonable and better than any alternatives I can think of, so I'll close this ticket.
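
For anyone who would rather avoid the warning at its source than suppress it, one possible approach is a small wrapper that never calls min() on an empty input. This is a sketch under assumptions, not something data.table provides: quiet_min is a hypothetical helper, and it deliberately changes the result on empty input from Inf to a typed NA. Since the evaluation on an empty subset is only used to determine the column type, the empty result should still come out correctly typed.

```r
# Hypothetical helper (not part of base R or data.table): like min(), but
# returns a typed NA instead of Inf-with-a-warning when no values remain.
quiet_min <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  if (length(x) == 0L) {
    return(x[NA_integer_])   # NA of the same type as x
  }
  min(x)
}

quiet_min(c(4, 2, 5))    # 2
quiet_min(numeric(0))    # NA (double), no warning
quiet_min(integer(0))    # NA (integer), no warning

# Usage with the table from the report, only if data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  mydt <- data.table(foo = c(1, 1, 2, 2, 2, 3),
                     bar = c(0, 1, 0, 1, 2, 0),
                     baz = c(4, 2, 2, 5, 3, 8))
  res <- mydt[bar > 3, .(lowest = quiet_min(baz)), by = foo]
  str(res)               # inspect the column types of the empty result
}
```

The trade-off is that non-empty groups whose values are all NA (with na.rm = TRUE) would also yield NA rather than Inf, which may or may not be what downstream code expects.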