Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.55k stars 970 forks source link

Test case(s) for fastmean.c based on mutation testing results; Base R's mean vs data.table's fast mean #6176

Open Anirban166 opened 2 months ago

Anirban166 commented 2 months ago

I'm trying to create a test that will fail for a mutant of fastmean.c that involves an arithmetic operator change as discovered/mentioned in #6114 but I'm stuck on creating a viable test case for it as everything I've tried so far returns the same result for both cases: (or for both with and without this change)

s += t/n; // original
s += t*n; // mutant

Also, I'm wondering if my procedure in going about this so far is correct or not - Initially, I went about recompiling that C code via R CMD SHLIB fastmean.c (gcc with a path spec for the R header works too, but this was faster) after making changes to that file to get the shared object and then loaded that .so into R via dyn.load() with a wrapper to call the C function via .Call, given that there isn't an exported object in the data.table namespace for fast mean directly.

I then found a better/faster way to do it after doing a search to see it's usage in R code:

grep -r "fastmean" R/ tail -n 1
R//data.table.R:    return(call(".External",quote(Cfastmean),expr[[2L]], expr[[3L]]))  # faster than .Call

I switched to using .External to call that C routine via the "Cfastmean" symbol and that's how I'm currently running fastmean.c in my tests with REALSXP type inputs for that execution path, but I haven't found success in being able to create a suitable test for it that would fail for the mutant and pass for the original code. (I'm trying to come up with something similar to the other test case introduced based on mutation testing results - #6115, where the relation operator replacement triggers the condition with empty keys, leading to incorrect behaviour by executing code that assumes there is a single key)

Related testing - I was able to create test cases for which the output of base::mean did not match with the Cfastmean computed result on some checks though, like the one below:

library(data.table)

meanComparison <- function(x, ...) 
{
    baseR <- mean(x, ...)
    fastmean <- .External("Cfastmean", x, ...)
    cat("Results as computed by:\nBase R's mean:", baseR, "\ndata.table's fast mean:", fastmean, "\n")
    fifelse(identical(baseR, fastmean), "Passed", "Failed")
}

maxDoubleValue <- .Machine$double.xmax
testInputs <- c(maxDoubleValue, maxDoubleValue, -maxDoubleValue, -maxDoubleValue)
cat(meanComparison(testInputs), "the mean comparison test with maximum-value doubles as inputs.\n")
Results as computed by:
Base R's mean: 0 
data.table's fast mean: Inf 
Failed the mean comparison test with maximum-value doubles as inputs.

A corner case, but thought to share this since among the starting comments in fastmean.c I saw "We explicitly test that fastmean returns the same result as base::mean in test.data.table()." (which like this case, I assume might not be necessarily true for certain inputs or ditto given some possible inconsistency in handling such values wrt base R's mean implementation?)

tdhock commented 2 months ago

you should not be using R CMD SHLIB use R CMD INSTALL path/to/data.table

MichaelChirico commented 2 months ago

Hi @Anirban166, note that this function is only triggered under options(datatable.optimize=1). I added Rprintf("Here.\n"); in the line just before the suggested mutation and got it to print with the following:

options(datatable.optimize=1)
as.data.table(iris)[,mean(Sepal.Length, na.rm=TRUE),by=Species,verbose=TRUE]
tdhock commented 2 months ago

in that case (need to set optimize=1) the mutant is probably not interesting

MichaelChirico commented 2 months ago

in that case (need to set optimize=1) the mutant is probably not interesting

how do you figure? here's an example of it being used in the wild:

https://github.com/jessilyn/wearables_vitalsigns/blob/01fdd83ee5d74b1dd800420555152fa9a31310e0/misc/scripts/figure1.R#L163

tdhock commented 2 months ago

ok

Anirban166 commented 2 months ago

Hi @Anirban166, note that this function is only triggered under options(datatable.optimize=1).

Thanks for this insight! (quite helpful as it saved me some time in not pursuing the wrong cases when testing earlier)

I'm gonna close this issue for now as it looks like fast mean being called in this manner is indeed equivalent to base R's mean in terms of output as for every test case I've tried they both have the same results (which is good, but then that function is also not showing any change for the mutated version which is weird).

Anirban166 commented 2 months ago

@MichaelChirico I've got two questions for you here:

MichaelChirico commented 2 months ago

it looks like fast mean being called in this manner is indeed equivalent to base R's mean in terms of output as for every test case I've tried they both have the same results

That's great, but the mutation test tells us that some coverage is missing (indeed I found the same in print debugging -- those lines are not executed by our test suite). So I would instead close this by adding some of those tests as regression tests.

MichaelChirico commented 2 months ago

What is the main difference when I use Cfastmean via the .External function vs fast mean via use of mean within a data.table with datatable.optimize=1? (I assume the function being called under the hood is different? or do both approaches use fastmean.c in some way?)

IIUC they are the same:

https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/R/data.table.R#L1984-L1992 https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/R/data.table.R#L1798

If you were to test changes in data.table's C code or a file in /src here, what would be your procedure in doing so?

Feel like I'm missing something -- what's wrong with adding new test()s in inst/tests/tests.Rraw?

Anirban166 commented 2 months ago

What is the main difference when I use Cfastmean via the .External function vs fast mean via use of mean within a data.table with datatable.optimize=1? (I assume the function being called under the hood is different? or do both approaches use fastmean.c in some way?)

IIUC they are the same:

https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/R/data.table.R#L1984-L1992

But isn't that why you shared above that fast mean is only called when options(datatable.optimize=1)? (because it's not the same as simply using Cfastmean with an .External call?)

Also, take for example the test I shared to compare, or with values along the lines of .Machine$double.xmax:

library(data.table)
options(datatable.optimize=1)

meanComparison <- function(x, ...) 
{
  baseR <- mean(x, ...)
  fastmean <- .External("Cfastmean", x, ...)
  cat("Results as computed by:\nBase R's mean:", baseR, "\ndata.table's fast mean:", fastmean, "\n")
  fifelse(identical(baseR, fastmean), "Passed", "Failed")
}

testInputs <- list(
  c(rep(1e308, 1e3), rep(-1e308, 1e3)),
  c(rnorm(1e6, mean = 0, sd = 1e5), rep(1, 1e6), .Machine$double.xmax)
)

for(i in seq_along(testInputs))
{
  cat("Test case ", i, ":\n", sep = "")
  cat(meanComparison(testInputs[[i]], na.rm = TRUE), "\n")
}
Test case 1:
Results as computed by:
Base R's mean: -8.147083e+291 
data.table's fast mean: Inf 
Failed 
Test case 2:
Results as computed by:
Base R's mean: 8.988461e+301 
data.table's fast mean: -1.508633e+304 
Failed 

vs

library(data.table)
options(datatable.optimize=1)

meanComparison <- function(x, ...) 
{
  baseR <- mean(x, ...)
  dt <- as.data.table(x)
  fastmean <- dt[, mean(x, ...), verbose = TRUE]
  cat("Results as computed by:\nBase R's mean:", baseR, "\ndata.table's fast mean:", fastmean, "\n")
  fifelse(identical(baseR, fastmean), "Passed", "Failed")
}

testInputs <- list(
  c(rep(1e308, 1e3), rep(-1e308, 1e3)),
  c(rnorm(1e6, mean = 0, sd = 1e5), rep(1, 1e6), .Machine$double.xmax)
)

for(i in seq_along(testInputs))
{
  cat("Test case ", i, ":\n", sep = "")
  cat(meanComparison(testInputs[[i]], na.rm = TRUE), "\n")
}
Test case 1:
Detected that j uses these columns: [x]
Results as computed by:
Base R's mean: -8.147083e+291 
data.table's fast mean: -8.147083e+291 
Passed 
Test case 2:
Detected that j uses these columns: [x]
Results as computed by:
Base R's mean: 8.988461e+301 
data.table's fast mean: 8.988461e+301 
Passed 
Anirban166 commented 2 months ago

Feel like I'm missing something -- what's wrong with adding new test()s in inst/tests/tests.Rraw?

Nothing wrong with that..but I wanted to see my tests fail for the mutated C code before I add such a test case to inst/tests/tests.Rraw to verify I'm doing it correctly so I was trying things like recompiling the C file and installing the package time and again to run my tests after changing to the mutated version of the C code I'm testing, and then back to the original version to see if the there is a difference in the test results (and I do think there are better ways to do that so I was wondering what you would do so I learn or could go for the same)

MichaelChirico commented 2 months ago

But isn't that why you shared above that fast mean is only called when options(datatable.optimize=1)? (because it's not the same as simply using Cfastmean with an .External call?)

Sorry, I mean that typically DT[, mean(x), by=y] will run gmean, the GForce-optimized mean, since datatable.optimize=2 by default, whereas datatable.optimize=0 will just turn optimization off completely. So I'm emphasizing that to get the Cfastmean routine, you have to be intentional about which optimization level you set.

MichaelChirico commented 2 months ago

fastmean <- dt[, mean(x, ...), verbose = TRUE]

Without by=, Cfastmean won't be invoked.

joshhwuu commented 3 weeks ago

AFAICT, isn't t the sum of the differences from the mean which is always 0? Barring some imprecision when working with numbers with a lot of digits. Another thing I noticed was that t is a long double when added to s, but s is casted to double when assigned to ans and returned, therefore wouldn't we lose any sort of precision adjustment from s += t/n anyways?

MichaelChirico commented 3 weeks ago

Barring some imprecision when working with numbers with a lot of digits.

Pretty sure that's exactly what's happening :)

See also https://www.brodieg.com/2019/02/24/a-strategy-for-faster-group-statisitics/ and https://www.brodieg.com/2019/07/24/hydra-reformulate/. The former links the base R mean code which does basically the same: https://github.com/wch/r-source/blob/tags/R-3-5-2/src/main/summary.c#L485