ben519 / mltools

Exploratory and diagnostic machine learning tools for R

incorrect results from empirical_cdf? #18

Open mikoontz opened 4 years ago

mikoontz commented 4 years ago

Hello!

I'm getting unexpected results from mltools::empirical_cdf() that differ from a base R implementation as well as the Emcdf::emcdf() implementation. Am I missing something that would make the {mltools} implementation correct and the other ones wrong?

Here's a reproducible example:

library(data.table)
library(mltools)
library(Emcdf)
library(purrr)

set.seed(1235)
(data <- as.matrix(data.frame(x = rnorm(n = 5), y = rnorm(n = 5), z = rnorm(n = 5))))

The data look like this:

              x          y          z
[1,] -0.6979879 1.69819652 -0.9403661
[2,] -1.2848539 0.04784562  1.0849639
[3,]  0.9899590 0.65486241 -0.7501569
[4,]  0.1117758 1.36528367 -0.4216928
[5,]  0.1142077 0.40257296 -0.8231759

And the three ECDF implementations look like this:

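dt <- data.table(data)
# Set up the object that emcdf() evaluates against
obj <- initF(data, 2)

# Base R: for each row, the fraction of observations that are <= that row
# in all three coordinates
(base_R <- pmap_dbl(dt, .f = function(x, y, z) {
  mean(data[, "x"] <= x & data[, "y"] <= y & data[, "z"] <= z)
}))
# Emcdf: evaluate the ECDF at each row of data
(Emcdf_package <- emcdf(obj, data))
# mltools: evaluate the ECDF using each row of dt as an upper bound
(mltools_package <- empirical_cdf(dt, ubounds = dt)$CDF)

# Side-by-side comparison
(results <- data.frame(base_R, Emcdf_package, mltools_package))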

Which yields:

  base_R Emcdf_package mltools_package
1    0.2           0.2             0.2
2    0.2           0.2             0.2
3    0.4           0.4             0.2
4    0.2           0.2             0.2
5    0.2           0.2             0.2
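
The only disagreement is in row 3, and that row can be checked by hand: observations 3 and 5 are the only ones that are less than or equal to observation 3 in all of x, y, and z, so the joint ECDF there should be 2/5 = 0.4. A minimal base R sketch of that check follows. It assumes the values printed above (rounded as displayed), and `ecdf_at` is a hypothetical helper written for illustration, not a function from any of the packages.

```r
# Values copied from the matrix printed above (rounded as displayed).
data <- cbind(
  x = c(-0.6979879, -1.2848539, 0.9899590, 0.1117758, 0.1142077),
  y = c(1.69819652, 0.04784562, 0.65486241, 1.36528367, 0.40257296),
  z = c(-0.9403661, 1.0849639, -0.7501569, -0.4216928, -0.8231759)
)

# Hypothetical helper (not part of any package): fraction of rows of `data`
# that are <= `point` in every coordinate.
ecdf_at <- function(data, point) {
  # sweep() compares each column with the matching coordinate of `point`
  below <- sweep(data, 2, point, FUN = "<=")
  mean(rowSums(below) == ncol(data))
}

# Observations that are <= observation 3 in all three coordinates
dominated <- which(rowSums(sweep(data, 2, data[3, ], FUN = "<=")) == ncol(data))
dominated
# → 3 5

# Joint ECDF evaluated at every observation
ecdf_vals <- apply(data, 1, function(p) ecdf_at(data, p))
ecdf_vals
# → 0.2 0.2 0.4 0.2 0.2
```

This agrees with the base_R and Emcdf_package columns. Each evaluation point scans every observation, so the sketch is only meant for checking small cases.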

ben519 commented 4 years ago

This looks like a bug. I'll work on fixing it, hopefully today. Good catch!

mikoontz commented 4 years ago

awesome, thanks!

ben519 commented 4 years ago

It looks like my implementation for this function was pretty short-sighted. Can you use Emcdf::emcdf() or is it too slow for your needs?

mikoontz commented 4 years ago

Ah, bummer! Sorry to be the bearer of bad news :(

Unfortunately, I do think that Emcdf::emcdf() is too slow for my needs. I see that it is written in C++, but I can't quite work out which algorithm it actually uses under the hood (https://github.com/cran/Emcdf/blob/master/src/emcdf.cpp).

I'm currently looking at this new algorithm, which seems promising: https://link.springer.com/article/10.1007%2Fs00180-017-0771-x. However, I don't yet have the {data.table} skills to implement it in a way that takes advantage of that package. We'll see how far I get!