ben519 / mltools

Exploratory and diagnostic machine learning tools for R

too many repeats breaks the empirical_cdf algorithm? #19

Open mikoontz opened 4 years ago

mikoontz commented 4 years ago

I get an error when running the mltools::empirical_cdf() function on a wild-caught data set. I think it is related to repeated values in the columns, though I can't always reproduce the error when there are some repeats.

Is this a fundamental limitation of the algorithm? Or is there something else more insidious that might be at work?

Here's a reproducible example (and the location in the function code where it seems to break):

library(data.table)
library(mltools)

set.seed(123)
data <- as.matrix(data.frame(x = c(rep(0, 4), 1), y = c(rep(0, 2), 4, 2, 1), z = rnorm(n = 5)))
dt <- data.table(data)

The data look like this:

> dt
   x y           z
1: 0 0 -0.56047565
2: 0 0 -0.23017749
3: 0 4  1.55870831
4: 0 2  0.07050839
5: 1 1  0.12928774
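To make the repeats explicit, here is a small sketch that counts how many entries in each discrete column duplicate an earlier value. It rebuilds only the x and y columns from the example above and uses base R's duplicated(); the name `repeats` is just an illustrative variable.

```r
# Rebuild the two discrete columns from the reproducible example
xy <- data.frame(x = c(rep(0, 4), 1), y = c(rep(0, 2), 4, 2, 1))

# For each column, count the entries that repeat an earlier value
repeats <- sapply(xy, function(col) sum(duplicated(col)))
repeats
```

For these data, x has three repeated entries (the value 0 occurs four times) and y has one.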

Computing the ECDF with empirical_cdf() looks like this:

(mltools_package <- empirical_cdf(dt, ubounds = dt)$CDF)

And the error (from {data.table}) looks like this:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 17 rows; more than 10 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

This appears to break during the rolling join (https://github.com/ben519/mltools/blob/e952f3b06d035502c16d090a64f54759f0dc94ab/R/empirical_cdf.R#L99).

Copy-pasting your function code into my script and (naively!) adding "allow.cartesian=TRUE" to the rolling join line gives CDF values greater than 1, so it doesn't seem to be a super simple fix.

Is this just a limitation of this particular algorithm?
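For reference, here is a minimal brute-force sketch of what I would expect the multivariate ECDF to return on these data. It assumes the usual definition (the share of rows that are <= the bound in every column) and does not use the package's rolling-join approach. `brute_force_ecdf` is a hypothetical helper written for this issue, not part of mltools.

```r
# Same data as in the reproducible example above
set.seed(123)
df <- data.frame(x = c(rep(0, 4), 1), y = c(rep(0, 2), 4, 2, 1), z = rnorm(n = 5))

# Hypothetical helper (not part of mltools): for each row of `ubounds`,
# return the share of rows in `x` that are <= the bound in every column.
brute_force_ecdf <- function(x, ubounds) {
  x <- as.matrix(x)
  ubounds <- as.matrix(ubounds)
  vapply(seq_len(nrow(ubounds)), function(i) {
    # t(x) has one column per observation; the bound vector is recycled
    # down each column, so each coordinate meets its own bound
    within_bound <- colSums(t(x) <= ubounds[i, ]) == ncol(x)
    mean(within_bound)
  }, numeric(1))
}

brute_force_ecdf(df, df)
```

On these five rows this gives 0.2, 0.4, 0.8, 0.6, 0.6, all within [0, 1]. A fixed version of the function could be compared against these values.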