I get an error when running the mltools::empirical_cdf() function on a wild-caught data set. I think it is related to repeated values in the columns, though I can't always reproduce the error when repeats are present.
Is this a fundamental limitation of the algorithm? Or is there something else more insidious that might be at work?
Here's a reproducible example (and the location in the function code where it seems to break):
library(data.table)
library(mltools)
set.seed(123)
data <- as.matrix(data.frame(x = c(rep(0, 4), 1), y = c(rep(0, 2), 4, 2, 1), z = rnorm(n = 5)))
dt <- data.table(data)
The data look like this:
> dt
   x y           z
1: 0 0 -0.56047565
2: 0 0 -0.23017749
3: 0 4  1.55870831
4: 0 2  0.07050839
5: 1 1  0.12928774
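To make the "repeat values" suspicion concrete, here is a small base-R check of where the ties are in the x and y columns of the example. It only inspects the data; it assumes nothing about how empirical_cdf() works internally, and the variable names are just for illustration.

```r
# Rebuild the two discrete columns from the example above
# (z is continuous, so ties there are not expected).
x <- c(rep(0, 4), 1)
y <- c(rep(0, 2), 4, 2, 1)

# How often each value occurs within a column.
x_counts <- table(x)
y_counts <- table(y)
print(x_counts)
print(y_counts)

# Rows whose (x, y) combination repeats an earlier row.
dup_rows <- which(duplicated(cbind(x, y)))
print(dup_rows)
```

So x has one heavily repeated value, y has one repeated value, and one (x, y) pair occurs twice.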
And the error (from {data.table}) looks like this:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 17 rows; more than 10 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
Copy-pasting your function code into my script and (naively!) adding allow.cartesian=TRUE to the rolling join line gives CDF values greater than 1, so it doesn't seem to be a super simple fix.
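For comparison, here is a brute-force sketch of what I'd expect the multivariate empirical CDF to be at each observed row: the fraction of rows that are <= the bound in every column at once. brute_force_ecdf() is a hypothetical helper I wrote for this check, not part of mltools, and it is O(n^2), so it is only meant as a reference on small data. It doesn't explain the join error, but it does show that ties by themselves don't push the values above 1.

```r
set.seed(123)
data <- as.matrix(data.frame(x = c(rep(0, 4), 1), y = c(rep(0, 2), 4, 2, 1), z = rnorm(n = 5)))

# Hypothetical helper (not from mltools): fraction of rows of `obs` that are
# <= `bound` in every column simultaneously.
brute_force_ecdf <- function(obs, bound) {
  below <- sweep(obs, 2, bound, FUN = "<=")  # n x p logical matrix
  mean(rowSums(below) == ncol(obs))          # share of rows with all columns <= bound
}

# Evaluate the ECDF at each observed row, and at the column-wise maxima.
cdf_at_rows <- apply(data, 1, function(b) brute_force_ecdf(data, b))
cdf_at_max <- brute_force_ecdf(data, apply(data, 2, max))

print(cdf_at_rows)
print(cdf_at_max)
```

Every value stays within [0, 1] here, which is what I'd want any fix to the rolling join to reproduce.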
Is this just a limitation of this particular algorithm?
The error appears to be raised during the rolling join (https://github.com/ben519/mltools/blob/e952f3b06d035502c16d090a64f54759f0dc94ab/R/empirical_cdf.R#L99).