fstpackage / synthetic

R package for dataset generation and benchmarking
GNU Affero General Public License v3.0

Random number generator with spline based distribution #39

Closed MarcusKlik closed 4 years ago

MarcusKlik commented 4 years ago

To simulate any distribution, we need a random generator that can take a custom distribution and generate random numbers from that:

MarcusKlik commented 4 years ago

The C generator rand() produces only 2^15 (32768) unique numbers on Windows, so it doesn't satisfy the second criterion mentioned above.
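To get a feel for what a 15-bit generator means in practice, here is a rough emulation in R. It does not call C's rand(); it only assumes a generator that returns integers in 0..32767 (RAND_MAX on Windows) and scales them to [0, 1):

```r
set.seed(1)

# Emulation only (assumption): integers 0..32767, as rand() would return on Windows
draws <- sample.int(2^15, 1e6L, replace = TRUE) - 1L

# Scale to [0, 1) the way a naive rand()-based uniform generator might
u <- draws / 2^15

# However many values are drawn, the count of distinct values cannot exceed 2^15
length(unique(u))
```

So a million draws from such a generator contain at most 32768 distinct values, far fewer than runif() delivers.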

Two other generators are:

Comparing them with (temporary) implementations in synthetic:

microbenchmark::microbenchmark(
  synthetic:::random_dbl_std(1e6L, 0.11),
  synthetic:::random_dbl_boost(1e6L, 0.11),
  runif(1e6L)
)
#> Unit: milliseconds
#>                                          expr      min        lq      mean
#>    synthetic:::random_dbl_std(1000000L, 0.11) 133.9503 137.84755 140.35906
#>  synthetic:::random_dbl_boost(1000000L, 0.11)  33.6345  34.45120  39.88716
#>                               runif(1000000L)  26.3969  27.18895  29.05062
#>     median        uq      max neval
#>  139.56200 142.16455 151.6480   100
#>   35.68815  37.58805 384.3284   100
#>   28.20410  29.54860  43.2920   100

So the std version is relatively slow (median of about 140 ms, roughly five times that of runif()). The boost version is almost on par with R's runif() (median of about 36 ms versus 28 ms), so pretty fast.

Uniqueness of values:

res <- data.table::rbindlist(
  lapply(1:100, function(x) {
    list(
      std = length(unique(synthetic:::random_dbl_std(1e6L, runif(1)))),
      boost = length(unique(synthetic:::random_dbl_boost(1e6L, runif(1)))),
      runif = length(unique(runif(1e6L)))
    )
  })
)

res[, .(
  std = median(std),
  boost = median(boost),
  runif = median(runif))
]
#>      std  boost  runif
#> 1: 1e+06 999883 999883

Now we know why the std variant is slower: it spends CPU time guaranteeing uniqueness of values :-).

For our purposes, the boost version is excellent: comparable performance to runif() and also comparable uniqueness of draws.
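The counts above can be sanity-checked with a birthday-problem estimate. This sketch assumes, without having checked the implementations, that the boost version and runif() effectively draw from 2^32 equally likely values and that the std version has 53 bits of resolution. expected_unique() is a helper made up for this comment:

```r
# Expected number of distinct values among n draws from N equally likely values:
#   N * (1 - (1 - 1/N)^n)
# written with log1p()/expm1() to stay accurate for very large N
expected_unique <- function(n, N) {
  -N * expm1(n * log1p(-1 / N))
}

n <- 1e6

# assumption: 32-bit resolution (boost version, runif())
expected_unique(n, 2^32)

# assumption: 53-bit resolution (std version)
expected_unique(n, 2^53)
```

With 32 bits the estimate is about 116 duplicates per million draws, which is in line with the 999883 observed above. With 53 bits duplicates become so unlikely that all 1e6 draws are expected to be distinct.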

There are advantages in using a C++ generator:

MarcusKlik commented 4 years ago

Usage:

# define some control points for the distribution
control_points <- sort(rnorm(10, 2, 1) * rnorm(10, 10, 1))

# generate random numbers following above distribution
x <- rspline(6000, control_points, 0.1)

# display the control points and resulting distributions
library(ggplot2)

y <- data.frame(X = 0:(length(x) - 1) / (length(x) - 1), Y = sort(x))
z <- data.frame(X = 0:(length(control_points) - 1) / (length(control_points) - 1), Y = control_points)

ggplot(y) +
  geom_line(aes(X, Y)) +
  geom_point(aes(X, Y), data = z, color = "red", size = 3)

gives:

[image: plot of the sorted rspline() draws as a line, with the control points as red dots]

MarcusKlik commented 4 years ago

The points are the control points specified by the user to define the custom distribution. These points have to be in ascending order at the moment. The line is drawn from the 6000 points generated with rspline().

That's pretty close 😸 !
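For readers wondering how a generator like this can work in principle, below is a minimal sketch in plain R. It is not the code behind rspline(). It assumes the control points are quantiles at equally spaced probabilities between 0 and 1 (which matches how they are plotted above), uses a monotone spline from stats::splinefun() as the quantile function, and feeds uniform draws through it. rspline_sketch() is a hypothetical name, and the third argument of rspline() is left out because its meaning isn't described here:

```r
# Hypothetical helper for illustration only, not the synthetic implementation
rspline_sketch <- function(n, control_points) {
  stopifnot(length(control_points) >= 2, !is.unsorted(control_points))

  # assumption: control points sit at equally spaced probabilities in [0, 1]
  p <- seq(0, 1, length.out = length(control_points))

  # monotone cubic spline through (p, control_points), used as a quantile function
  quantile_fun <- stats::splinefun(p, control_points, method = "hyman")

  # inverse transform sampling: map uniform draws through the quantile function
  quantile_fun(stats::runif(n))
}

set.seed(42)
control_points <- sort(rnorm(10, 2, 1) * rnorm(10, 10, 1))
x <- rspline_sketch(6000, control_points)

# the draws stay within the range spanned by the control points
range(x)
range(control_points)
```

The monotone ("hyman") spline is a deliberate choice in this sketch: an ordinary cubic spline could overshoot between control points, which would make the interpolated quantile function non-monotone.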