fstpackage / synthetic

R package for dataset generation and benchmarking
GNU Affero General Public License v3.0

Numeric simulation takes levels into account #40

Open MarcusKlik opened 4 years ago

MarcusKlik commented 4 years ago

A numeric vector can have a limited number of distinct values ('levels') that are replicated many times:

# 10 'levels'
vec_levels <- rnorm(10, 10, 2)

# 1000 values
vec <- sample(vec_levels, 1000, replace = TRUE)

When do we regard the vector as a factor with numeric levels? And how do we determine whether these levels have a distribution of their own?
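One possible way to answer the first question is a simple ratio test: treat the vector as level-like when it has few distinct values relative to its length. The sketch below is only an illustration; `is_level_like` and its thresholds `max_ratio` and `max_levels` are hypothetical and not part of synthetic.

```r
# Hypothetical helper (not part of synthetic): flags a numeric vector as
# "level-like" when few distinct values account for many observations.
is_level_like <- function(x, max_ratio = 0.05, max_levels = 100L) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) return(FALSE)
  n_levels <- length(unique(x))
  # Both thresholds are arbitrary assumptions chosen for illustration
  n_levels <= max_levels && n_levels / length(x) <= max_ratio
}

set.seed(42)
vec_levels <- rnorm(10, 10, 2)
vec <- sample(vec_levels, 1000, replace = TRUE)

is_level_like(vec)          # few distinct values, heavily replicated
is_level_like(rnorm(1000))  # essentially every value is distinct
```

A fixed ratio is the crudest possible criterion; it says nothing yet about whether the levels themselves follow a distribution.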

MarcusKlik commented 4 years ago

synthetic uses spline-based numerical simulation with 100 control points. So if we have fewer than 100 numerical levels (but many values), it's more efficient to just store these levels.
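As a rough sketch of what "just storing the levels" could look like: keep the distinct values together with their relative frequencies and resample from them. The `template` list below is a hypothetical structure for illustration, not synthetic's actual template format.

```r
# Hypothetical sketch: store the distinct values and their relative
# frequencies instead of fitting a spline through control points.
set.seed(1)
vec <- sample(rnorm(10, 10, 2), 1000, replace = TRUE)

lv <- sort(unique(vec))
counts <- tabulate(match(vec, lv), nbins = length(lv))
template <- list(levels = lv, prob = counts / length(vec))

# Simulate a new vector of any length from the stored levels.
# Indexing via sample.int() also works when there is only one level.
idx <- sample.int(length(template$levels), 5000, replace = TRUE,
                  prob = template$prob)
simulated <- template$levels[idx]
```

With at most 10 levels, the template holds at most 20 numbers, compared with the 100 control points mentioned above.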

On the other hand, the number of levels might depend on the vector size. In the example above we could say that the number of levels is 1 percent of the number of values, and that these levels have a distribution of their own...
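This second view can be sketched as a two-stage generator: first draw the levels from their own distribution, then replicate them. The function `simulate_leveled`, its `level_fraction` parameter, and the choice of a normal distribution for the levels are all assumptions made for illustration.

```r
# Hypothetical generator: the number of levels scales with the vector size,
# and the levels themselves are drawn from a normal distribution (assumed).
simulate_leveled <- function(n, level_fraction = 0.01,
                             level_mean = 10, level_sd = 2) {
  n_levels <- max(1L, as.integer(round(n * level_fraction)))
  lv <- rnorm(n_levels, level_mean, level_sd)
  lv[sample.int(n_levels, n, replace = TRUE)]
}

set.seed(7)
length(unique(simulate_leveled(1000)))    # at most 10 distinct values
length(unique(simulate_leveled(100000)))  # at most 1000 distinct values
```

Here only three numbers (fraction, mean, standard deviation) describe the vector, regardless of its size.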

Two entirely different ways to approach the same vector...