fstpackage / synthetic

R package for dataset generation and benchmarking
GNU Affero General Public License v3.0

Method table_definition() can create a model of a source dataset #37

Open MarcusKlik opened 4 years ago

MarcusKlik commented 4 years ago

The model can then be used to generate new sample data. Correlations between columns can be retained:

library(data.table)  # provides fread()
dt <- fread("some_data.csv")
generator <- table_definition(dt, id = "some data sample", model = TRUE, correlate = TRUE)

MarcusKlik commented 4 years ago

Let's start with a per-column model without correlations.

MarcusKlik commented 4 years ago

Numerical data can be fitted with a polynomial approximation of its distribution. We first have to check whether the column is better treated as a factor (that is, when the number of distinct values is much smaller than the column length).
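A rough sketch of what such a per-column model could look like. Everything here is an assumption for illustration, not existing package code: the helper names fit_column_model() and sample_column_model(), the max_level_ratio threshold of 5%, and the choice to fit the polynomial to the empirical quantile function (so new values can be drawn by evaluating it at uniform random numbers).

```r
# Hypothetical helpers, not part of the package API.
fit_column_model <- function(x, max_level_ratio = 0.05, degree = 5) {
  x <- x[!is.na(x)]  # missing values are ignored in this sketch
  n_levels <- length(unique(x))

  # Factor check first: few distinct values relative to the column length.
  if (!is.numeric(x) || n_levels <= max_level_ratio * length(x)) {
    freq <- table(x)
    # Note: levels are stored as character strings, also for numeric input.
    return(list(type = "factor",
                levels = names(freq),
                prob = as.numeric(freq) / sum(freq)))
  }

  # Fit a raw polynomial to the empirical quantile function (inverse CDF).
  p <- ppoints(length(x))
  q <- sort(x)
  fit <- lm(q ~ poly(p, degree, raw = TRUE))
  list(type = "numeric", coef = unname(coef(fit)), range = range(x))
}

sample_column_model <- function(model, n) {
  if (model$type == "factor") {
    return(sample(model$levels, n, replace = TRUE, prob = model$prob))
  }
  # Inverse transform sampling: evaluate the polynomial at uniform draws.
  u <- runif(n)
  powers <- outer(u, seq_along(model$coef) - 1, `^`)
  values <- as.vector(powers %*% model$coef)
  # Clamp to the observed range, since the fit can overshoot in the tails.
  pmin(pmax(values, model$range[1]), model$range[2])
}

# Usage: model one numeric column and draw new values from it.
num_model <- fit_column_model(rnorm(1000))
new_values <- sample_column_model(num_model, 10)
```

Each column is modelled independently here, so correlations between columns are not retained. A low-degree polynomial will also smooth over multimodal or heavy-tailed distributions, so the degree would probably need tuning per column.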