fstpackage / synthetic

R package for dataset generation and benchmarking
GNU Affero General Public License v3.0

Method table_definition should allow for a real dataset that can be used for sampling #35

MarcusKlik closed this issue 4 years ago

MarcusKlik commented 4 years ago

For example:

library(data.table)  # provides fread()
dt <- fread("some_data.csv")
generator <- table_definition(dt, id = "some data sample")

This would automatically generate a table definition that stores the original dataset and samples from it to generate new datasets.
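To make the idea concrete, here is a minimal base R sketch of sampling from a stored dataset. The helper `sample_table()` and its arguments are hypothetical and are not part of the synthetic package; it assumes rows are drawn with replacement, which may differ from how the package actually samples.

```r
# Hypothetical helper, NOT part of the synthetic package: keeps the template
# data and draws whole rows from it with replacement (an assumption).
sample_table <- function(template, n, columns = names(template)) {
  # work on a plain data.frame so indexing behaves the same for
  # data.table and tibble inputs
  template <- as.data.frame(template)
  stopifnot(nrow(template) > 0, all(columns %in% names(template)))

  # sample row indices with replacement, so n may exceed nrow(template)
  idx <- sample.int(nrow(template), size = n, replace = TRUE)

  result <- template[idx, columns, drop = FALSE]
  rownames(result) <- NULL  # drop the duplicated row names left by indexing
  result
}

# draw 1000 rows from the built-in iris data, all columns
synth <- sample_table(iris, 1000)

# same, restricted to two columns
subset_synth <- sample_table(iris, 1000, c("Sepal.Length", "Species"))
```

Because whole rows are drawn, relationships between columns in the original data carry over to the sample. Sampling each column independently would be a different design and would lose them.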

MarcusKlik commented 4 years ago

You can now use an existing table as a template:

library(synthetic)

# use iris dataset as template
synth_iris <- synthetic_table(iris, id = "synthetic iris")

# generate a synthetic 'iris' dataset with 1e6 rows
generate(synth_iris, 1e6)
#> # A tibble: 1,000,000 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          6.3         2.8          5.1         1.5 virginica 
#>  2          6.7         3.3          5.7         2.1 virginica 
#>  3          4.7         3.2          1.6         0.2 setosa    
#>  4          5.8         2.8          5.1         2.4 virginica 
#>  5          5.1         3.5          1.4         0.2 setosa    
#>  6          4.9         3.1          1.5         0.1 setosa    
#>  7          6           2.2          4           1   versicolor
#>  8          6.4         3.2          4.5         1.5 versicolor
#>  9          4.6         3.4          1.4         0.3 setosa    
#> 10          4.9         3.1          1.5         0.1 setosa    
#> # ... with 999,990 more rows

# each call to generate creates a unique dataset
generate(synth_iris, 1e6)
#> # A tibble: 1,000,000 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          6.7         3.3          5.7         2.5 virginica 
#>  2          4.8         3            1.4         0.3 setosa    
#>  3          5.2         3.5          1.5         0.2 setosa    
#>  4          6.4         2.8          5.6         2.2 virginica 
#>  5          5           3.4          1.5         0.2 setosa    
#>  6          5.2         4.1          1.5         0.1 setosa    
#>  7          7.7         3            6.1         2.3 virginica 
#>  8          6.5         3            5.8         2.2 virginica 
#>  9          7           3.2          4.7         1.4 versicolor
#> 10          4.9         3.6          1.4         0.1 setosa    
#> # ... with 999,990 more rows

# column selection
generate(synth_iris, 1e6, c("Sepal.Length", "Species"))
#> # A tibble: 1,000,000 x 2
#>    Sepal.Length Species   
#>           <dbl> <fct>     
#>  1          5.5 versicolor
#>  2          6   versicolor
#>  3          5   setosa    
#>  4          5.7 setosa    
#>  5          5.8 versicolor
#>  6          5.7 versicolor
#>  7          5.5 setosa    
#>  8          5.5 versicolor
#>  9          4.6 setosa    
#> 10          6.3 virginica 
#> # ... with 999,990 more rows