bcgov / dipsim

Tool for simulating data from .parquet data sets
Apache License 2.0

dipsim

Simulating Data for BCGov Researchers


Lifecycle: Maturing | License: Apache 2.0

What is dipsim?

dipsim is an R package with tools for simulating data from data sets stored in .parquet format. It was developed primarily to help BCGov researchers working in the DIP create secondary data sets for use during testing and development.

Developing code for data science applications can be time-consuming, and testing that code on massive data sets can slow development significantly. dipsim helps by providing a quick way to create a smaller, simulated version of the actual data set.

Getting Help

To report bugs/issues/feature requests, please file an issue.

Getting Started

Prerequisites

Installation

You can install the development version of dipsim from GitHub with:

# install.packages("devtools")
devtools::install_github("bcgov/dipsim")

Example Workflow

This is a basic example of using dipsim to simulate a data set of 50 rows from data that has already been converted to parquet format. The source data is the penguins data set from the CRAN package palmerpenguins.

library(dipsim)
wd <- "/Users/brobert/Desktop"
##---------------------------------------- load routine --------------------------------------------------
parquet_fp <- search_parquet_data()

input_data <- make_input_data(support_fp = parquet_fp, resize = 100000, folder_location = wd)
##---------------------------------------- generate simulated data ---------------------------------------
simulated_data <- make_simulated_data(samp_size = 50, folder_location = wd, dataset_size = 1000,
                                      name = tools::file_path_sans_ext(basename(parquet_fp)))
##----------------------------------------- diagnostics --------------------------------------------------
cols <- compare_data(input_data, simulated_data)
vis_sim(input_data, simulated_data, cols)
##------------------------------------- clean up temp folder ---------------------------------------------
f <- glue::glue("{wd}/{tools::file_path_sans_ext(basename(parquet_fp))}")
unlink(f, recursive = TRUE)
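
The workflow above starts from a parquet file that already exists. As a minimal sketch of one way to produce such a file from the penguins data set, the snippet below uses the arrow package, which is an assumption here: this README does not say which tool was used for the conversion. The output location (a temporary directory) and the variable names are also only illustrative.

```r
# Sketch only: assumes the arrow and palmerpenguins packages are installed.
# install.packages(c("arrow", "palmerpenguins"))

# Illustrative output path; any writable folder would do.
parquet_path <- file.path(tempdir(), "penguins.parquet")

# Write the penguins data frame to a parquet file.
arrow::write_parquet(palmerpenguins::penguins, parquet_path)

# Read the file back to confirm the round trip preserved the shape.
penguins_check <- arrow::read_parquet(parquet_path)
dim(penguins_check)
```

The resulting file can then be used as the parquet input for the workflow.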

Documentation

After installing the package, you can view the vignettes by running browseVignettes("dipsim") in your R session.

Core team

Contributing

If you would like to contribute, please see our CONTRIBUTING guidelines.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

License

Copyright 2021 Province of British Columbia

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
