UBC-MDS / Rmleda

R package that helps with preliminary EDA for supervised machine learning tasks.
https://ubc-mds.github.io/Rmleda/

Rmleda

Rmleda is an R package with functions that help with preliminary exploratory data analysis (EDA) for a given dataset. The package contains functions and classes for common data preparation and wrangling tasks: data splitting, exploration, imputation, and scaling. These tasks were identified as commonly performed in supervised machine learning settings, but they may provide value in other project types as well.

The Rmleda package will include the following classes/functions: dftype, autoimpute_na, dfscaling, and supervised_data.

The Rmleda package is intended to help with EDA for supervised machine learning tasks, and other existing packages contain some similar functionality. For example, tidymodels, MICE, Amelia, missForest, Hmisc, and mi provide users with functions to analyze attribute types, detect and fill missing values, and scale input data. Rmleda intends to augment the existing functionality of these packages with some additional features and tie it all into a single package.

The autoimpute_na() function will additionally detect some common non-standard missing values manually entered by users (e.g., “not available”, “n/a”, “na”, “-”) while identifying and imputing missing data. The output of the autoimpute_na() function will be a dataframe with imputed values.
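To make the idea of non-standard missing values concrete, here is a minimal base R sketch. It is not Rmleda's implementation: the helper names normalize_missing and impute_simple are hypothetical, and the imputation rule (median for numeric columns, most frequent value otherwise) is an assumption chosen only for illustration.

```r
# Hypothetical helper, not part of Rmleda: replace hand-typed missing-value
# markers with real NA values, then re-infer each text column's type.
normalize_missing <- function(df, tokens = c("not available", "n/a", "na", "-")) {
  for (col in names(df)) {
    if (is.character(df[[col]])) {
      # Compare case-insensitively and ignore surrounding whitespace
      cleaned <- tolower(trimws(df[[col]]))
      df[[col]][cleaned %in% tokens] <- NA
      # A numeric column polluted by "n/a" was read as text; convert it back
      df[[col]] <- utils::type.convert(df[[col]], as.is = TRUE)
    }
  }
  df
}

# Hypothetical helper, not part of Rmleda: fill NA values with the median
# (numeric columns) or the most frequent value (character or factor columns).
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    # Nothing to fill, or nothing to learn a fill value from
    if (!any(missing) || all(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      counts <- table(df[[col]])
      df[[col]][missing] <- names(counts)[which.max(counts)]
    }
  }
  df
}

# Toy data with hand-typed missing markers
raw <- data.frame(
  age  = c("23", "n/a", "31", "27"),
  city = c("Vancouver", "-", "Vancouver", "Toronto"),
  stringsAsFactors = FALSE
)
clean <- impute_simple(normalize_missing(raw))
```

Separating detection from imputation keeps each step easy to inspect: the intermediate data frame shows exactly which cells were treated as missing before any value is filled in.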

In supervised machine learning, data splitting is often a multi-step process: the dataset is split into train and test portions, and each portion is then split into X (features) and y (target class) subsets. Typically, the user has to create and keep track of a separate variable for each of these subsets. supervised_data in Rmleda is a function that builds upon the initial_split function from the tidymodels package. It gives the user quick access to appropriately named attributes for each of these subsets, so they can be easily used in subsequent steps of the machine learning pipeline.
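The multi-step split described above can be sketched in base R as follows. This is an illustration only, not Rmleda's supervised_data: the function name split_supervised, its prop and seed arguments, and the element names of the returned list are all hypothetical, and base R's sample() stands in for initial_split so the example runs without extra packages.

```r
# Hypothetical helper, not Rmleda's supervised_data(): split a data frame
# into train/test rows, then into feature (x) and target (y) subsets.
split_supervised <- function(df, xcols, ycol, prop = 0.75, seed = NULL) {
  stopifnot(all(c(xcols, ycol) %in% names(df)), prop > 0, prop < 1)
  n_train <- floor(prop * nrow(df))
  # Both portions need at least one row
  stopifnot(n_train >= 1, n_train < nrow(df))
  # Optional seed for a reproducible split (this sets the global RNG state)
  if (!is.null(seed)) set.seed(seed)
  train_idx <- sample(nrow(df), size = n_train)
  train <- df[train_idx, , drop = FALSE]
  test  <- df[-train_idx, , drop = FALSE]
  # One named element per subset, so nothing has to be tracked by hand
  list(
    train_df = train,
    test_df  = test,
    x_train  = train[, xcols, drop = FALSE],
    y_train  = train[, ycol, drop = FALSE],
    x_test   = test[, xcols, drop = FALSE],
    y_test   = test[, ycol, drop = FALSE]
  )
}

# Usage with the built-in iris dataset
parts <- split_supervised(
  iris,
  xcols = c("Sepal.Length", "Sepal.Width"),
  ycol  = "Species",
  seed  = 123
)
```

Returning a single named list is what removes the bookkeeping: every subset is reachable as parts$x_train, parts$y_test, and so on, instead of living in six separately named variables.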

Installation

If you do not have the devtools package, you can install it via CRAN with:

install.packages("devtools")

Then, install Rmleda from GitHub as follows:

devtools::install_github("UBC-MDS/Rmleda")

Lastly, load the package:

library(Rmleda)

Example

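# df is a data frame you have already loaded; 'feature1', 'feature2', and
# 'target' stand in for your own column names
Rmleda::dftype(df)
Rmleda::autoimpute_na(df)
Rmleda::dfscaling(df, target)
super_data <- Rmleda::supervised_data(df, xcols = c('feature1', 'feature2'), ycol = c('target'))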