Create a vignette showing overview and example usage of package

Introduction to Rmleda

This vignette describes the SupervisedData(), dftype(), autoimpute_na(), and dfscaling() functions in the Rmleda package. Rmleda supports preliminary exploratory data analysis (EDA) of a dataset by handling common data preparation and wrangling tasks: data splitting, exploration, imputation, and scaling. These tasks are common in supervised machine learning, but the functions may be useful in other kinds of projects as well.

Split a dataframe into train and test sets with `SupervisedData()`

TODO


some code

Return the type of columns and variables for the input dataframe with `dftype()`

The dftype() function returns the type of each column and variable in the input data frame. If there are non-numeric columns, it also returns their unique values and the number of unique values.
The function returns a list with two elements. The first ($summary) is a summary of each column; the second ($unique_df) is a data frame containing the name of each non-numeric column, its unique values, and the number of unique values.

#> dftype(df)
$summary
 chocolate_brand        price         type          
 Length:5           Min.   :3.0   Length:5          
 Class :character   1st Qu.:3.0   Class :character  
 Mode  :character   Median :3.5   Mode  :character  
                    Mean   :4.0                     
                    3rd Qu.:4.5                     
                    Max.   :6.0                     
                    NA's   :1                          

$unique_df
      column_name                      unique_values num_unique_values
1 chocolate_brand Lindt,Rakhat,Richart,not available                 4
2            type                         dark,white                 2
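
To make the structure of this result concrete, the following is a minimal base R sketch that produces comparable output for the toy data. It is not the package's implementation: the data frame is reconstructed from the printed examples (character columns are assumed), and the names `col_summary` and `non_numeric` are illustrative.

```r
# Toy data frame reconstructed from the printed examples
# (assumption: the text columns are stored as character vectors)
df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart", "not available"),
  price = c(3, NA, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark"),
  stringsAsFactors = FALSE
)

# Column-wise summary, comparable to the $summary element
col_summary <- summary(df)

# Names of the non-numeric columns
non_numeric <- names(df)[!vapply(df, is.numeric, logical(1))]

# One row per non-numeric column, comparable to the $unique_df element;
# unique() keeps values in order of first appearance
unique_df <- data.frame(
  column_name = non_numeric,
  unique_values = vapply(
    non_numeric,
    function(col) paste(unique(df[[col]]), collapse = ","),
    character(1),
    USE.NAMES = FALSE
  ),
  num_unique_values = vapply(
    non_numeric,
    function(col) length(unique(df[[col]])),
    integer(1),
    USE.NAMES = FALSE
  ),
  stringsAsFactors = FALSE
)

print(col_summary)
print(unique_df)
```

Note that the manually entered "not available" is counted as an ordinary unique value here, just as in the dftype() output above.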

Identify and impute missing values in a given dataframe with `autoimpute_na()`

The autoimpute_na() function identifies and imputes missing values in a given dataframe based on the types of the columns, i.e. the function fills missing values with the mean for numeric columns and the most frequent value for non-numeric columns. Additionally, the autoimpute_na() function detects some common non-standard missing values manually entered by users (e.g., “not available”, “n/a”, “na”, “-”) while identifying and imputing missing data. The output of the autoimpute_na() function will be a dataframe with imputed values.

For example, in the toy dataset below we impute the missing values in the chocolate_brand and price columns. The missing value in chocolate_brand was entered manually as not available, while the one in price is a standard NA.

#> df
  chocolate_brand price  type
1           Lindt     3  dark
2          Rakhat    NA  dark
3           Lindt     4 white
4         Richart     6 white
5   not available     3  dark

The output of autoimpute_na() is given below. The not available entry is replaced with Lindt, which is the most frequent value in the chocolate_brand column, while NA becomes 4, which is the mean of the observed price values.

#> autoimpute_na(df)
  chocolate_brand price  type
1           Lindt     3  dark
2          Rakhat     4  dark
3           Lindt     4 white
4         Richart     6 white
5           Lindt     3  dark
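
The imputation rule described in this section (mean for numeric columns, most frequent value for non-numeric columns) can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: `impute_sketch()` and `na_strings` are hypothetical names, and the list of non-standard missing markers is taken only from the examples mentioned in the text.

```r
# Toy data frame reconstructed from the printed examples
df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart", "not available"),
  price = c(3, NA, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark"),
  stringsAsFactors = FALSE
)

# Strings treated as missing (assumption: only the examples named in the text)
na_strings <- c("not available", "n/a", "na", "-")

# Hypothetical helper illustrating the rule; not part of Rmleda
impute_sketch <- function(data, na_strings) {
  for (col in names(data)) {
    values <- data[[col]]
    if (is.numeric(values)) {
      # Numeric column: replace NA with the mean of the observed values
      values[is.na(values)] <- mean(values, na.rm = TRUE)
    } else {
      # Non-numeric column: mark non-standard entries as NA first
      values[tolower(trimws(values)) %in% na_strings] <- NA
      # table() ignores NA by default; which.max() picks the first mode on ties
      counts <- table(values)
      values[is.na(values)] <- names(counts)[which.max(counts)]
    }
    data[[col]] <- values
  }
  data
}

imputed <- impute_sketch(df, na_strings)
print(imputed)
```

On the toy data this sketch fills the same two cells as the autoimpute_na() output above: Lindt in row 5 and 4 in row 2.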

Apply scaling and centering to the numerical features in a given input dataframe with `dfscaling()`

The dfscaling() function applies standard scaling and centering to the numeric features of the input dataframe. After the transformation, each numeric column has a mean of 0 and a standard deviation of 1. Columns with zero variance are excluded before the transformation is applied.

The function takes two arguments: the input dataframe and the name of the target column (type in the example below).

#> dfscaling(df, type)
  chocolate_brand  price type 
  <fct>            <dbl> <fct>
1 Lindt           -0.707 dark 
2 Rakhat          NA     dark 
3 Lindt            0     white
4 Richart          1.41  white
5 not available   -0.707 dark
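
The price values above follow from the usual standardization formula, (x - mean) / sd, computed over the observed values. The base R sketch below reproduces them. It is not the package's implementation: `scale_sketch()` is a hypothetical helper, its `target` argument is passed as a string, and it assumes zero-variance columns are simply left unchanged.

```r
# Toy data frame reconstructed from the printed examples
df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart", "not available"),
  price = c(3, NA, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark"),
  stringsAsFactors = FALSE
)

# Hypothetical helper illustrating centering and scaling; not part of Rmleda
scale_sketch <- function(data, target) {
  for (col in setdiff(names(data), target)) {
    values <- data[[col]]
    if (!is.numeric(values)) next
    col_sd <- sd(values, na.rm = TRUE)
    # Assumption: zero-variance (or all-NA) columns are left as they are
    if (is.na(col_sd) || col_sd == 0) next
    # Center on the mean and divide by the standard deviation; NA stays NA
    data[[col]] <- (values - mean(values, na.rm = TRUE)) / col_sd
  }
  data
}

scaled <- scale_sketch(df, target = "type")
print(round(scaled$price, 3))
```

Here the observed prices 3, 4, 6, 3 have mean 4 and sample standard deviation sqrt(2), so 3 maps to about -0.707, 4 to 0, and 6 to about 1.41, while the missing price remains NA.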
