UBC-MDS / Rmleda

R package that helps with preliminary eda for supervised machine learning tasks.
https://ubc-mds.github.io/Rmleda/

Create a vignette showing overview and example usage of package #37

Closed yaz-saleh closed 3 years ago

Saule-Atymtayeva commented 3 years ago

Introduction to Rmleda

This vignette describes how to use the SupervisedData(), dftype(), autoimpute_na(), and dfscaling() functions in the Rmleda package. Rmleda helps with preliminary EDA of a dataset by handling common data preparation and wrangling tasks: data splitting, exploration, imputation, and scaling. These tasks were identified as common in supervised machine learning settings, but the functions may be useful in other types of projects as well.

Split a dataframe into train and test sets with SupervisedData()

TODO


some code

Return the type of columns and variables for the input dataframe with dftype()

The dftype() function returns the type of columns and variables in the input data frame. If there are non-numeric columns, it also returns their unique values and the number of unique values in each.
The function returns a list with two elements. The first ($summary) is a summary of each column, and the second ($unique_df) is a data frame that lists each non-numeric column's name, its unique values, and the count of those values.

#> dftype(df)
$summary
 chocolate_brand        price         type          
 Length:5           Min.   :3.0   Length:5          
 Class :character   1st Qu.:3.0   Class :character  
 Mode  :character   Median :3.5   Mode  :character  
                    Mean   :4.0                     
                    3rd Qu.:4.5                     
                    Max.   :6.0                     
                    NA's   :1                          

$unique_df
      column_name                      unique_values num_unique_values
1 chocolate_brand Lindt,Rakhat,Richart,not available                 4
2            type                         dark,white                 2
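To make the shape of this result more concrete, here is a minimal base R sketch that builds a similar list. It is not the package's implementation: the helper name describe_columns() is hypothetical, and it assumes unique values are listed in order of first appearance and joined with commas, which matches the output shown above.

```r
# Toy data matching the example used in this vignette.
df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart", "not available"),
  price = c(3, NA, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (NOT Rmleda's dftype()): summarise every column and
# list the unique values of the non-numeric ones.
describe_columns <- function(data) {
  non_numeric <- names(data)[!vapply(data, is.numeric, logical(1))]

  unique_df <- data.frame(
    column_name = non_numeric,
    # Assumption: values are kept in order of first appearance.
    unique_values = vapply(
      non_numeric,
      function(col) paste(unique(as.character(data[[col]])), collapse = ","),
      character(1),
      USE.NAMES = FALSE
    ),
    num_unique_values = vapply(
      non_numeric,
      function(col) length(unique(data[[col]])),
      integer(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )

  list(summary = summary(data), unique_df = unique_df)
}

result <- describe_columns(df)
result$unique_df
```

Note that "not available" is counted as an ordinary value here, just as in the output above, because nothing at this stage treats it as missing.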

Identify and impute missing values in a given dataframe with autoimpute_na()

The autoimpute_na() function identifies and imputes missing values in a given dataframe based on the types of the columns, i.e. the function fills missing values with the mean for numeric columns and the most frequent value for non-numeric columns. Additionally, the autoimpute_na() function detects some common non-standard missing values manually entered by users (e.g., “not available”, “n/a”, “na”, “-”) while identifying and imputing missing data. The output of the autoimpute_na() function will be a dataframe with imputed values.

For example, the toy dataset below has missing values in the chocolate_brand and price columns. The missing value in chocolate_brand was entered manually as not available, while the one in price is a standard NA.

#> df
  chocolate_brand price  type
1           Lindt     3  dark
2          Rakhat    NA  dark
3           Lindt     4 white
4         Richart     6 white
5   not available     3  dark

The output of autoimpute_na() is given below. The not available entry is replaced with Lindt, which is the most frequent value in the chocolate_brand column, while NA becomes 4, which is the mean of the price column values.

#> autoimpute_na(df)
  chocolate_brand price  type
1           Lindt     3  dark
2          Rakhat     4  dark
3           Lindt     4 white
4         Richart     6 white
5           Lindt     3  dark
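The imputation rule described above (mean for numeric columns, most frequent value otherwise) can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: impute_simple() and its na_strings argument are hypothetical names, and the list of non-standard missing markers is taken from the examples mentioned in this section.

```r
# Toy data matching the example used in this vignette.
df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart", "not available"),
  price = c(3, NA, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (NOT Rmleda's autoimpute_na()).
# na_strings is an assumed list of manually entered missing-value markers.
impute_simple <- function(data,
                          na_strings = c("not available", "n/a", "na", "-")) {
  for (col in names(data)) {
    values <- data[[col]]
    if (is.numeric(values)) {
      # Numeric column: fill NA with the mean of the observed values.
      values[is.na(values)] <- mean(values, na.rm = TRUE)
    } else {
      # Non-numeric column: treat the marker strings as NA first.
      values <- as.character(values)
      values[tolower(trimws(values)) %in% na_strings] <- NA
      counts <- table(values)  # table() ignores NA by default
      if (length(counts) > 0) {
        # Fill NA with the most frequent value (first one wins on ties).
        values[is.na(values)] <- names(counts)[which.max(counts)]
      }
    }
    data[[col]] <- values
  }
  data
}

imputed <- impute_simple(df)
imputed
```

On the toy dataset this sketch fills the second price with 4 and the fifth chocolate_brand with Lindt, in line with the output shown above. How the real function breaks ties between equally frequent values is not stated here, so the tie rule in the sketch is only one possible choice.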

Apply scaling and centering to the numerical features in a given input dataframe with dfscaling()

The dfscaling() function applies standard scaling and centering to the numeric features of an input dataframe. After the transformation, each numeric column has a mean of 0 and a standard deviation of 1. Columns with zero variance are excluded before the transformation is applied.

The function takes two arguments: the input dataframe and the name of the target column (type in the example below).

#> dfscaling(df, type)
  chocolate_brand  price type 
  <fct>            <dbl> <fct>
1 Lindt           -0.707 dark 
2 Rakhat          NA     dark 
3 Lindt            0     white
4 Richart          1.41  white
5 not available   -0.707 dark
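The scaled price values above can be reproduced by hand with base R, which may help when checking results. The sketch below is not the package's implementation: standardize_features() is a hypothetical helper, it takes the target name as a string, and it assumes that the target is left unscaled, that NA values are ignored when computing the mean and standard deviation, and that zero-variance columns are simply left unchanged.

```r
# Toy data matching the example used in this vignette.
df <- data.frame(
  chocolate_brand = c("Lindt", "Rakhat", "Lindt", "Richart", "not available"),
  price = c(3, NA, 4, 6, 3),
  type = c("dark", "dark", "white", "white", "dark"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (NOT Rmleda's dfscaling()).
standardize_features <- function(data, target) {
  # Numeric columns other than the target are treated as features.
  is_feature <- vapply(data, is.numeric, logical(1)) & names(data) != target
  for (col in names(data)[is_feature]) {
    values <- data[[col]]
    spread <- sd(values, na.rm = TRUE)
    # Assumption: zero-variance (or all-NA) columns are left untouched.
    if (is.na(spread) || spread == 0) next
    # Centre on the mean, then divide by the sample standard deviation.
    data[[col]] <- (values - mean(values, na.rm = TRUE)) / spread
  }
  data
}

scaled <- standardize_features(df, "type")
round(scaled$price, 3)
```

For price, the observed values 3, 4, 6, 3 have mean 4 and sample standard deviation sqrt(2), so 3 maps to about -0.707, 4 to 0, and 6 to about 1.41, while the NA stays NA.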