UBC-MDS / software-review-2021

Submission: prepropy-r (R) #45

Open BruhatMusunuru opened 3 years ago

BruhatMusunuru commented 3 years ago

Submitting Author: Bruhat Musunuru (BruhatM)
Other Authors: Pan Fan (pan1fan2), Chun Chieh (Jason) Chang (jachang0628)
Repository: https://github.com/UBC-MDS/prepropy-r
Version submitted:
Editor: TBD
Reviewers: TBD

Archive: TBD
Version accepted: TBD


prepropyr

An R package for data preprocessing

Overview
Data preprocessing and EDA are essential to any data science project. EDA provides insights into a dataset by visualizing and interpreting the information hidden in it. Data preprocessing is crucial for scaling features and handling missing values so that better models can be trained. In the real world, datasets contain a large number of features and observations, and it is unrealistic to expect a raw dataset to be perfect and ready for model building. The package aims to help users perform data imputation, feature scaling and basic exploratory data analysis for machine learning modeling.

A vignette for this package can be found here.

Features
The package is under development; it will include the following functions:

  • Imputer: Identify and handle missing values in a dataframe
      • A function that will impute missing data given a chosen method (mean, median, or most frequent)
      • Can work on both numerical and categorical data
  • Feature Scaler: Performs numerical feature scaling
      • Scale numerical features to facilitate seamless building of machine learning pipelines
      • Provide functionality to pick from multiple scaling algorithms
  • EDA: Extract info and visualize selected features in a dataframe
      • Separate data into train/test datasets
      • Report number of missing values
      • Report feature types (numerical vs. categorical)
      • Report class imbalance
      • Investigate the correlation matrix

EDA and data preprocessing are crucial steps to take before diving into any machine learning model. Open-source libraries such as MICE, tidyverse, and ggplot2 provide functions to perform data scaling, data imputation, descriptive data analysis, and graphing. We are not reinventing these functions; instead, we want to integrate functions across the packages and give users a quick overview of their data. We hope the package can speed up the data analysis process for our users.

Usage
eda()

The eda() function helps to quickly explore the data by showing a pairplot and some summary statistics for a given dataframe.

library(prepropyr)

df <- data.frame(num1 = c(8.5, 8, 9.2, 9.1, 9.4),
                 num2 = c(0.88, 0.93, 0.95, 0.92, 0.91),
                 num3 = c(0.46, 0.78, 0.66, 0.69, 0.52),
                 num4 = c(0.082, 0.078, 0.082, 0.085, 0.066),
                 cat1 = c("Good", "Okay", "Excellent", "Terrible", "Good"),
                 target = c(2, 2, 3, 1, 3))
After calling the eda() function, we get the following output; see the docstring for more available outputs.

result <- eda(df,"target")
result$nb_num_features
4
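
For readers who want to cross-check the reported count, the same number can be reproduced in base R. This is only a sketch of the idea, not the package's implementation. It assumes that eda() counts the numeric columns other than the target, which the output above suggests but the README does not state.

```r
# Same example data as in the eda() usage above
df <- data.frame(num1 = c(8.5, 8, 9.2, 9.1, 9.4),
                 num2 = c(0.88, 0.93, 0.95, 0.92, 0.91),
                 num3 = c(0.46, 0.78, 0.66, 0.69, 0.52),
                 num4 = c(0.082, 0.078, 0.082, 0.085, 0.066),
                 cat1 = c("Good", "Okay", "Excellent", "Terrible", "Good"),
                 target = c(2, 2, 3, 1, 3))

# Assumption: the target column is excluded from the feature count
features <- df[, setdiff(names(df), "target"), drop = FALSE]

# Count numeric feature columns and missing values per column
nb_num <- sum(vapply(features, is.numeric, logical(1)))
nb_missing <- colSums(is.na(features))

print(nb_num)
print(nb_missing)
```
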
imputation()

The imputation() function imputes missing data in a tibble/dataframe using the chosen method (mean, median).

test_df <- data.frame('a' = c(1,NA,3), 'b' = c(5,6,NA), 'c' = c(NA,1,10))
test_df_imputed <- imputation(test_df, test_df, 'mean')
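
To make the 'mean' method concrete, the following base R sketch fills each NA with the column mean computed from a reference data frame. The helper name impute_mean is hypothetical, and the reading of the two data frame arguments (data to fill, data to compute the statistics from) is an assumption; the package's imputation() may behave differently.

```r
# Hypothetical helper, not part of prepropyr: mean imputation in base R
impute_mean <- function(target_df, reference_df) {
  for (col in names(target_df)) {
    # Column mean from the reference data, ignoring missing values
    col_mean <- mean(reference_df[[col]], na.rm = TRUE)
    missing <- is.na(target_df[[col]])
    target_df[[col]][missing] <- col_mean
  }
  target_df
}

test_df <- data.frame(a = c(1, NA, 3), b = c(5, 6, NA), c = c(NA, 1, 10))
imputed <- impute_mean(test_df, test_df)
print(imputed)
```
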
scaler()

This function scales the numerical features in a data.frame using the chosen scaling method (standardization or minmax scaling).

X_train <- data.frame('a' = c(1,2,3), 'b' = c(5,6,3), 'c' = c(2,1,10))
X_test <- data.frame('a' = c(1,5,3), 'b' = c(5,6,5), 'c' = c(2,5,10))
X_Valid <- data.frame('a' = c(5,5,3), 'b' = c(5,6,5), 'c' = c(2,5,10))
scaled_df <- scaler(X_train, X_Valid, X_test, scaler_type='standardization')
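
As a rough illustration of what standardization does here, the sketch below centers and scales each split with the mean and standard deviation of the training set, using only base R. Fitting on the training data and reusing those statistics for the other splits is an assumption about scaler(), not something the README confirms, and the helper name standardize is hypothetical.

```r
X_train <- data.frame(a = c(1, 2, 3), b = c(5, 6, 3), c = c(2, 1, 10))
X_test  <- data.frame(a = c(1, 5, 3), b = c(5, 6, 5), c = c(2, 5, 10))

# Statistics come from the training set only (assumed behaviour)
train_mean <- vapply(X_train, mean, numeric(1))
train_sd   <- vapply(X_train, sd, numeric(1))

# Hypothetical helper: apply the same centering and scaling to any split
standardize <- function(df) {
  as.data.frame(scale(df, center = train_mean, scale = train_sd))
}

X_train_scaled <- standardize(X_train)
X_test_scaled  <- standardize(X_test)

print(X_train_scaled)
print(X_test_scaled)
```
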
Installation
You can install the released version of prepropyr as follows:

devtools::install_github("UBC-MDS/prepropy-r")

Scope

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

ttimbers commented 3 years ago

Assigning @charlessuresh & @jachang0628 as reviewers.

charlessuresh commented 3 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Estimated hours spent reviewing: 4


Review Comments

Dear Bruhat, Pan and Chun,

Congrats on completing the package! I can see how your package would be useful to many.
I was able to successfully install your package and run all your functions.

Here are some of my comments:

Function call using scaler_type='standardization':

X_train <- data.frame('a' = 1, 'b' = 5)
X_test <- data.frame('a' = 1, 'b' = 5)
X_Valid <- data.frame('a' = 1, 'b' = 5)
scaled_df <- scaler(X_train, X_Valid, X_test, scaler_type='standardization')

Error:

Std. deviations could not be computed for: a, b

Function call using scaler_type='minmax':

scaled_df <- scaler(X_train, X_Valid, X_test, scaler_type='minmax')

Error:

No variation for for: a, b
STATS is longer than the extent of 'dim(x)[MARGIN]'
STATS is longer than the extent of 'dim(x)[MARGIN]'
STATS is longer than the extent of 'dim(x)[MARGIN]'
STATS is longer than the extent of 'dim(x)[MARGIN]'

Maybe it makes sense to raise an exception with an appropriate error message when a dataframe with only a single row is passed?
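
One possible shape for such a check is sketched below in base R. The helper check_min_rows and its message wording are hypothetical and not part of prepropyr; the sketch only illustrates failing early with a readable message instead of surfacing the downstream warnings shown above.

```r
# Hypothetical guard, not part of prepropyr
check_min_rows <- function(df, name, min_rows = 2L) {
  if (!is.data.frame(df)) {
    stop(sprintf("`%s` must be a data.frame.", name), call. = FALSE)
  }
  if (nrow(df) < min_rows) {
    stop(sprintf("`%s` has %d row(s); at least %d are needed for scaling.",
                 name, nrow(df), as.integer(min_rows)), call. = FALSE)
  }
  invisible(TRUE)
}

X_train <- data.frame(a = 1, b = 5)

# sd() of a single value is NA, so standardization cannot work here
print(sd(X_train$a))

# The guard turns the situation into a clear error message
msg <- tryCatch(check_min_rows(X_train, "X_train"),
                error = function(e) conditionMessage(e))
print(msg)
```

A check like this probably matters most for the training data, since a single-row test or validation set can still be scaled with statistics computed elsewhere.
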

Great work on this package! It was my pleasure to write this review. Let me know if there are any questions.

Thanks, Charles