Submission: prepropy-r (R)

prepropyr

An R package for data preprocessing

Overview

Data preprocessing and exploratory data analysis (EDA) are essential to any data science project. EDA provides insights into a dataset by visualizing and interpreting the information hidden in it. Data preprocessing is crucial for scaling features and handling missing values so that better models can be trained. In the real world, datasets contain a large number of features and observations, and it is unrealistic to expect a raw dataset to be perfect and ready for model building. The package aims to help users perform data imputation, feature scaling, and basic exploratory data analysis for machine learning modeling.

A vignette for this package can be found here.

Features

The package is under development; it will include the following functions:

- Imputer: Identify and handle missing values in a dataframe
  - A function that will impute missing data given chosen method (mean, median, or most frequent)
  - Can work on both numerical and categorical data
- Feature Scaler: Performs numerical feature scaling
  - Scale numerical features to facilitate seamless building of machine learning pipelines
  - Provide functionality to pick from multiple scaling algorithms
- EDA: Extract info and visualize selected features in a dataframe
  - Separate data into train/test dataset
  - Report number of missing data
  - Report feature types (numerical vs. categorical)
  - Report class imbalance
  - Investigate the correlation matrix

EDA and data preprocessing are crucial steps to take before diving into any machine learning model. Open-source libraries such as MICE, tidyverse, and ggplot2 provide functions for data scaling, data imputation, descriptive data analysis, and graphing. We are not reinventing these functions; we want to integrate functions across the packages and give users a quick overview of their data. We hope the package can speed up the data analysis process for our users.
Usage

eda()

The eda() function helps to quickly explore the data by showing a pairplot and some summary statistics for a given dataframe.

library(prepropyr)
df <- data.frame(num1 = c(8.5, 8, 9.2, 9.1, 9.4),
                 num2 = c(0.88, 0.93, 0.95, 0.92, 0.91),
                 num3 = c(0.46, 0.78, 0.66, 0.69, 0.52),
                 num4 = c(0.082, 0.078, 0.082, 0.085, 0.066),
                 cat1 = c("Good", "Okay", "Excellent", "Terrible", "Good"),
                 target = c(2, 2, 3, 1, 3))

After calling the eda() function, we get the following output; see the docstring for more available outputs.

result <- eda(df, "target")
result$nb_num_features
4

imputation()

The imputation() function will impute missing data in a tibble/dataframe given the chosen method (mean, median).

test_df <- data.frame('a' = c(1, NA, 3), 'b' = c(5, 6, NA), 'c' = c(NA, 1, 10))
test_df_imputed <- imputation(test_df, test_df, 'mean')

scaler()

This function scales numerical features in a data.frame based on the scaling requirement (standardization, minmax scaling).

X_train <- data.frame('a' = c(1, 2, 3), 'b' = c(5, 6, 3), 'c' = c(2, 1, 10))
X_test <- data.frame('a' = c(1, 5, 3), 'b' = c(5, 6, 5), 'c' = c(2, 5, 10))
X_Valid <- data.frame('a' = c(5, 5, 3), 'b' = c(5, 6, 5), 'c' = c(2, 5, 10))
scaled_df <- scaler(X_train, X_Valid, X_test, scaler_type='standardization')

Installation

You can install the released version of prepropyr as follows:

devtools::install_github("UBC-MDS/prepropy-r")

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

[ ] data retrieval
[ ] data extraction
[x] data munging
[ ] data deposition
[ ] workflow automation
[ ] version control
[ ] citation management and bibliometrics
[ ] scientific software wrappers
[ ] field and lab reproducibility tools
[ ] database software bindings
[ ] geospatial data
[ ] text analysis

Explain how and why the package falls under these categories (briefly, 1-2 sentences): Our package contains functions to impute missing values and scale features. These most likely fall under data munging. We also have a function to visualize selected features in a dataframe, which falls under data munging and visualization.

Who is the target audience and what are scientific applications of this package? The target audience for our package is a beginner user who is trying out regression and wants to simplify the pre-processing before implementing regression models.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? Packages such as ggplot2 and caret offer similar functionality, but our package streamlines the process for simplicity and beginner-friendliness.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Technical checks

Confirm each of the following by checking the box.

This package:

[x] does not violate the Terms of Service of any service it interacts with.

[x] has a CRAN and OSI accepted license.

[x] contains a README with instructions for installing the development version.

[x] includes documentation with examples for all functions, created with roxygen2.

[x] contains a vignette with examples of its essential functions and uses.

[x] has a test suite.

[x] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?

[ ] Do you intend for this package to go on Bioconductor?

[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Package Review

Briefly describe any working relationship you have (had) with the package authors. The package authors and I are classmates in the UBC MDS-V 2020-21 program.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work.

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples (that run successfully locally) for all exported functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Estimated hours spent reviewing: 4

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Dear Bruhat, Pan and Chun,

Congrats on completing the package! I can see how your package would be useful to many.
I was able to successfully install your package and run all your functions.

Here are some of my comments:

Imputation methods in imputation: The functional claims made for this function are inconsistent about which imputation methods are available:
1. As per function documentation: This function will impute missing data in a tibble/dataframe given the chosen method (mean, median)
2. As per the README and your package website: A function that will impute missing data given chosen method (mean, median, or most frequent)
3. I see that you've added appropriate code (and error messages) for imputing with a constant value as well.
fit_data vs fill_data in imputation: I'm not very clear on the distinction between fit_data and fill_data. I could not find documentation that defines these two parameters.
constant imputation method in imputation: When I pass method='constant' along with a value for the parameter constant, the function works fine. However, when I try to use the constant imputation method without passing any value to the parameter constant, the function throws this error:
```
Error in x[[v]][thisvar] <- if (N > 1L) value[n + seq_len(nv)] else value : replacement has length zero
```
Maybe it makes sense to raise an exception with an appropriate error message for this?
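For illustration, an explicit argument check along these lines could run before any imputation happens. This is only a minimal sketch, not the package's code: check_constant() is a hypothetical helper, the argument names method and constant are taken from the calls described above, and a default of NULL for constant is an assumption.

```r
# Hypothetical helper (not part of prepropyr): validate `constant` up front.
# Assumes `constant` is NULL when the caller does not supply it.
check_constant <- function(method, constant = NULL) {
  if (identical(method, "constant")) {
    if (is.null(constant) || length(constant) != 1 || is.na(constant)) {
      stop("`constant` must be a single non-missing value when method = 'constant'.",
           call. = FALSE)
    }
  }
  invisible(TRUE)
}

# Calling with method = "constant" but no value now fails with a readable message
msg <- tryCatch(
  check_constant("constant"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Failing early like this would keep the low-level replacement error from surfacing to the user.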
scaler function: Passing a dataframe with a single row throws errors with either scaler type:

Function call using scaler_type='standardization':

X_train <- data.frame('a' = 1, 'b' = 5)
X_test <- data.frame('a' = 1, 'b' = 5)
X_Valid <- data.frame('a' = 1, 'b' = 5)
scaled_df <- scaler(X_train, X_Valid, X_test, scaler_type='standardization')

Error:

Std. deviations could not be computed for: a, b

Function call using scaler_type='minmax':

scaled_df <- scaler(X_train, X_Valid, X_test, scaler_type='minmax')

Error:

No variation for for: a, b
STATS is longer than the extent of 'dim(x)[MARGIN]'
STATS is longer than the extent of 'dim(x)[MARGIN]'
STATS is longer than the extent of 'dim(x)[MARGIN]'
STATS is longer than the extent of 'dim(x)[MARGIN]'

Maybe it makes sense to raise an exception with an appropriate error message when a dataframe with a single row is passed?
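A guard of the following kind is one possible way to do that. It is a sketch under assumptions: check_min_rows() is a hypothetical helper, and I apply it only to the training data because that is where the scaling statistics would normally be computed; whether the validation and test sets need the same check is your call.

```r
# Hypothetical helper (not part of prepropyr): require at least `min_rows` rows.
# With a single row, the standard deviation is NA and the min-max range is zero.
check_min_rows <- function(df, name = "X_train", min_rows = 2) {
  if (!is.data.frame(df)) {
    stop("`", name, "` must be a data.frame.", call. = FALSE)
  }
  if (nrow(df) < min_rows) {
    stop("`", name, "` has ", nrow(df), " row(s); at least ", min_rows,
         " are required to compute scaling statistics.", call. = FALSE)
  }
  invisible(TRUE)
}

# The single-row input from the example above now yields one clear message
X_train <- data.frame(a = 1, b = 5)
scaler_msg <- tryCatch(
  check_min_rows(X_train),
  error = function(e) conditionMessage(e)
)
print(scaler_msg)
```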

eda function: As per the README and your package website, this function will: Separate data into train/test dataset. Looking at the function code and the returned values, I don't think the function currently does this.
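If the split is meant to stay in scope, a base R version could be as small as the sketch below. split_train_test(), test_frac and seed are hypothetical names chosen for illustration and do not exist in the package; the README does not say how the split should be done, so a simple random split is assumed.

```r
# Hypothetical helper (not part of prepropyr): random train/test split.
# Note: set.seed() is called inside the function for reproducibility,
# which also resets the caller's random number stream.
split_train_test <- function(df, test_frac = 0.2, seed = 123) {
  stopifnot(is.data.frame(df), test_frac > 0, test_frac < 1)
  set.seed(seed)
  n_test <- max(1, round(nrow(df) * test_frac))
  test_idx <- sample(seq_len(nrow(df)), size = n_test)
  list(train = df[-test_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

df <- data.frame(num1 = c(8.5, 8, 9.2, 9.1, 9.4), target = c(2, 2, 3, 1, 3))
parts <- split_train_test(df, test_frac = 0.2)
print(nrow(parts$train))
print(nrow(parts$test))
```

Alternatively, the README and website could simply drop the claim if splitting is not planned for eda.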

Great work on this package! It was my pleasure to write this review. Let me know if there are any questions.

Thanks, Charles
