UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: laundRy (R) #19

Open arunmarria opened 4 years ago

arunmarria commented 4 years ago

name: laundRy
about: Use this template to submit software for review

Submitting Author: Arun Marria (@amarria1), Cari Gostic (@cgostic), Alexander Hinton (@zanderhinton), Aman Kumar (@amank90)
Repository: https://github.com/UBC-MDS/laundRy
Version submitted: v1.0.1
Editor: Varada (@kvarada)
Reviewer 1: Tani (@TBarasch)
Reviewer 2: Brayden (@braydentang1)
Archive: TBD
Version accepted: TBD


DESCRIPTION

Package: laundRy
Title: laundRy - preprocessing package for dataframes
Version: 0.1
Authors@R: person("Arun", "Marria", email = "arun.marria@gmail.com",
                  role = c("aut", "cre"))
           person("Cari", "Gostic", email = "cari.gostic@gmail.com",
                  role = c("aut", "cre"))
           person("Alexander", "Hinton", email = "alexander_hinton@ubc.alumni.ca",
                  role = c("aut", "cre"))
           person("Aman", "Garg", email = "gargkaman7@gmail.com",
                  role = c("aut", "cre"))
Description: The laundRy package performs many standard preprocessing techniques 
    for dataframes, before use in statistical analysis and machine learning. 
    The package functionality includes categorizing column types, handling 
    missing data and imputation, transforming/standardizing columns and 
    feature selection. The laundRy package aims to remove much of the 
    grunt work in the typical data science workflow, allowing the analyst 
    maximum time and energy to devote to modelling!
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.0.2
Suggests: 
    testthat,
    knitr,
    rmarkdown,
    covr
Imports: 
    dplyr,
    tidyr,
    rlang,
    magrittr,
    caret,
    stats,
    DT,
    e1071
VignetteBuilder: knitr
URL: https://github.com/UBC-MDS/laundRy
BugReports: https://github.com/UBC-MDS/laundRy/issues

Scope

Preprocessing is a necessary part of data analysis, and laundRy equips data scientists to perform it in a seamless fashion. The package's functions carry out the preprocessing automatically based on inputs from the user, which helps automate the analysis workflow.

Additionally, since preprocessing involves performing multiple transformations on the data, the package also falls into the data munging category.

laundRy is made for data scientists, or anyone applying statistical methods or machine learning algorithms to their data. It transforms a dataset into a format that is ready to be passed into a machine learning or statistical model, with all NAs imputed, categorical columns encoded, numerical columns scaled, and important features identified.
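A rough sketch of the intended workflow is shown below. It is illustrative only: function and argument names follow the examples elsewhere in this thread, and the fill_missing step is left as a comment since its exact signature is not shown here.

```r
library(laundRy)

df <- mtcars
train <- df[1:30, ]
test <- df[31:32, ]

# 1. Detect which columns are numeric vs. categorical
col_list <- categorize(train, max_cat = 10)

# 2. Impute missing values (signature not shown here, so elided)
# filled <- fill_missing(...)

# 3. Scale numeric columns and encode categorical columns
processed <- column_transformer(
  x_train = train,
  x_test = test,
  column_list = col_list,
  num_trans = "standard_scaling",
  cat_trans = "onehot_encoding"
)

# 4. Keep only the most informative features
top_features <- feature_selection(
  X = train[, !names(train) %in% "mpg"],
  y = train$mpg,
  mode = "regression",
  n_features = 5
)
```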

There are individual packages that handle individual preprocessing steps, but no single package that does all of these things in one go (automatically). Related packages: caret.
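For comparison, the closest existing functionality is caret's preProcess(), which covers centering, scaling, and imputation but must be combined with other tools for column typing and feature selection. A minimal example:

```r
library(caret)

# caret::preProcess handles scaling/centering (and imputation via other
# methods), but column typing and feature selection are separate steps.
pp <- preProcess(mtcars, method = c("center", "scale"))
scaled <- predict(pp, mtcars)
head(scaled)
```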

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

JOSS Options
- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:
- (*Do not submit your package separately to JOSS*)

MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

braydentang1 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing:

4-6


Review Comments

Hey! I thought your package was overall well thought out and indeed, it makes the data preprocessing step more streamlined and less of a chore. It also felt intuitive to use - the categorize function is clearly well integrated with the other functions, and its role is immediately obvious to the user. Here are some of the things that I came across while running your package and reading your documentation. First, right off the bat, some general comments:

Now for some things I noticed:

categorize:

column_transformer:

data <- mtcars
column_list_mt <- categorize(data, max_cat = 10)

train <- data[1:30, ]
test <- data[31:32, ]

my_processed_frames <- column_transformer(
  x_train = train,
  x_test = test,
  column_list = column_list_mt,
  num_trans = "standard_scaling",
  cat_trans = "onehot_encoding"
)

It would appear that the standard scaling works as intended, but the onehot_encoding does not? My categorical columns just pass through as-is; they still remain in the output untouched. Please correct me if I am not using this correctly!
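For reference, here is the kind of one-hot output I would expect, illustrated with caret::dummyVars (caret is already in your Imports). This is just an illustration of the expected result, not a claim about how column_transformer is implemented:

```r
library(caret)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# One 0/1 indicator column per factor level, replacing the original column
dummies <- dummyVars(~ cyl, data = df)
head(predict(dummies, newdata = df))
```

Separately, a small formatting nitpick - a block like this: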

  for (vect in names(column_list)){
    for ( col in column_list[[vect]]){
      if(!is.element(col, names(x_train)))
        stop("Column names in the named list must be present in dataframe")
    }
  }

could be written as:

  for (vect in names(column_list)) {
    for (col in column_list[[vect]]) {
      if (!is.element(col, names(x_train))) 
        stop("Column names in the named list must be present in dataframe")
    }
  }

This is pretty nitpicky and probably really annoying/tedious to deal with, so feel free to ignore.

feature_selection:

For regression tasks, a list actually isn't returned - a factor vector is.

data <- mtcars

fs_reg <- feature_selection(
  X = data[, !names(data) %in% c("mpg")], 
  y = data$mpg,
  mode = "regression",
  n_features = 5
)

...this returns a factor vector with 10 levels, not a list. Not too sure if this is intended behavior here. Also, if I run this:

data <- mtcars

fs_class <- feature_selection(
  X = data[, !names(data) %in% c("cyl")], 
  y = as.factor(data$cyl),
  mode = "classification",
  n_features = 5
)

...I get an error:

Error in { : task 1 failed - "'names' attribute [3] must be the same length as the vector [2]"

Finally, if I run this:

data <- mtcars

fs_reg <- feature_selection(
  X = data[, !names(data) %in% c("mpg")], 
  y = data$mpg,
  mode = "regression",
  n_features = 50 # purposely fit more features than I actually have here
)

...I get a factor vector of length 50 here.
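A simple input check would catch this. A sketch, assuming the arguments really are named X and n_features as in the calls above:

```r
# Sketch of a guard inside feature_selection(); assumes the existing
# arguments are named X and n_features
if (n_features > ncol(X)) {
  stop("n_features cannot be larger than the number of columns in X")
}
```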

fill_missing

test* functions

Miscellaneous:

Authors@R: c(person("Arun", "Marria", email = "arun.marria@gmail.com",
                  role = c("aut")),
           person("Cari", "Gostic", email = "cari.gostic@gmail.com",
                  role = c("aut", "cre")),
           person("Alexander", "Hinton", email = "alexander_hinton@ubc.alumni.ca",
                  role = c("aut")),
           person("Aman", "Garg", email = "gargkaman7@gmail.com",
                  role = c("aut")))

to get all of your names to show up on your website (unless you meant to only show Aman's name). We had the same issue - the intention is to supply all of the person() entries as a single vector via c(). If you do it this way, note that you won't pass check() if you have multiple "creators" (cre roles). Above I specified Cari as the creator; feel free to change as required, of course. I guess only one person can be designated as the creator, for some strange reason.

Overall, nice job! I hope my comments are helpful.

TBarasch commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS


Functionality

Final approval (post-review)

Estimated hours spent reviewing:

4-6


Review Comments

Download was easy and worked fine; the instructions were very clear and at the top of the README, which is great. Overall, the functions are well thought out and cover the steps needed to streamline the work. Good vignette and structure; there are some kinks in the functions themselves, and the documentation could be improved a bit.

Specifics for the functions: To test this package I used the cars data set and added columns as needed to test different interactions.

library(dplyr)
library(laundRy)

# load cars data
cars <- cars

# add numeric categorical 'num'
cars['num'] <- sample(c(0:6), nrow(cars), replace = TRUE)

# add string categorical 'cat'
temp <- c()
while(length(temp)<50){
  for(i in c("one","two","3")){
    temp <- c(temp,i)
    if(length(temp)==50){
      break
    }
  }
}
cars['cat'] <- temp

# add problems to DF
cars[5,] <- c(NA,NA,NA,NA)
cars[3,2] = ""
cars[2,4] = 5

# train test df
train <- cars[1:35,]
test <- cars[36:50,]

categorize():

Initially this worked when I ran it, but later, after wiping my environment and running the code again, I got an error and I'm not sure what is causing it:

cars <- cars
train <- cars[1:35, ]
test <- cars[36:50, ]
t_list <- laundRy::categorize(train)
laundRy::column_transformer(train, test, t_list)

Both test and train are of class data.frame.
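For the record:

```r
class(train)
#> [1] "data.frame"
class(test)
#> [1] "data.frame"
```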

column_transformer():

For some reason I can't explain, it would sometimes run and sometimes fail with this data frame error message:

Error: Matrices or data frames are required for preprocessing

Using this code:

train <- cars[1:35, ]
test <- cars[36:50, ]
t_list <-laundRy::categorize(train)
laundRy::column_transformer(train,test,t_list)

feature_selection():

Ultimately I couldn't quite get this function to work, sorry. Possibly the problem is that the named list is being validated as a data frame as well? I can't think of anything else.
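For context, a call following the signature shown in the other review would look roughly like this (a sketch built on my cars setup above; it mirrors Brayden's example rather than my exact code):

```r
# Attempted call pattern (sketch); mirrors the X / y / mode / n_features
# signature used elsewhere in this review thread
fs <- laundRy::feature_selection(
  X = train[, !names(train) %in% c("dist")],
  y = train$dist,
  mode = "regression",
  n_features = 2
)
```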

fill_missing():

I could not really test this function with the cars data, so I had to use the example given; testing was therefore not as comprehensive.

cgostic commented 4 years ago

Hi @braydentang1 and @TBarasch! Thank you so much for taking the time to review our package. Your feedback was very helpful, and we were able to address the following points to improve our code and documentation:

Brayden's review points:

Categorize

Column transformer

Feature selection

Fill missing

Tests

Misc

Tani's Review Points:

Categorize

Column transformer

Feature Selection

Fill missing

The link to the package with the latest release is here.