UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: laundRy (R) #19

Open arunmarria opened 4 years ago

arunmarria commented 4 years ago

name: laundRy
about: Use this template to submit software for review

Submitting Author: Arun Marria (@amarria1), Cari Gostic (@cgostic), Alexander Hinton (@zanderhinton), Aman Kumar (@amank90)
Repository: https://github.com/UBC-MDS/laundRy
Version submitted: v1.0.1
Editor: Varada (@kvarada)
Reviewer 1: Tani (@TBarasch)
Reviewer 2: Brayden (@braydentang1)
Archive: TBD
Version accepted: TBD


DESCRIPTION

Package: laundRy
Title: laundRy - preprocessing package for dataframes
Version: 0.1
Authors@R: person("Arun", "Marria", email = "arun.marria@gmail.com",
                  role = c("aut", "cre"))
           person("Cari", "Gostic", email = "cari.gostic@gmail.com",
                  role = c("aut", "cre"))
           person("Alexander", "Hinton", email = "alexander_hinton@ubc.alumni.ca",
                  role = c("aut", "cre"))
           person("Aman", "Garg", email = "gargkaman7@gmail.com",
                  role = c("aut", "cre"))
Description: The laundRy package performs many standard preprocessing techniques 
    for dataframes, before use in statistical analysis and machine learning. 
    The package functionality includes categorizing column types, handling 
    missing data and imputation, transforming/standardizing columns and 
    feature selection. The laundRy package aims to remove much of the 
    grunt work in the typical data science workflow, allowing the analyst 
    maximum time and energy to devote to modelling!
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.0.2
Suggests: 
    testthat,
    knitr,
    rmarkdown,
    covr
Imports: 
    dplyr,
    tidyr,
    rlang,
    magrittr,
    caret,
    stats,
    DT,
    e1071
VignetteBuilder: knitr
URL: https://github.com/UBC-MDS/laundRy
BugReports: https://github.com/UBC-MDS/laundRy/issues

Scope

Preprocessing is a necessary part of data analysis, and laundRy equips data scientists to perform it in a seamless fashion. The package's functions carry out the preprocessing automatically based on inputs from the user, which helps automate the analysis workflow.

Additionally, since preprocessing involves performing multiple transformations on the data, the package also falls into the data munging category.

laundRy is made for data scientists, or anyone applying statistical methods or machine learning algorithms to their data. It transforms a dataset into a format that is ready to be passed into a machine learning or statistical model, with all NAs imputed, categorical columns encoded, numerical columns scaled, and important features identified.
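A rough sketch of the intended workflow is shown below. It is illustrative only: function and argument names follow the examples elsewhere in this thread, and the fill_missing step is left as a comment since its exact signature is not shown here.

```r
library(laundRy)

df <- mtcars
train <- df[1:30, ]
test <- df[31:32, ]

# 1. Detect which columns are numeric vs. categorical
col_list <- categorize(train, max_cat = 10)

# 2. Impute missing values (signature not shown here, so elided)
# filled <- fill_missing(...)

# 3. Scale numeric columns and encode categorical columns
processed <- column_transformer(
  x_train = train,
  x_test = test,
  column_list = col_list,
  num_trans = "standard_scaling",
  cat_trans = "onehot_encoding"
)

# 4. Keep only the most informative features
top_features <- feature_selection(
  X = train[, !names(train) %in% "mpg"],
  y = train$mpg,
  mode = "regression",
  n_features = 5
)
```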

There are individual packages that handle individual preprocessing steps, but no single package that does all of these things in one go (automatically). Related packages: caret.
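For comparison, the closest existing functionality is caret's preProcess(), which covers centering, scaling, and imputation but must be combined with other tools for column typing and feature selection. A minimal example:

```r
library(caret)

# caret::preProcess handles scaling/centering (and imputation via other
# methods), but column typing and feature selection are separate steps.
pp <- preProcess(mtcars, method = c("center", "scale"))
scaled <- predict(pp, mtcars)
head(scaled)
```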

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

JOSS Options
- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:
- (*Do not submit your package separately to JOSS*)

MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

braydentang1 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing:

4-6


Review Comments

Hey! I thought your package was overall well thought out and indeed, it makes the data preprocessing step more streamlined and less of a chore. It also felt intuitive to use - the categorize function is clearly well integrated with the other functions, and its role is immediately obvious to the user. Here are some of the things that I came across while running your package and reading your documentation. First, right off the bat, some general comments:

Now for some things I noticed:

categorize:

column_transformer:

data <- mtcars
column_list_mt <- categorize(data, max_cat = 10)

train <- data[1:30, ]
test <- data[31:32, ]

my_processed_frames <- column_transformer(
  x_train = train,
  x_test = test,
  column_list = column_list_mt,
  num_trans = "standard_scaling",
  cat_trans = "onehot_encoding"
)

It would appear that the standard scaling works as intended, but the onehot_encoding does not? My categorical columns just pass through as-is; they still remain in the output untouched. Please correct me if I am not using this correctly!
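For reference, here is the kind of one-hot output I would expect, illustrated with caret::dummyVars (caret is already in your Imports). This is just an illustration of the expected result, not a claim about how column_transformer is implemented:

```r
library(caret)

df <- mtcars
df$cyl <- as.factor(df$cyl)

# One 0/1 indicator column per factor level, replacing the original column
dummies <- dummyVars(~ cyl, data = df)
head(predict(dummies, newdata = df))
```

Separately, a small formatting nitpick - a block like this: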

  for (vect in names(column_list)){
    for ( col in column_list[[vect]]){
      if(!is.element(col, names(x_train)))
        stop("Column names in the named list must be present in dataframe")
    }
  }

could be written as:

  for (vect in names(column_list)) {
    for (col in column_list[[vect]]) {
      if (!is.element(col, names(x_train))) 
        stop("Column names in the named list must be present in dataframe")
    }
  }

This is pretty nitpicky and probably really annoying/tedious to deal with, so feel free to ignore.

feature_selection:

For regression tasks, a list actually isn't returned - a factor vector is.

data <- mtcars

fs_reg <- feature_selection(
  X = data[, !names(data) %in% c("mpg")], 
  y = data$mpg,
  mode = "regression",
  n_features = 5
)

...this returns a factor vector with 10 levels, not a list. Not too sure if this is intended behavior here. Also, if I run this:

data <- mtcars

fs_class <- feature_selection(
  X = data[, !names(data) %in% c("cyl")], 
  y = as.factor(data$cyl),
  mode = "classification",
  n_features = 5
)

...I get an error:

Error in { : task 1 failed - "'names' attribute [3] must be the same length as the vector [2]"

Finally, if I run this:

data <- mtcars

fs_reg <- feature_selection(
  X = data[, !names(data) %in% c("mpg")], 
  y = data$mpg,
  mode = "regression",
  n_features = 50 # purposely fit more features than I actually have here
)

...I get a factor vector of length 50 here.
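A simple input check would catch this. A sketch, assuming the arguments really are named X and n_features as in the calls above:

```r
# Sketch of a guard inside feature_selection(); assumes the existing
# arguments are named X and n_features
if (n_features > ncol(X)) {
  stop("n_features cannot be larger than the number of columns in X")
}
```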

fill_missing

test* functions

Miscellaneous:

Authors@R: c(person("Arun", "Marria", email = "arun.marria@gmail.com",
                  role = c("aut")),
           person("Cari", "Gostic", email = "cari.gostic@gmail.com",
                  role = c("aut", "cre")),
           person("Alexander", "Hinton", email = "alexander_hinton@ubc.alumni.ca",
                  role = c("aut")),
           person("Aman", "Garg", email = "gargkaman7@gmail.com",
                  role = c("aut")))

to get all of your names to show up on your website (unless you meant to only show Aman's name). We had the same issue - the intention is to supply all of the person() entries as a single vector via c(). If you do it this way, note that you won't pass check() if you have multiple "creators" (cre roles). Above I specified Cari as the creator; feel free to change as required, of course. I guess only one person can be designated as the creator, for some strange reason.

Overall, nice job! I hope my comments are helpful.

TBarasch commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS


Functionality

Final approval (post-review)

Estimated hours spent reviewing:

4-6


Review Comments

Download was easy and worked fine; the instructions were very clear and at the top of the README, which is great. Overall, the functions are well thought out and cover the steps needed to streamline the work. Good vignette and structure; there are some kinks in the functions themselves, and the documentation could be improved a bit.

Specifics for the functions: To test this package I used the cars data set and added columns as needed to test different interactions.

library(dplyr)
library(laundRy)

# load cars data
cars <- cars

# add numeric categorical 'num'
cars['num'] <- sample(c(0:6), nrow(cars), replace = TRUE)

# add string categorical 'cat'
temp <- c()
while(length(temp)<50){
  for(i in c("one","two","3")){
    temp <- c(temp,i)
    if(length(temp)==50){
      break
    }
  }
}
cars['cat'] <- temp

# add problems to DF
cars[5,] <- c(NA,NA,NA,NA)
cars[3,2] = ""
cars[2,4] = 5

# train test df
train <- cars[1:35,]
test <- cars[36:50,]

categorize():

Initially this worked when I ran it, but later, after wiping my environment and running the code again, I got an error and I'm not sure what is causing it:

cars <- cars
train <- cars[1:35, ]
test <- cars[36:50, ]
t_list <- laundRy::categorize(train)
laundRy::column_transformer(train, test, t_list)

Both test and train are of class data.frame.
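For the record:

```r
class(train)
#> [1] "data.frame"
class(test)
#> [1] "data.frame"
```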

column_transformer():

For some reason I can't explain, it would sometimes run and sometimes fail with this data frame error message:

Error: Matrices or data frames are required for preprocessing

Using this code:

train <- cars[1:35, ]
test <- cars[36:50, ]
t_list <-laundRy::categorize(train)
laundRy::column_transformer(train,test,t_list)

feature_selection():

Ultimately I couldn't quite get this function to work, sorry. Possibly the problem is that the named list is being validated as a data frame as well? I can't think of anything else.
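For context, a call following the signature shown in the other review would look roughly like this (a sketch built on my cars setup above; it mirrors Brayden's example rather than my exact code):

```r
# Attempted call pattern (sketch); mirrors the X / y / mode / n_features
# signature used elsewhere in this review thread
fs <- laundRy::feature_selection(
  X = train[, !names(train) %in% c("dist")],
  y = train$dist,
  mode = "regression",
  n_features = 2
)
```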

fill_missing():

I could not really test this function with the cars data, so I had to use the example given; testing was therefore not as comprehensive.

cgostic commented 4 years ago

Hi @braydentang1 and @TBarasch! Thank you so much for taking the time to review our package. Your feedback was very helpful, and we were able to address the following points to improve our code and documentation:

Brayden's review points:

Categorize

Column transformer

Feature selection

Fill missing

Tests

Misc

Tani's Review Points:

Categorize

Column transformer

Feature Selection

Fill missing

The link to the package with the latest release is here.