Open arunmarria opened 4 years ago
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
URL
, BugReports
and Maintainer
(which may be autogenerated via Authors@R
).For packages co-submitting to JOSS
- [ ] The package has an obvious research application according to JOSS's definition
The package contains a
paper.md
matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing:
4-6
Hey! I thought your package was overall well thought out and indeed, makes the data preprocessing step more streamlined and less of a chore. It also felt intuitive to use - the function categorize clearly had a well thought out integration with the other functions and it is very immediate to the user. Here are some of the things that I came across while running your package and reading your documentation. First, right off the bat with some general comments:
Now for some things I noticed:
categorize
:
column_transformer
:
data <- mtcars
column_list_mt <- categorize(data, max_cat = 10)
train <- data[1:30, ]
test <- data[31:32, ]
my_processed_frames <- column_transformer(
x_train = train,
x_test = test,
column_list = column_list_mt,
num_trans = "standard_scaling",
cat_trans = "onehot_encoding"
)
It would appear that the standard scaling works as intended but the onehot_encoding is not? My categorical columns just pass through as is. The categorical columns still remain. Please correct me if I am not using this correctly!
categorize
? Looks like the function fill_missing
does describe this.for
and if
statements, such as: for (vect in names(column_list)){
for ( col in column_list[[vect]]){
if(!is.element(col, names(x_train)))
stop("Column names in the named list must be present in dataframe")
}
}
could be written as:
for (vect in names(column_list)) {
for (col in column_list[[vect]]) {
if (!is.element(col, names(x_train)))
stop("Column names in the named list must be present in dataframe")
}
}
This is pretty nitpicky and probably really annoying/tedious to deal with, so feel free to ignore.
feature_selection
:
Documentation: typo here, "vector of renponse" for argument y
Documentation: the description...what method does this function actually use? It looks like it uses recursive feature elimination as outlined by Kuhn. Note that this method I think does not behave like how we learned in 573/sklearn in case you guys didn't realize this! It uses a more informative way to remove features by incorporating train/test validation scores and subsets of the "most important features" as given in the "full" model. The link above explains it more thoroughly in case you guys care.
Documentation: for mode, perhaps consider "a string, either "regression" or "classification"".
Documentation: n_features, it is not clear from the documentation what this is supposed to be. The documentation just says "int". Presumably, this is the max_features retained (or in Kuhn's case, the subset of features to select?)
Documentation: for value, a list of what?
Documentation: Examples, the lines are a bit long here according to goodpractice::gp().
Some things I noticed after using this function:
For regression tasks, a list actually isn't returned - a factor vector is.
data <- mtcars
fs_reg <- feature_selection(
X = data[, !names(data) %in% c("mpg")],
y = data$mpg,
mode = "regression",
n_features = 5
)
...this returns a factor vector with 10 levels, not a list. Not too sure if this is intended behavior here. Also, if I run this:
data <- mtcars
fs_class <- feature_selection(
X = data[, !names(data) %in% c("cyl")],
y = as.factor(data$cyl),
mode = "classification",
n_features = 5
)
...I get an error:
Error in { : task 1 failed - "'names' attribute [3] must be the same length as the vector [2]"
Finally, if I run this:
data <- mtcars
fs_reg <- feature_selection(
X = data[, !names(data) %in% c("mpg")],
y = data$mpg,
mode = "regression",
n_features = 50 # purposely fit more features than I actually have here
)
...I get a factor vector of length 50 here.
fill_missing
test* functions
goodpractice::gp() complains about 1:length(...), 1:nrow(...) usage in for loops rather than seq_along, and also the use of sapply. Feel free to ignore this stuff (and really, any of the styling stuff).
in your DESCRIPTION file, in the Authors@R section, try this instead:
Authors@R: c(person("Arun", "Marria", email = "arun.marria@gmail.com",
role = c("aut")),
person("Cari", "Gostic", email = "cari.gostic@gmail.com",
role = c("aut", "cre")),
person("Alexander", "Hinton", email = "alexander_hinton@ubc.alumni.ca",
role = c("aut")),
person("Aman", "Garg", email = "gargkaman7@gmail.com",
role = c("aut")))
to get all of your names to show up on your website (unless you meant to only show Aman's name). We had the same issue - I guess the intention is to provide a vector. If you do it this way, you won't pass check() if you have multiple "creators". Above I specified Cari as the creator, feel free to change of course as required. I guess only one can be designated as the creator for some strange reason.
Overall, nice job! I hope my comments are helpful.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
URL
, BugReports
and Maintainer
(which may be autogenerated via Authors@R
).For packages co-submitting to JOSS
.
Estimated hours spent reviewing:
4-6
Download was easy and worked fine, instructions were very clear and at the top of the README which is great. Overall well thought out functions and steps needed to cover to streamline the work. good vignette and structure, some kinks in the functions themselves and documentation could be improved a bit.
Specifics for the functions:
To test this package I used the cars
data set and added columns as needed to test different interactions.
library(dplyr)
library(laundRy)
# load cars data
cars <- cars
# add numeric categorical ‘num’
cars['num'] <- sample(c(0:6),nrow(cars), replace=T)
# add string categorical ‘cat’
temp <- c()
while(length(temp)<50){
for(i in c("one","two","3")){
temp <- c(temp,i)
if(length(temp)==50){
break
}
}
}
cars['cat'] <- temp
# add problems to DF
cars[5,] <- c(NA,NA,NA,NA)
cars[3,2] = ""
cars[2,4] = 5
# train test df
train <- cars[1:35,]
test <- cars[36:50,]
categorize()
:Initially worked when i ran it, but later when wiping my environment and running this code again, i got an error and i'm not sure what is causing it:
cars <- cars
train <- cars[1:35, ]
test <- cars[36:50, ]
t_list <-laundRy::categorize(train)
laundRy::column_transformer(train,test,t_list)
both test and train of class() == data.frame
column_transformer()
:num_trans
and cat_trans
For some reason i can’t explain it would sometimes run and sometimes fail Data frame error message:
Error: Matrices or data frames are required for preprocessing
using this code
train <- cars[1:35, ]
test <- cars[36:50, ]
t_list <-laundRy::categorize(train)
laundRy::column_transformer(train,test,t_list)
feature_selection()
:column_transformer
or other functions in laundRy
) this needs to be mentioned.Ultimately couldn’t quit get this function to work, sorry. Possibly the problem is that the named list is being tested to be a data frame as well? can’t think of anything else.
fill_missing()
:column_transformer
or other functions in laundRy
) this needs to be mentioned (and the order). num_imp
not working (listed in arguments, not in description)num_imp
and cat_imp
Could not really test this function with the cars data, so had to use the example given, testing was therefor not as comprehensive.
Hi @braydentang1 and @TBarasch! Thank you so much for taking the time to review our package. Your feedback was very helpful, and we were able to address the following points to improve our code and documentation:
Categorize
Column transformer
Feature selection
Fill missing
Tests
=
to <-
where appropriateMisc
Categorize
Column transformer
Feature Selection
Fill missing
The link of the package with latest release is here
name: laundRy about: Use this template to submit software for review
Submitting Author: Arun Marria (@amarria1), Cari Gostic(@cgostic), Alexander Hinton(@zanderhinton), Aman Kumar(@amank90) Repository: https://github.com/UBC-MDS/laundRy Version submitted: v1.0.1 Editor: Varada (@kvarada )
Reviewer 1: Tani(@TBarasch ) Reviewer 2: Brayden (@braydentang1) Archive: TBD
Version accepted: TBD
DESCRIPTION
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
Preprocessing is necessary part of data analysis and
laundRy
equips data scientists to perform that in a seamless fashion. The package has functions which do the preprocessing automatically by taking inputs from user. Thus, it helps in workflow automation.Additionally, as the preprocessing involves performing multiple transformations on data, thus the package also falls in Data munging category.
laundRy
is made for data scientists, or anyone applying statistical methods or machine learning algorithms to their data. It transforms a dataset into a format that is ready to be passed into a machine learning or statistical model, with all NAs imputed, categorical columns encoded, numerical columns scaled, and important features identified.There are individual packages for data preprocessing but no package that does all things in one go(automatically). Related packages -
caret
Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
JOSS Options
- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: - (*Do not submit your package separately to JOSS*)MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)Code of conduct