Submission: PrepR (R) - Githubissues

Submitting Author: @camadi @jasmineqyj @matthewconnell @gptzjs (Group 15)
Repository: https://github.com/UBC-MDS/PrepR Version submitted: v1.1.0 Editor: @kvarada Reviewer 1: @zouwenjiao Reviewer 2: @reikookamoto Archive: TBD
Version accepted: TBD

Package: PrepR
Title: Makes Preprocessing Easier
Version: 0.0.0.9000
Authors@R: 
    person("Jasmine",
           "Qin",
           role = c("aut", "cre"),
           email = "yijia.qin@alumni.ubc.ca",
           comment = c(ORCID = "YOUR-ORCID-ID"))
    person("Chimaobi",
           "Amadi",
           role = c("aut", "cre"),
           email = "maobiamadi@gmail.com",
           comment = c(ORCID = "YOUR-ORCID-ID"))
    person("George",
           "Thio",
           role = c("aut", "cre"),
           email = "george.thio@gmail.com",
           comment = c(ORCID = "YOUR-ORCID-ID"))
    person("Matthew",
           "Connell",
           role = c("aut", "cre"),
           email = "matthew.at.connell@gmail.com",
           comment = c(ORCID = "YOUR-ORCID-ID"))
Description: Preprocessing a dataset for machine learning can be a hassle. This package allows you to easily do a train/test/validation split, identify the data types of each column, one hot encode, and scale features.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Imports: 
    dplyr,
    gensvm,
    tibble,
    nlme,
    magrittr
Suggests: 
    covr,
    testthat,
    knitr,
    rmarkdown
VignetteBuilder: knitr
URL: https://github.com/UBC-MDS/PrepR
BugReports: https://github.com/UBC-MDS/PrepR/issues
RoxygenNote: 7.0.2

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [x] data extraction
- [x] data munging
- [ ] data deposition
- [x] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis

train_valid_test_split: This function splits the data set into train, validation, and test sets.

data_type: This function identifies data types for each column/feature. It returns one dataframe for each type of data.

one-hot: This function performs one-hot encoding on the categorical features and returns a dataframe for the train, test, validation sets with sensible column names.

scaler: This function performs standard scaling on the numerical features.

Who is the target audience and what are scientific applications of this package?

Machine Learning Engineers, Data Scientists, students and any other person who is interested in preprocessing data before running machine learning models.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

No single package does the four different functions of preppy524 but there are some functions that does some part of the preppy524 package.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

None

Technical checks

Confirm each of the following by checking the box.

[x] I have read the guide for authors and rOpenSci packaging guide.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[ ] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Options

- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: - (*Do not submit your package separately to JOSS*)

[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software been confirmed.
[ ] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Hi Group 15! Thanks for letting me review your package. I have some comments and suggestions below:

Just a small note: I would change the title of this issue to include the correct name of the package (i.e. PrepR)
Although the README contains most of the important information, I suggest adding some additional content:
- Outputs of the functions using pre-loaded data
- A link to a vignette that explains how the functions are to be used - I noticed that you included a Read the Docs link, but it seems to be broken. Perhaps you want to use vignette and pkgdown to setup a website for the package instead. In lecture 6, Tiffany has provided a helpful walk-through on how to do exactly this.
- Citation information - this should be included so that users know how to cite the package.
As for feature requests, here are some ideas that came to mind:
- With respect to scaler(), it would be nice to also have an option analogous to scikit-learn's MinMaxScaler()
- On a related note, it would also be nice to have the option to drop the first column in onehot()
With respect to functionality, I had a problem with onehot(). Although the example provided in the documentation runs properly, I ran into issues when trying to perform one-hot encoding on a data frame I created.
- When I call ?onehot to read the documentation, it says that the function takes in a data frame of categorical features.
- I decided to create a data frame with two columns of categorical data. Here's what happened when I tried calling the function:
- I tried debugging the function line by line and it appears to be lines 31-34 that throw the error:
```
for (level in levels(encodable_df[[header]])) {
columns <- c(columns, level)
}
```

That's all for now, but please let me know if you require any clarifications.

Reiko

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[ ] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5 hrs

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Nice job! You have detailed information about the target, installation, function features and usage in your README. It really helps me get familiar with your package immediately. I think your package would really be useful during the data preprocessing in machine learning! And I also have some suggestions for you. I hope that they can help you polish your package.
Though you include a vignette in the vignette file, it would be better to give a link of the knitted version of vignettes in your README, so that people can approach and look through it more conveniently. Vignettes are very important since it demonstrates major functionality that runs successfully locally. People can follow the examples in the vignettes to learn how to use the functions and get familiar with your package more easily and quickly.
The link in the Documentation part is broken. I don't know if it works on your computers. I tried several times, but it is still unable to work. I think it is quite a big issue you should look into immediately. Every time I use a new function, I would look at its documentation first. Without documentation, it would be really hard for me to use a function.
The usage included in your README is very detailed! I am wondering if it will be better you give a demo to each function in the README. Create a synthetic data, or use a dataset people are familiar with(such as Gapminder, mtcars), show the arguments passed into the function and the output of the function. With a demo, people would know what data is suitable for this function and what to expect after applying this function.
When I tried `install.packages("PrepR"), I got an error message as following: Then I tried the development version from GitHub, fortunately, it works!
There is a type error in your README. The One Hot Encode function should be onehot(), while in the README, it is one-hot()
When I used the abalone data to test your functions, the data_type() works out. However, the train_valid_test_split() fails. It gives me an error message: "Please provide a non-empty data.frame object for x". At first, I think it fails since my data is empty. After I checked my data, I found out that the function only accepts dataframe. I think lots of our data in R is tibble, so only accepting dataframe will make your function less applicable and popular. Besides, I think you can also edit your error message and say directly, like "the input should only be a dataframe!". The same problem also exists in the onehot() function. Since I passed a dataframe, I got an empty dataframe as a result. As I can not perform the train_valid_test_split() successfully on my abalone data, I could not try the scaler() function, because it wants train, test and validation dataframe at the same time.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README

[x] Installation instructions: for the development version of package and any non-standard dependencies in README

[ ] Vignette(s) demonstrating major functionality that runs successfully locally

[x] Function Documentation: for all exported functions in R help

[x] Examples for all exported functions in R Help that run successfully locally

[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.

[x] Functionality: Any functional claims of the software been confirmed.

[ ] Performance: Any performance claims of the software been confirmed.

[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5 hrs

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Nice job! You have detailed information about the target, installation, function features and usage in your README. It really helps me get familiar with your package immediately. I think your package would really be useful during the data preprocessing in machine learning! And I also have some suggestions for you. I hope that they can help you polish your package.

Though you include a vignette in the vignette file, it would be better to give a link of the knitted version of vignettes in your README, so that people can approach and look through it more conveniently. Vignettes are very important since it demonstrates major functionality that runs successfully locally. People can follow the examples in the vignettes to learn how to use the functions and get familiar with your package more easily and quickly.

The link in the Documentation part is broken. I don't know if it works on your computers. I tried several times, but it is still unable to work. I think it is quite a big issue you should look into immediately. Every time I use a new function, I would look at its documentation first. Without documentation, it would be really hard for me to use a function.

The usage included in your README is very detailed! I am wondering if it will be better you give a demo to each function in the README. Create a synthetic data, or use a dataset people are familiar with(such as Gapminder, mtcars), show the arguments passed into the function and the output of the function. With a demo, people would know what data is suitable for this function and what to expect after applying this function.

When I tried `install.packages("PrepR"), I got an error message as following: Then I tried the development version from GitHub, fortunately, it works!

There is a type error in your README. The One Hot Encode function should be onehot(), while in the README, it is one-hot()

When I used the abalone data to test your functions, the data_type() works out. However, the train_valid_test_split() fails. It gives me an error message: "Please provide a non-empty data.frame object for x". At first, I think it fails since my data is empty. After I checked my data, I found out that the function only accepts dataframe. I think lots of our data in R is tibble, so only accepting dataframe will make your function less applicable and popular. Besides, I think you can also edit your error message and say directly, like "the input should only be a dataframe!". The same problem also exists in the onehot() function. Since I passed a dataframe, I got an empty dataframe as a result. As I can not perform the train_valid_test_split() successfully on my abalone data, I could not try the scaler() function, because it wants train, test and validation dataframe at the same time.

@zouwenjiao Thank you for your valuable feedback! Due to group member commitments and time constraints, we did not address all of your points but did act upon three of your suggestions:

Changed the error messages returned by onehot and train_test_valid_split for incorrect data type inputs
Debugged onehot so it can accept a dataframe and return the correct output
Fixed the link for Documentation in README.md

Here is the link to the final release of PrepR which includes this change: https://github.com/UBC-MDS/PrepR/tree/v1.2.0

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README

[x] Installation instructions: for the development version of package and any non-standard dependencies in README

[ ] Vignette(s) demonstrating major functionality that runs successfully locally

[x] Function Documentation: for all exported functions in R help

[x] Examples for all exported functions in R Help that run successfully locally

[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.

[ ] Functionality: Any functional claims of the software been confirmed.

[ ] Performance: Any performance claims of the software been confirmed.

[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Hi Group 15! Thanks for letting me review your package. I have some comments and suggestions below:

Just a small note: I would change the title of this issue to include the correct name of the package (i.e. PrepR)

Although the README contains most of the important information, I suggest adding some additional content:

Outputs of the functions using pre-loaded data

A link to a vignette that explains how the functions are to be used - I noticed that you included a Read the Docs link, but it seems to be broken. Perhaps you want to use vignette and pkgdown to setup a website for the package instead. In lecture 6, Tiffany has provided a helpful walk-through on how to do exactly this.

Citation information - this should be included so that users know how to cite the package.

As for feature requests, here are some ideas that came to mind:

With respect to scaler(), it would be nice to also have an option analogous to scikit-learn's MinMaxScaler()

On a related note, it would also be nice to have the option to drop the first column in onehot()

With respect to functionality, I had a problem with onehot(). Although the example provided in the documentation runs properly, I ran into issues when trying to perform one-hot encoding on a data frame I created.

When I call ?onehot to read the documentation, it says that the function takes in a data frame of categorical features.

I decided to create a data frame with two columns of categorical data. Here's what happened when I tried calling the function:

I tried debugging the function line by line and it appears to be lines 31-34 that throw the error:
  for (level in levels(encodable_df[[header]])) {
    columns <- c(columns, level)
  }
That's all for now, but please let me know if you require any clarifications.

Reiko

@reikookamoto Thank you for your valuable feedback! Due to group member commitments and time constraints, we did not address all of your points but did implement several of them:

Include examples of the function outputs using pre-loaded data which we somehow overlooked
Typo in the title of this issue which we also caught
Debugged onehot() so that it can take dataframes with multiple columns of categorical variables.
Added feature to onehot() so the user can choose to drop the first column

Here is the link to the final release of PrepR which includes this change: https://github.com/UBC-MDS/PrepR/tree/v1.2.0

UBC-MDS / software-review

Submission: PrepR (R) #12

Scope

Technical checks

Publication options

Code of conduct

Package Review

Documentation

For packages co-submitting to JOSS

Functionality

Final approval (post-review)

Review Comments

Package Review

Documentation

For packages co-submitting to JOSS

Functionality

Final approval (post-review)

Review Comments

Package Review

Documentation

For packages co-submitting to JOSS

Functionality

Final approval (post-review)

Review Comments

Package Review

Documentation

For packages co-submitting to JOSS

Functionality

Final approval (post-review)

Review Comments