Submission: mealprep (R)

Submitting Authors: Anny Chih (@annychih), Sam Edwardes (@SamEdwardes) , Huayue Lu (@mglu123) , Jarome Leslie (@jsleslie )
Repository: mealprepR
Version submitted: v1.1.0 Editor: TBD
Reviewer 1: @doraqmon
Reviewer 2: @schepal Archive: TBD
Version accepted: TBD

Paste the full DESCRIPTION file inside a code block below:

Package: mealprepR
Title: Helps Prepare Your Data Meal
Version: 0.0.0.9000
Authors@R: c(person("Anny Chih", "Developer", role = c("aut", "cre"),
                      email = "anny@annychih.com"),
              person("Huayue (Luke) Lu", "Developer", role = "aut"),
              person("Jarome Leslie", "Developer", role = "aut"),
              person("Sam Edwardes", "Developer", role = "aut"))
Author: DSCI 524 Group 14
Description: Helps prepare your data meal by separating numerical and categorical variables (fruits_and_veg), finding missing data (missing_ingredients), finding outliers (bad_apples), and preprocessing data (slice_and_dice). 
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.0.2
Imports: 
    devtools,
    vctrs,
    lifecycle,
    pillar,
    janitor,
    dplyr,
    magrittr,
    testthat,
    tidyr
Depends: 
    caret
Suggests: 
    covr,
    knitr,
    rmarkdown
URL: https://github.com/UBC-MDS/mealprepR
BugReports: https://github.com/UBC-MDS/mealprepR/issues
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [x] data munging
- [ ] data deposition
- [ ] workflow automataion
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package includes three functions, find_fruits_veg(), find_bad_apples() and 'find_missing_ingredients(), which perform diagnostics on a data frame and one function,make_recipe()`, which implements a sequence of preprocessing steps.

Who is the target audience and what are scientific applications of this package?

The target audience for tools provided in mealprepR include all data scientists embarking on a new data analysis pipeline.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

See the mealprepR and R’s Ecosystem section of our package README.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

N/A

Technical checks

Confirm each of the following by checking the box.

[x] I have read the guide for authors and rOpenSci packaging guide.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Options

- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: - (*Do not submit your package separately to JOSS*)

[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Thank you for inviting me to review your package!

Overall, you did a great job creating this data preprocessing package. I really like the name of your package and functions, they are very creative! I had no problem installing and testing your package. And the readme is very coherent and easy to follow. All my suggestions are listed below.

Please feel free to leave a comment below if you have questions regarding my review.

General Comments:

After running devtools::check() from your package root, I got 0 errors but 2 warnings related to the dependencies in your packages, see screenshot below:

You might need to modify your Description file to resolve these warnings.

I have also tried running devtools::test() and 'covr::report()', all the units tests passed and the test coverage is 100%, great work!
Using goodpractice::gp(), I got some suggestions for your code. Most of them are related to the long code lines, which might not be good for readability. There is also an issue related to the assignment symbol used, you might consider using <- instead of = in your find_bad_apple.R, find_fruits_veg.R, test-find_fruits_veg.R functions.
Using spelling::spell_check_package(), I noticed there are some spelling errors in your README as well as vignette files. I would suggest you run this command and fixed all the typos.

Specific Comments

README Documentation

Under the installation section, both CRAN and GitHub instructions are included. Since your package has not been published to CRAN, I would suggest removing the CRAN installation from the readme file.
It would be useful for users if you could add the input and output type under each of your functions in the readme file.
You are using different input datasets for your examples. For consistency, I would recommend using the same dataset for all 4 toy examples. For example, you could use the iris dataset and insert some NA or outlier values for testing.
The code chunks in examples for find_bad_apples() are too long and not good for readability, I would suggest change the input dataset or make it shorter.

Function 1: find_fruits_veg

This function worked really well on different datasets I tried. It could be written more defensively. Specifically, the function should raise a warning/message if the input parameter type_of_out is not equal to num nor categ.

Function 2: find_missing_ingredients

The output of this function is very informative and would be very useful for identifying missing values. The only area for improvement is to remove the last 2 commented lines in this function for code readability.

Function 3: find_bad_apples

One recommendation here is to include the data range and the threshold value of identifying as an outlier in the output. If the column has some extremely large/small values, the approach of using mean plus/minus 2 standard deviation would remove quite an amount of the points in the data. Adding the information of data range and threshold values would help users get a more clear picture of whether they should drop the outliers.

Function 4: make_recipe

This function is a very powerful tool to use when doing machine learning in R. The roxygen is well-documented. The defensive part needs to be improved. In your code, only 2 parameters recipe and splits_to_return are checked in the code. You might consider adding tests and defensive code writing for the other 4 parameters (i.e. X, y, random_seed, train_valid_prop)

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s) demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all exported functions in R help
[X] Examples for all exported functions in R Help that run successfully locally
[X] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[X] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software been confirmed.
[X] Performance: Any performance claims of the software been confirmed.
[X] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[X] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Review Comments

I found the naming convention of this project to be very creative and it made me smile! Well done on brainstorming interesting functions and a clever package name. I successfully installed the package using the devtools::install_github("UBC-MDS/mealprepR") command in my RStudio console. I appreciated the in-depth examples in the README document as this allowed me to confirm that the library package was properly installed on my machine. However, I was not able to install the package from CRAN using the install.packages("mealprepR") code chunk as shown below:

Furthermore, I found that the load_all() command in the installation instructions did not properly work on my machine. I was still able to load the mealprepR library, however, I did receive the following error below when running load_all().

Individual Function Comments

find_fruits_veg():

I was able to successfully use the iris dataset and retrieve the numerical index columns using this function. I did however struggle with determining which particular string I needed to feed into the second argument to retrieve the categorical columns. When I tried to run the following command find_fruits_veg(iris, 'cat') the console did not return anything as I assumed if num is numerical then cat would be categorical. After reviewing the roxygen documentation, I realized that in order to get the categorical indexes, we have to use find_fruits_veg(iris, 'categ'). As a result, it could be helpful to provide further detailed documentation in the README on the exact arguments that can be inputted into the find_fruits_veg() function.

find_missing_ingredients():

I appreciated the key point statistics produced by this library such as the proportion of items with NaN values and indices which have NaNs. I was able to get the function to work seamlessly on my machine, however, there were a few warning messages which appeared as shown below:

find_bad_apples():

Outlier detection is a critical tool used across many industries and I can see how this particular function can provide value. The function was relatively easy to use, however, for future iterations consider adjusting this function so that both the indices and columns are returned in the same tibble. Furthermore, I received the same warning message above as find_missing_ingredients() when running this function.

make_recipe():

This was my favourite function in the package - very well done! I was impressed by its simplicity and overall effectiveness by just writing several lines of code. Furthermore, I was able to successfully run this function with no warning errors and received a clean output of the preprocessed data. The only area of improvement I can see here is to provide more "recipes" in addition to one-hot encoding in future iterations.

Areas of Improvement

Some of the function names are not particularly intuitive on their intended purpose. For example, on an initial glance it can be challenging to understand the difference between find_missing_ingredients() and find_bad_apples(). Despite the creative and fun names, this may pose as a hinderance in a professional working situation where the stakes are high and developers are expected to maintain clean and working code. Unconventional function names may complicate the software review process and potentially bottleneck the development pipeline. Many software developers strongly suggest naming functions and variables which closely align with their intended use-case to ensure simplicity in the code review process.

Overall fantastic work on this project - it was an enjoyable experience interacting with the library! I was impressed by the effectiveness of this package and I can see it being used by industry professionals within the data science community.

@doraqmon

Thank you for your reviewing!

All your points are valued, but with limited time only following issues are addressed.

Adressed issue:

Third point in 'General Comments'. I have change the '=' to '<-' in R script.
'Specific Comments' regrading to function find_fruits_veg. I have edited the function in a more defensive way by checking if the input of type_of_out is correct or not. Also, updated the corresponding test file as well.

The new released version is [here]()

@schepal

Thanks for taking time to review our package! You've given us a lot of constructive feedback to work with. Of the points you made I've made the following updates:

Function find_fruits_veg() description in the README does not explicitly specify the input argument to identify categorical columns. Accordingly, I've expanded the example for this function to show both the numeric and categorical cases. In addition, I've printed a segment of the source data to add more context to the function output.

I have not implemented the change to remove the specific warnings received as one potential solution would increase the number of dependencies. This is something that we'll take into account in future updates.

The new release version of this package may be found here

All the best!

UBC-MDS / software-review