Submission: aRidanalysis (R)

Submitting Authors:

Santiago Rugeles (@ansarusc)
Neel Phaterpekar (@nphaterp)
Daniel Ortiz (@danielon-5)
Craig McLaughlin (@cmmclaug)

Repository: https://github.com/UBC-MDS/aRidanalysis Version submitted: 0.3.0 Editor: Tiffany Timbers (@ttimbers) Reviewers: TBD

Archive: TBD Version accepted: TBD

DESCRIPTION:

Package: aRidanalysis
Title: DRY out your regression analysis!
Version: 0.3.0
Authors@R: 
    person(given = "Santiago",
           family = "Rugeles Schoonewolff",
           role = c("aut", "cre"),
           email = "nphaterp@ualberta.ca")
    person(given = "Daniel",
           family = "Ortiz Nunez",
           role = c("aut", "cre"),
           email = "daniel.ortiz.n@gmail.com")
    person(given = "Neel",
           family = "Phaterpekar",
           role = c("aut", "cre"),
           email = "nphaterp@ualberta.ca")
    person(given = "Craig",
           family = "McLaughlin",
           role = c("aut", "cre"),
           email = "cswag7@student.ubc.ca")
Description: The aRidanalysis package allows users to streamline their regression analysis by providing functions that return model objects with familiar sklearn interfaces. An EDA function and model functions for different regression analysis response types are provided to simplify the analysis and modelling process and reduce code duplication. These model objects will allow users to leverage the inference capabilities of the R ecosystem with the prediction pipelines familiar in Python.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
Imports: 
    dplyr,
    tidyr,
    glmnet,
    palmerpenguins,
    ggplot2,
    GGally,
    tidyverse,
    MASS,
    broom,
    AER,
    tibble,
    rlang,
    stringr,
    magrittr
Suggests: 
    testthat (>= 3.0.0)
Config/testthat/edition: 3

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [x] data munging
- [ ] data deposition
- [x] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
- This aRidanalysis package falls under data munging because we provide functions to perform exploratory data and instantiate regression analysis models on provided data.
- This aRidanalysis package falls under workflow automation given the provided function interfaces simplify and logically translate the inputs to other libraries to streamline the generation of visualizations and models.
Who is the target audience and what are scientific applications of this package?
- The target audience is newer machine learning enthusiasts familiar with the Sci-kit Learn Python interfaces looking to streamline their machine learning workflows in the R language.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
- Yes, the superml package essentially implements almost the entire scope of the Ski-kit Learn functionality. Our package is much more focused, with a much simpler interface not requiring the parsing of a huge volume and range of functionality.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
- Yes, this package is focused on very general data regression analysis.
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Technical checks

Confirm each of the following by checking the box.

[ x] I have read the guide for authors and rOpenSci packaging guide.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples (that run successfully locally) for all exported functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[x] A short summary describing the high-level functionality of the software

[x] Authors: A list of authors with their affiliations

[x] A statement of need clearly stating problems the software is designed to solve and its target audience.

[x] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Estimated hours spent reviewing: 1

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Since the goal of the package is make all of the different functions more accessible maybe there could be a more intuitive way to input all of the information as arguments into the different functions. I.e., one argument could be used to input the df and then more could be used to specify the columns.
Good job including other packages with similar functions. This makes the purpose of the function much clearer to anyone reviewing the package.
For the packages which are used for visualization perhaps the addition of optional features that allowed for adjustment of some parameters would have been useful. For example, allowing users to input graph or axis titles or the use of different color pallets.
In the vignette/examples for the regression functions it would be easier to interpret all of the examples if each of the reported values were placed within a string (i.e. "Model score is ____")
The documentation doesn't appear to contain detailed information about each of the different functions (i.e. does not contain information about the docstrings/ different arguments). Using the pckdown package it should be possible to generate docs with links to multiple pages (for each function) and full details about arguments.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[ ] Examples (that run successfully locally) for all exported functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[x] A short summary describing the high-level functionality of the software

[x] Authors: A list of authors with their affiliations

[x] A statement of need clearly stating problems the software is designed to solve and its target audience.

[x] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Estimated hours spent reviewing:

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Hi group 8,

Great job on this package development. I find the package has a clear purpose and the functions are easy to use. Several comments:

In aridanalysis.R, for function arid_eda, I was not able to run the example in the docstring arid_eda(house_prices, 'price', 'continuous, c('rooms', 'age','garage')). I believe continuous is missing a quote on the right and dataset house_prices is not loaded.
I like the clear explanations for each function in the vignette. It's great that the introduction mentions that this package is designed for bridging the gap of R and Python users. I find the comparison between machine learning and statistical analysis a little bit unnecessary though.
The readme is clear and concise. Great job! I appreicate the inclusion of related packages.
I saw that in testthat folder, for tests of countreg, each testthat code block uses repetitive data generation code. It will be great if helper data could be used to make the script DRY.
The functions are well written. I'm impressed by how defensive each function is. This is awesome!

Thank you for your feedback @MattTPin!

I'll do my best to address your comments below:

Since the goal of the package is make all of the different functions more accessible maybe there could be a more intuitive way to input all of the information as arguments into the different functions. I.e., one argument could be used to input the df and then more could be used to specify the columns.

This was actually the design of our original interface! We decided to change back to the standard Sci-Kit Learn/R X,y dataframe interface as a way to standardize the process of using these packages. We will discuss and reconsider, but I think there is some value in having one consistent usage pattern.

Good job including other packages with similar functions. This makes the purpose of the function much clearer to anyone reviewing the package.

Thanks! It's certainly well covered ground so we had to try to make our package distinct.

For the packages which are used for visualization perhaps the addition of optional features that allowed for adjustment of some parameters would have been useful. For example, allowing users to input graph or axis titles or the use of different color pallets.

This is great feedback, and certainly something to add to the next iteration of improvements.

In the vignette/examples for the regression functions it would be easier to interpret all of the examples if each of the reported values were placed within a string (i.e. "Model score is ____")

This was also our original implementation idea for interpretability. In the end, we decided to mimic the Sci-Kit Learn output as close as possible to allow users the flexibility to modify their result strings as desired.

The documentation doesn't appear to contain detailed information about each of the different functions (i.e. does not contain information about the docstrings/ different arguments). Using the pckdown package it should be possible to generate docs with links to multiple pages (for each function) and full details about arguments.

I think an example of the function documentation you're looking for is here, but this might be telling us to make this information more accessible!

Thanks again for taking the time to review our package, and please let us know if there are any follow-up questions!

aRidanalysis Team

Hi @xudongyang2, thank you for taking the time to review our package! I will do my best to address your comments below:

In aridanalysis.R, for function arid_eda, I was not able to run the example in the docstring arid_eda(house_prices, 'price', 'continuous, c('rooms', 'age','garage')). I believe continuous is missing a quote on the right and dataset house_prices is not loaded.

Thanks for catching this! We have created the following issue to address this: https://github.com/UBC-MDS/aRidanalysis/issues/65

I like the clear explanations for each function in the vignette. It's great that the introduction mentions that this package is designed for bridging the gap of R and Python users. I find the comparison between machine learning and statistical analysis a little bit unnecessary though.

This is good to know, we will discuss and determine how to best communicate with our audience!

The readme is clear and concise. Great job! I appreicate the inclusion of related packages.

Thanks! We think it was important to differentiate our package due to the relatively crowded regression package ecosystem.

I saw that in testthat folder, for tests of countreg, each testthat code block uses repetitive data generation code. It will be great if helper data could be used to make the script DRY.

This is definitely a good idea to reduce code duplication, we have added a new issue to track this: https://github.com/UBC-MDS/aRidanalysis/issues/66

The functions are well written. I'm impressed by how defensive each function is. This is awesome!

Thanks! We certainly wanted our package functions to be robust and clearly explain any erroneous inputs for ease of use.

Thank you again for your feedback. We have added the issues above to help make the improvements you have suggested!

UBC-MDS / software-review-2021