Submission group 12: simplerfit(R)

Submitting Author Name: Zihan Zhou
Submitting Author Github Handle: @zzhzoe
Other Package Authors Github handles: Mohammadreza Mirzazadeh @rezam747, Navya Dahiya @nd265, Sanchit Singh @Sanchit120496
Repository: https://github.com/UBC-MDS/simplerfit
Version submitted: Standard
Submission type: Standard
Reviewers: Abhiket Gaurav @Abhiket, Sufang Tan @Kendy-Tan, Lakshmi Santosha Valli Akella @valli180, Pavel Levchenko @plevchen
Archive: TBD
Version accepted: TBD
Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: simplerfit
Title: Clean data, perform EDA, fit classifier or regressor models and return model performance scores
Version: 0.0.0.9000
Authors@R: 
    c(person(given = "Navya",
             family = "Dahiya",
             role = c("aut", "cre"),
             email = "navyad265@gmail.com"),
      person(given = "Reza",
             family = "Mirzazadeh",
             role = "ctb"),
      person(given = "Sanchit",
             family = "Singh",
             role = "ctb"),
      person(given = "Zhian",
             family = "Zoe",
             role = "ctb"))
Description: This package helps data scientists to clean the data, perform basic EDA, visualize graphical interpretations and analyse performance of the baseline model and basic Classification or Regression models, namely Logistic Regression, Ridge on their data.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.2
Suggests: 
    covr,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Imports: 
    readr,
    caret,
    mltools,
    GGally,
    ggplot2,
    data.table,
    devtools,
    tidyr,
    gapminder,
    dplyr,
    stringr,
    rlang,
    stats,
    monomvn

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [x] data extraction
- [x] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):

An R package that cleans the data, does basic EDA and returns scores for basic classification and regression models. It helps data scientists clean data, perform basic EDA, produce visualizations, and analyse the performance of a baseline model and basic classification or regression models (logistic regression and ridge regression) on their data.

Who is the target audience and what are scientific applications of this package?

Entry-level data professionals who would like to conduct a quick exploratory data analysis. A data scientist spends a lot of time writing the same repetitive code for data processing, transformations, fitting models and comparing their performance.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
Explain reasons for any pkgcheck items which your package is unable to pass.

Technical checks

Confirm each of the following by checking the box.

[x] I have read the guide for authors and rOpenSci packaging guide.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[ ] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[ ] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of the package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[ ] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best practices. In general please consider:

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5hrs

Review Comments

Going through the package, I am impressed by the ideation and conceptualization of this project, and I am sure it has many practical applications. However, several areas remain unaddressed, and for now users should treat the package's output with some caution.

As in my review of the Python version, I think the following areas need attention:

The scope and usability of this package could be more explicit. A few sentences should also explain the dummy classifier or dummy regressor used as the baseline.
Removing every row that contains a missing value is not an ideal way to handle missing data. There are various imputation methods, and some of them could be implemented here.
The vignettes need to be more detailed and self-explanatory. In particular, it is not clear what the dataset-cleaning function does. Can it clean data taken directly from a URL, or does the user have to remove the header and footer before calling it?
The plot type (histogram, scatter, box, etc.) could be accepted as an input.
Since EDA involves slicing and dicing the data, an option for faceting could also be added.
The correlation plot does not appear to render properly.
The classification function uses logistic regression for binary classification. Please mention this in the scope of the project in the README.
Accuracy is a reasonable metric for the example shown, but a user may want a different one. The choice of metric could also be an input to the function.
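
To illustrate the last point, here is a minimal base R sketch of a scoring helper that takes the metric as an argument. The function name `score_predictions`, its arguments, and the convention that `1` marks the positive class are all assumptions made for this example; none of them come from simplerfit.

```r
# Hypothetical helper, not part of simplerfit: score binary predictions with a
# user-chosen metric instead of a hardcoded accuracy.
score_predictions <- function(truth, predicted,
                              metric = c("accuracy", "precision", "recall")) {
  metric <- match.arg(metric)  # the first choice, "accuracy", is the default
  stopifnot(length(truth) == length(predicted))

  # Assumption of this sketch: labels are 0/1 and 1 is the positive class.
  tp <- sum(truth == 1 & predicted == 1)
  fp <- sum(truth == 0 & predicted == 1)
  fn <- sum(truth == 1 & predicted == 0)

  switch(metric,
    accuracy  = mean(truth == predicted),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn)
  )
}

truth     <- c(1, 0, 1, 1, 0, 0)
predicted <- c(1, 0, 0, 1, 1, 0)

score_predictions(truth, predicted)                     # uses the default metric
score_predictions(truth, predicted, metric = "recall")  # caller picks the metric
```

Using `match.arg()` keeps accuracy as the default while rejecting unsupported metric names with an informative error.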

Overall, I was able to install the package and use it on a few toy datasets. All the functions within the package work well and as expected. However, I feel the documentation needs to be more detailed and self-explanatory, even for a novice user. Given the time constraint, I have huge respect for the team. I am sure that, if developed to the fullest, this package can rock the data-science world!!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[ ] Citation information

Usability

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5 hrs

Review Comments

Good job, team! Thank you for coming up with such a useful package; I can't wait to use it in a real project later. However, I still have a few suggestions to make the package more versatile:

The cleaner function could do more, such as imputing missing values in numerical columns with the mean, median, or most frequent value.
Both the classification and regression functions could offer more model options, such as decision trees, and the regression could allow non-linear terms.
In the EDA plotting, you could add faceting to show the correlations among different categorical groups.
Although there is a docs badge, it would be great if the documentation link were also given in the "About" section.
You could add more plot types, such as histogram, scatter, and box plots, so that users can choose the one most appropriate for their problem.
If the package only works with a particular data set structure, you should say so in the README.md.
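
The imputation suggestion above could look roughly like the following base R sketch. The helper `impute_numeric` and its `strategy` argument are hypothetical names chosen for this example and are not part of simplerfit's cleaning function.

```r
# Hypothetical helper, not part of simplerfit: fill missing values in numeric
# columns instead of dropping the rows that contain them.
impute_numeric <- function(df, strategy = c("median", "mean", "most_frequent")) {
  strategy <- match.arg(strategy)
  stopifnot(is.data.frame(df))

  fill_value <- function(x) {
    observed <- x[!is.na(x)]
    switch(strategy,
      median = stats::median(observed),
      mean = mean(observed),
      most_frequent = {
        distinct <- unique(observed)
        # Ties go to the value that appears first in the column.
        distinct[which.max(tabulate(match(observed, distinct)))]
      }
    )
  }

  for (col in names(df)) {
    # Non-numeric columns are left untouched in this sketch.
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- fill_value(df[[col]])
    }
  }
  df
}

toy <- data.frame(
  age   = c(20, NA, 40, 60),
  score = c(1, 1, NA, 3),
  group = c("a", "b", NA, "a")
)

impute_numeric(toy, strategy = "median")
```

Only numeric columns are filled here; categorical columns such as `group` would need their own strategy, which this sketch leaves out.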

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s): demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all exported functions
[X] Examples: (that run successfully locally) for all exported functions
[X] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[X] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software have been confirmed.
[X] Performance: Any performance claims of the software have been confirmed.
[X] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[X] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 1 hour

[X] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Nice work! Some of these functions could be useful for actual EDA work in the future. Some thoughts on potential improvements:

Maybe the package should focus on either visualization or metrics. Combining both in the same package could confuse the end user.
In the fit_regressor function, users have no option to select the type of regression. A better approach could be to assign one type as the default in the function definition while letting the user select from multiple options.
A similar comment applies to the fit_classifier function. Many calculations are hardcoded and cannot be adapted to user needs.
The cleaning function is too restrictive. It could be improved significantly by calculating the number of missing observations, assessing whether the missingness can be treated as MAR, MNAR or MCAR, and applying different imputation methods accordingly.
For tests, I think a better option is to use small tibbles instead of a full csv. It is much easier to spot the issue that way than when working with a complex dataset.

Great work with defensive programming. All functions were very well thought out in terms of edge cases and which warnings/errors should be raised.
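
As a rough illustration of the default-plus-options idea and of checking behaviour on a tiny inline dataset, here is a sketch in base R. The function `fit_regressor_sketch`, its `model` argument, and the two model choices are assumptions for this example; they do not reflect the actual signature or models of simplerfit's fit_regressor, and a plain data frame stands in for a tibble to avoid extra dependencies.

```r
# Hypothetical sketch, not simplerfit's fit_regressor(): the model type has a
# default but can be overridden by the caller.
fit_regressor_sketch <- function(df, target, model = c("linear", "quadratic")) {
  model <- match.arg(model)  # the first choice, "linear", is the default
  stopifnot(is.data.frame(df), target %in% names(df))

  # Assumption of this sketch: all other columns are numeric predictors.
  predictors <- setdiff(names(df), target)
  term_labels <- switch(model,
    linear = predictors,
    quadratic = c(predictors, paste0("I(", predictors, "^2)"))
  )

  fit <- stats::lm(stats::reformulate(term_labels, response = target), data = df)
  list(model = model, r_squared = summary(fit)$r.squared)
}

# A small inline data frame keeps a failing check easy to diagnose.
small <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(1.1, 3.9, 9.2, 15.8, 25.1)
)

linear_fit    <- fit_regressor_sketch(small, target = "y")
quadratic_fit <- fit_regressor_sketch(small, target = "y", model = "quadratic")

# The quadratic model nests the linear one, so its R-squared cannot be lower.
stopifnot(quadratic_fit$r_squared >= linear_fit$r_squared)
```

With five hand-written rows, a reader can see at a glance why a check passes or fails, which is harder when the test depends on a full csv file.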

UBC-MDS / software-review-2022