Submission Group 24: nedahelpeR (R)

Submitting Author Name: Michelle Wang(@michelle-wms), Jiwei Hu (@hjw0703), Anupriya Srivastava (@Anupriya-Sri), Sam Quist(@squisty)

Repository: (https://github.com/UBC-MDS/nedahelpeR) Version submitted:1.0.0 Submission type: Standard Editor: Michelle Wang(@michelle-wms), Jiwei Hu (@hjw0703), Anupriya Srivastava (@Anupriya-Sri), Sam Quist(@squisty)

Reviewer 1: Ruben De la Garza Macias Reviewer 2: Jennifer Hoang Reviewer 3: Harry Chan Reviewer 4: Junrong Zhu

Paste the full DESCRIPTION file inside a code block below:

Package: nedahelpeR
Title: Streamline Exploratory Data Analysis
Version: 0.0.0.9000
Authors@R: 
    c(
    person(given = "Jiwei", 
    family = "Hu",
    email = "hjw0703@student.ubc.ca", 
    role = c("aut", "cre")),
    person(given = "Anupriya",
    family = "Srivastava", 
    role = c("aut")),
    person(given = "Michelle",
    family = "Wang", 
    role = c("aut")),
    person(given = "Samuel",
    family = "Quist", 
    role = c("aut"))
    )
Description: Data scientists often spend a lot of time preprocessing data to extract useful parts for their analysis.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
Suggests: 
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Imports: 
    purrr,
    dplyr,
    tibble,
    data.table,
    tidyr,
    stats,
    utils
VignetteBuilder: knitr

Scope

Please indicate which [category or categories][PackageCategories] this package falls under:
- [x] Data retrieval
- [x] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [ ] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see [notes on categories][NotesOnCategories] of our guidebook.

Explain how the and why the package falls under these categories (briefly, 1-2 sentences):

The package includes functions which can complete the following tasks:

Display some useful statistics
Detect outliers
Deal with missing values
Check for correlation between features

Who is the target audience and what are scientific applications of this package?
Data scientists who often spend a lot of time preprocessing data to extract useful parts for their analysis.
Are there other R packages that accomplish the same thing? If so, how does yours differ? In the R open-source ecosystem, there exists some useful packages that already tackle EDA and preprocessing, but our goal is to make it even more light-weighted, fast and specifically tailored to present numeric EDA insights. One popular and useful package that can generate EDA reports is smartEDA.

Technical checks

Confirm each of the following by checking the box.

[x] I have read the guide for authors and rOpenSci packaging guide.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x]has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[x] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.: We are in the same cohort of the UBC MDS Program. I'm the reviewer of this package under the course DSCI524

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 1 hour

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer (“rev” role) in the package DESCRIPTION file.

Review Comments

Keep up the great work! You made a great improvement to build up the R package for everyone. Here are my suggestions.

For your function, it is recommended to use return() only for early returns and rely on the final result of the last evaluated expression. Here are the details of when to use the return() here
In function get_correlated_features(), I think the input argument threshold can be assigned to a default value. It makes the function more intuitive and user-friendly
In function flag_outliers, I tried a dataframe with categorical columns. It didn't return a proper error message. It would be great to return error messages such as "The input data must be numeric".
in function flag_outliers, it would be better if you can include a brief description of how the threshold value works under the hood. Some readers may want to get more details.
I understand that the input dataframe should be numeric. However, the real data usually consists of different data types such as categorical, binary, ordinal. It would be great if you can include other data types as well in your functions for future improvement.

Overall, I was just being picky and I really find this package very useful!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no [conflicts of interest](https://devguide.ropensci.org/policies.html#coi) for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 1 hr

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Great work again everyone! Please find my suggestions below:

In the function documentation of missing_imputer, it would be nice to list the possible methods of imputation in the roxygen2 comments, similar to how it was shown in the python version of the package.
In the function documentation of flag_outliers, it would be useful to clarify how NA values are handled by the function in the percentage calculations (i.e. are they included or excluded?).
In the function correlated-features, exception handling could be added for the consider_sign argument.
In the function correlated-features, an error could be raised when a data frame contains a non-numeric column to improve usability.
There are a few minor tidyverse style guide inconsistencies in some of the scripts and tests, such as spacing around operators. Using the styler feature in RStudio could help address these points and improve code readability for future users and contributors.

Best, Jennifer

title: "review" output: rmarkdown::md_document: pandoc_args: [ "--wrap=none" ]

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 1hr

[ ] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Great package and everything worked as expected! Great job again everyone! (I had to be super picky to find any improvements you guys may need XD) please find my comments below:

It’d be better to specify the data type of the input data parameter in the Roxygen2 comments of the missing_imputer function
Additional tests can be added to check the output result of the overview function. So far you’ve checked the sum of the means and sum of variance, maybe you can include an additional one for the medians and stds, or just one for the entire df.
It’d be nice to add some code comments for the Overview function like how you guys did in the other three functions.
It’d be better to specify the data type of ‘threshold’ in the flag_outliers function.
Also, like the reviewer mentioned above, there are minor tidyverse style inconsistencies that you could improve if you got the time.

Review Comments

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing:

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Your package is impressive guys! I just have a few minor comments:

In the missing_imputer function argument description, I think it would be a good idea to specify what are the possible values that 'method' can get.
For the flag_outliers function (Roxygen), I would add a more detailed explanation about what the function is doing, it is not very clear to me how the function is flagging the outliers and how is it defining by the threshold which data point are actually outliers. In the vignette it is very clear how the function is actually working!
As an enhancement to the overview function, I would probably think adding some quantile information could be very useful, just to have a better idea of the distribution without having to see it.
While going through your functions it was kind of hard for me to read the code sometimes because of the way brackets are opened, closed and the indentation in some cases. It would be good to correct some of it, but this is something minor since the functions work as expected!
I did not see a lot of code inside your functions for handling edge cases, for example what happens when the user enters a df with an ID column and a numeric column, or cases when you only have one data point or an empty df. I do not think this will be a problem, but it is cool to have those cases covered when you never know what the user can input into the function.

UBC-MDS / software-review-2022