Submission: eaziReda (R)

Submitting Author:

Vignesh Lakshmi Rajakumar (@vigneshrajakumar)
Dustin Andrews (@dbandrews)
Arash Shamseddini (@arashshams)
Yuyan Guo (@yuyanguo)

Repository: eaziReda
Version submitted: 0.2.0 (TBD)
Editor: TBD
Reviewers: TBD

Archive: TBD
Version accepted: TBD

Paste the full DESCRIPTION file inside a code block below:

Package: eaziReda
Title: A Quick And Easy Way To Do EDA And Preprocessing
Version: 0.0.0.9000
Authors@R:
    c(
    person(given = "Dustin",
           family = "Andrews",
           role = c("aut", "cre"),
           email = "dandrew9@student.ubc.ca"),
    person(given = "Vignesh",
           family = "Rajakumar",
           role = c("aut")),
    person(given = "Arash",
           family = "Shamseddini",
           role = c("aut")),
    person(given = "Yuyan",
           family = "Guo",
           role = c("aut"))
    )
Description: Almost every data analysis project involves some exploratory data analysis (EDA) and data preprocessing.
  Usually they serve as a crucial and unavoidable step in a data analysis workflow.
  There are some very common tasks in EDA, which can include checking missing values, detecting outliers, plotting correlation plots between features
  and plotting histograms/bar plots for each individual feature.
  Typically these steps are followed by some preprocessing like imputation and dealing with outliers.
  All of those steps together may require lots of coding effort and can be repeated for several projects.
  To solve this issue, we designed this R package eaziReda that wraps all of those lines of code into four convenient
  functions that will allow you to quickly and easily carry out EDA along with some simple preprocessing using just four lines of code!
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
Imports: 
    magrittr,
    ggplot2,
    dplyr,
    cowplot,
    tidyr,
    tidyselect,
    rlang,
    vdiffr (>= 0.3.3),
    tibble,
    isotree,
    data.table,
    purrr
Suggests: 
    testthat (>= 3.0.0),
    covr
Config/testthat/edition: 3
Remotes: 
    r-lib/vdiffr
URL: https://ubc-mds.github.io/eaziReda, https://github.com/UBC-MDS/eaziReda
BugReports: https://github.com/UBC-MDS/eaziReda/issues

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [x] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
- [x] exploratory data analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):

eaziReda produces interactive plots (e.g. histograms and correlation plots) that show the distribution and correlation of features in a given dataset. It also supports data wrangling, since at its core it is designed to deal with missing data and outliers.

Who is the target audience and what are scientific applications of this package?

The target audience is anyone who wants an interactive visualization of the dataset at hand, as well as people who wish to do some quick data munging, especially when their dataset contains missing values and outliers.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

There are similar R packages, such as "SmartEDA" and "dlookr", but eaziReda aims to handle the most commonly needed EDA and data wrangling tasks quickly and conveniently. Another difference is that eaziReda is quite lightweight.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Technical checks

Confirm each of the following by checking the box.

[x] I have read the guide for authors and rOpenSci packaging guide.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[ ] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Briefly describe any working relationship you have (had) with the package authors.

I am a fellow colleague in the UBC MDS program, and have worked with the authors of this package in the past. There is no conflict of interest.

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples (that run successfully locally) for all exported functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[x] A short summary describing the high-level functionality of the software

[x] Authors: A list of authors with their affiliations

[x] A statement of need clearly stating problems the software is designed to solve and its target audience.

[x] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Estimated hours spent reviewing: 3

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Before getting to the suggestions for improvement, I just wanted to say that I think you all did a great job. The install was smooth, all of the examples run for each function, and the vignette was easy to follow. The target audience and purpose of this package were very clear to me. Overall, I had to get very nit-picky to find feedback.

Checks:

devtools::check() - Pass
devtools::test() - Pass
Code Coverage - 100%
spelling::spell_check_package(): No spelling mistakes in README.md. Some minor spelling mistakes in the function documentation (see below).
Vignette - Runs
Examples within roxygen2 documentation: Runs

Constructive Feedback:

[ ] Lines longer than 80 characters: Multiple files have lines longer than 80 characters, and this can hurt readability. The files include corr_plot.R, histograms.R, missing_detect.R, missing_impute.R, test_corr_plot.R, test_missing_detect.R, and test_missing_impute.R. RStudio's styler should fix these easily.
[ ] Avoid using sapply: The goodpractice package flagged the use of sapply in missing_detect.R as risky, since it may return either a list or a vector depending on the input. Perhaps use vapply instead.
[ ] Avoid T, F: The goodpractice package flagged the use of "T" and "F" in outliers_detect.R. "TRUE" and "FALSE" should be used in their place.
[ ] Minor spelling mistakes: After running spelling::spell_check_package(), there are some minor spelling mistakes in the .Rd files in the man folder. It could be that these were generated before the function documentation was spellchecked. There are also some minor spelling mistakes in the roxygen2 documentation. Running the spellcheck will give the exact locations.
[ ] Testing Documentation: Looks clear, and the labels make it easy to identify each test. Maybe test-histograms.R could use a few more comments.
[ ] Function Documentation: outliers_detect() could use some additional documentation on the different outlier detection methods. I think the vignette would be a great place to have either a quick link or a definition for each method. At least for me, the iforest method was unfamiliar.
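
The sapply and T/F points above can be illustrated with a minimal base R sketch. The data frame `df` and all variable names here are hypothetical and are not taken from the eaziReda source; the sketch only shows why the two patterns get flagged, not how the package should be rewritten.

```r
# Hypothetical data frame, used only for illustration.
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# sapply() picks its return type from the data: with columns present it
# simplifies to a named vector, but with zero columns it returns a list.
n_missing_sapply <- sapply(df, function(col) sum(is.na(col)))
empty_sapply <- sapply(df[, 0], function(col) sum(is.na(col)))

# vapply() declares the expected type and length of each result, so the
# output is always an integer vector (or an error if a value does not fit).
n_missing <- vapply(df, function(col) sum(is.na(col)), integer(1))
empty_vapply <- vapply(df[, 0], function(col) sum(is.na(col)), integer(1))

# T and F are ordinary variables that can be overwritten; TRUE and FALSE
# are reserved words and cannot be.
T <- 0
flag_with_T <- isTRUE(T)        # T no longer means TRUE here
flag_with_TRUE <- isTRUE(TRUE)  # unaffected by the assignment above
rm(T)                           # remove the override again
```

With the zero-column input, `is.list(empty_sapply)` and `is.integer(empty_vapply)` both hold, which is the type instability the check warns about.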

Thanks @dusty736 for the review - we'll definitely be fixing the majority of issues you've raised here this week.

@dbandrews - Absolutely! Great job!

Package Review

Briefly describe any working relationship you have (had) with the package authors.
[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s) demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all exported functions
[X] Examples (that run successfully locally) for all exported functions
[X] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[X] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software have been confirmed.
[X] Performance: Any performance claims of the software have been confirmed.
[X] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[X] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Estimated hours spent reviewing: 3 hours

[X] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Hi eaziReda developers,

Thanks for delivering this great package to the R community. I really enjoyed reading through your vignette, and I was inspired by the ideas in your package. I was able to successfully install the package and run the vignette file. The design of the histograms is neat and professional. The code of all functions is easy to follow and well commented. Here are a few thoughts that you may consider implementing in the eaziReda package in the future.

It would be helpful to add sorting to the current missing_detect() function. With a wide dataset (many features) where only one or two columns contain missing values, you don't want to scan the results for every column. You could let the columns with missing values stand out by sorting the current output by either n_missing or percent.
Users who are not familiar with the three methods in outliers_detect() may find them hard to understand. You could add brief descriptions of them to README.md or the vignette.
In the README file, the description for the missing_detect() function is missing.
The function order in the README file and the vignette is inconsistent. I think the order in the vignette makes more sense. Consider switching the descriptions for corr_plot() and remove_outlier() in the README file.
In missing_impute(), any column with numerical values is currently treated as a numerical feature. That may not always be appropriate. If we have a happiness level column containing values like (1, 2, 3, 4, NA), the function will fill in 2.5 when using the mean method. This value doesn't fit the data well, and the resulting histogram will be hard to interpret. This problem may be hard to resolve, but it would be helpful to make users aware of it in the vignette.
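
The first and last suggestions above can be sketched in base R as follows. This does not call eaziReda: the `survey` data and the `missing_summary` table are hypothetical, and only the column names `n_missing` and `percent` are borrowed from the missing_detect() output as described in this review. Using an observed-value median for ordinal data is one possible alternative, offered as an assumption rather than a recommendation from the package authors.

```r
# Hypothetical survey data; not taken from the package.
survey <- data.frame(
  id        = 1:5,
  happiness = c(1, 2, 3, 4, NA),
  city      = c("A", NA, NA, "B", "C")
)

# Per-column missing counts and percentages (base R only).
n_missing <- vapply(survey, function(col) sum(is.na(col)), integer(1))
missing_summary <- data.frame(
  feature   = names(n_missing),
  n_missing = unname(n_missing),
  percent   = 100 * unname(n_missing) / nrow(survey)
)

# Sort so the columns with the most missing values come first.
missing_summary <- missing_summary[order(-missing_summary$n_missing), ]

# Mean imputation gives 2.5, which is not a valid level on a 1-4 scale.
mean_fill <- mean(survey$happiness, na.rm = TRUE)

# One alternative for ordinal data: a median that is always an observed
# value (quantile type 1 is the inverse empirical CDF, no interpolation).
ordinal_fill <- quantile(
  survey$happiness, probs = 0.5, type = 1, na.rm = TRUE, names = FALSE
)
```

After sorting, `city` appears first in `missing_summary`, so columns with missing values are visible without scanning the whole table, and `ordinal_fill` stays on the original 1-4 scale.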

Overall, you all did a great job on this project and I can see analysts using it in the future! Hope my thoughts above are not hard to follow. Please feel free to contact me if you have any questions or concerns!

Thanks, Ivy
