UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages
1 stars 0 forks source link

Submission: xplrrr #44

Open bradentam opened 4 years ago

bradentam commented 4 years ago

Submitting Author: Braden Tam (@bradentam), Lori Fang (@lori94), Serg Pokrovskyy (@pokrovskyy), Furqan Khan (@fkhan72) Repository: xplrrr Version submitted: v0.3.5 Editor: Varada Kolhatkar (@kvarada) Reviewer 1: Keanna Knebel (@Keanna-K) Reviewer 2: Andres Pitta (@AndresPitta) Archive: TBD
Version accepted: TBD


Package: xplrrr
Title: This package makes exploratory data analysis (EDA) simple and seamless. EDA is a crucial phase in the data science workflow as it allows us get a fist glimpse of the data. It is important to identify statistical characteristics of the data so that we can properly set up the rest of the analysis.
Version: 0.3.5.358
Authors@R: c(person(given = "Furqan",
                   family = "Khan",
                   role = c("aut"),
                   email = "furqan@example.com"),
            person(given = "Serhiy",
                   family = "Pokrovskyy",
                   role = c("aut"),
                   email = "serg@pokrovskyy.com"),
            person(given = "Braden",
                   family = "Tam",
                   role = c("aut"),
                   email = "btam97@gmail.com"),
            person(given = "Yu",
                   family = "Fang",
                   role = c("aut","cre"),
                   email = "Yu@example.com"))
Description: What the package does (one paragraph).
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.0.2
Imports:
    tidyverse,
    purrr,
    ggplot2,
    gridExtra,
    nlme,
    matrixStats
Suggests: 
    testthat (>= 2.1.0),
    knitr,
    rmarkdown,
    covr,
    pkgdown
VignetteBuilder: knitr
URL: https://github.com/UBC-MDS/xplrrr
BugReports: https://github.com/UBC-MDS/xplrrr/issues

Scope

This package has 4 functions: data summarization, data visualization, identifying outliers, and identifying missing values.

Our target audience is people who are just starting to get into data science and are not too familiar with EDAs. This package will allow users to easily start their EDAs without getting into the nitty-gritty of hard coding.

DataExplorer is another EDA tool. Other than this package, there are not many available. Our package attempts to simplify the EDA process by reducing the amount of coding required. All you need to get started is a data frame which can then be fed into our functions.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

JOSS Options - [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: - (*Do not submit your package separately to JOSS*)
MEE Options - [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

AndresPitta commented 4 years ago

Hi @bradentam, @fkhan72, @lori94 and @pokrovskyy

Here is the review:

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [x] A short summary describing the high-level functionality of the software
  • [x] Authors: A list of authors with their affiliations
  • [x] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 5h


Review Comments

General Comments

The information provided in the README file is clear and the problem to be solved is clearly stated. Installation instructions and dependencies are also stated in the README file and the vignette. The vignette contains good visual representations of the output of the model. However, for some functions, the documentation is not specific enough. Please, refer to the Comments Related to the Functions section to see the specific feedback on each function. The Contributors, Contributing, Conduct and License files are in order (The conduct is pointing to this mail mes335@cornell.edu) . The package addresses a very important part of a machine learning project workflow, good job on the topic selection.

Also, I saw something about reference on the list, but I am not clear about what to reference here since everything is 'created by us' (Maybe we should have referenced the cookiecutter repo). That is why I did not check that box in the form above.

Comments Related to the Functions:

-explore_summary:

Improvements: For some reason, this is the only documentation that is not rendering when I do ?explore_summary (Please let me know if you can do this one on your local machines. Also, I tried writing the @export at the end of the function but it did not solve the problem).

I feel like this function should cover more corner cases, like handling dataframes that do not comply with the use of the function. I tried explore_summary(nycflights13::airlines), which is a dataframe that only contains variables, and the error I get is ‘Error: Can't use numeric NA as column index with [.’ . I feel like the error here could be more informative in the context of the function.

Strong points: Good job on selecting only the numeric variables. In general, this function generates a pretty informative summary that is helpful during the EDA on a machine learning project. The example works correctly.

-explore_feature_map:

Improvements: Not much on this one.

Strong points: The documentation is very complete. The function handles many corner cases and they are explained in the documentation. Again, I tried explore_feature_map(nycflights13::airlines) and the error message I got was informative. The visualization is cool and useful for machine learning purposes. The example works correctly

-explore_outliers:

Improvements: The documentation could be more informative regarding the technique used to select outliers. Reading the code, I see it is using 2 standard deviations, but the documentation could be clearer.

The function does not work unless the whole dataset is numeric. I consider the function could just generate the outliers for the numeric variables and ignore the character variables. This function could adopt the explore_missing parameter type, to generate count and location.

Strong point: I like the use of this function. Using standard deviations is a very straightforward technique which is useful for the EDA stage of a project. The output of the function can be used for identifying critical variables. The example works correctly.

-explore_missing:

Improvements: I see the parameter location is mentioned on the general description but its use and expected output could be clearer.

Strong point: Having the parameter type was a good idea. This function covers a lot of corner cases, which is good. The output is really informative. The example works correctly.

I hope the comments are constructive and you can improve the package with them. I see your hard work.

Thanks guys,

Andres

Keanna-K commented 4 years ago

Hi @bradentam , @pokrovskyy , @fkhan72 , and @lori94,

Please find my review of your package below:

Package Review

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS (N/A)

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 5 hours


Review Comments

General Comments

Overall, I really enjoyed testing out and reviewing your R package, 'xplrrr'. The documentation stated a clear purpose and the repository was well organized, including all relevant documents/files (i.e. Contributors, Contributing, License, Conduct). The relevance of your package was made clear -- to aid in the process of exploratory data analysis -- and all functions fit coherently and within your designated scope. I found your package intuitive to use and learn as all your functions have similar naming conventions (i.e. explore_*) and ordering of arguments. Additionally, the documentation for the package installation and dependencies was clear and worked as described; however, it was noted that the documentation for the explore_summary function was not present when using R help.

When performing the review I also noted some minor changes required for the package to comply with the rOpenSci packaging guidelines.

  1. You should import any outside packages/ functions instead of using the library() and require() functions, as they change the global search path.
    • Specifically, importFrom("stats", "cor", "na.omit") to your NAMESPACE file
  2. There are a few lines in your code and function descriptions that are longer than 80 characters.
  3. You should be consistent with your use of '<-' for assignment, there are a few instances where '=' is used instead. '<-' is the standard, and R users and developers are used it and it is easier to read your code for them if you use '<-'.
    • Lines that use the '=' for assignment include: explore_missing.R:49:8 explore_missing.R:52:8 explore_outliers.R:20:14 explore_outliers.R:27:15 test-explore_outliers.R:3:6

As an aside, it was not indicated that you were interested in submitting to JOSS thus, this section of the review checklist was not filled out. If in the future you do decide to submit the package for publication I would be happy to review these extra criteria.

Function Specific Comments:

explore_outliers :

Improvements: It would be nice if it was explicitly stated how the outliers are being defined (i.e. 1.5x the IQR), this reinforces a sense of trust. when using someone else's package. Additionally, it would be nice if all the package functions were consistent in how they deal with non-numeric data frames/columns. Specifically, this function requires the entire data frame to be numeric whereas the explore_feature_map simply excludes the non-numeric columns. Strong Points: The function documentation was fairly clear and the examples all worked as described.

explore_missing :

Improvements: The function description could be stated more clearly, as I had some difficulties understanding the specification of outputs based on the documentation alone. Specifically, I would use a word other than 'explore' in the description as it is quite vague, consider using 'find', 'explain', or 'describe' instead. Additionally, there should be a test for the input type for the num_rows argument. When inputting a string (ex. num_rows = 'five'), the function simply returns ERROR but with no additional explanation on the error type or cause. Strong Points: Very useful function, especially when performing EDA. Examples work as expected.

explore_summary :

Improvements: As stated earlier, the documentation for this function does not show up when using R help. It would be beneficial if the function gave an explanation on how it deals with NA values and gave warnings when the data set be analyzed contained NA values. Strong Points: The function works well on various examples and is easy to understand.

explore_feature_map :

Improvements: Consider adding an option for handling data frames that contain NA values. Currently, if this function is performed on a data frame that has NA values, the function still runs without throwing any errors or warnings but the pairwise Pearson Correlation is not shown for the corresponding columns. Strong Points: Documentation of the function was clear and precise. The examples ran as described and the output is quite visually pleasing!

pokrovskyy commented 4 years ago

@Keanna-K & @AndresPitta, thanks for your valuable feedback and your time spent reviewing our package.

@Keanna-K, your feedback has been addressed in full except the minor cosmetic note on 80-symbol lines. Still coping with that :) Currently opened issue to track this last update can be found here https://github.com/UBC-MDS/xplrrr/issues/38

@AndresPitta, we addressed most of your feedback except handling more corner cases for the explore_summary function. The currently opened issue tracking progress of this item can be found here https://github.com/UBC-MDS/xplrrr/issues/33

Thank you!