Submission: Rcat (R) - Githubissues

Submitting Author: Eithar Elbasheer (@EitharAlfatih), Hanying Zhang (@HanyingZhang), Netanel Barasch (@TBarasch), Yingping (Dora) Qian (@doraqmon) Repository: https://github.com/UBC-MDS/Rcat Version submitted: 1.1.0 Editor: @kvarada Reviewer 1: Tejas Phaterpekar (@tejasph) Reviewer 2: Robert Blumberg (@RobBlumberg) Archive: TBD
Version accepted: TBD

DESCRIPTION file:

Package: Rcat
Title: a collection of EDA related functions
Version: 0.0.0.9000
Authors@R: 
    person(given = "Dora",
           family = "Qian",
           role = c("aut", "cre"),
           email = "kumakoq@gmail.com")
Description: Helps users to deal with missing and suspicious values, find the top correlated features during exploratory data analysis stage.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.0.2
Suggests: 
    testthat (>= 2.1.0),
    covr,
    knitr,
    rmarkdown
Imports: 
    vctrs,
    lifecycle,
    pillar,
    dplyr,
    stringr,
    naniar,
    reshape
URL: https://github.com/UBC-MDS/Rcat
BugReports: https://github.com/UBC-MDS/Rcat/issues
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [x] data munging
- [ ] data deposition
- [ ] workflow automataion
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences): Rcat is a package that provides a collection of convenient functions for Exploratory Data Analysis (EDA). As EDA is a process where data is transformed and analyzed, it falls under the category of data munging.
Who is the target audience and what are scientific applications of this package?
Our target audiences are the data scientists or anyone who is going to do a data science project. This package will help them with the EDA processes in their projects. It explores the whole dataset and reports it to the users.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? There are several existing packages in R that implement similar target in different aspects, e.g. SmartEDA, DataExplorer, inspectdf, but Rcat is trying to incorporate all things together.
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Technical checks

Confirm each of the following by checking the box.

[x] I have read the guide for authors and rOpenSci packaging guide.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Options

- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: - (*Do not submit your package separately to JOSS*)

[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Overall, I was quite impressed by the package's ability to tackle common inconveniences that coders face everyday in RStudio. Every function had a narrow, focused goal and was planned out well. Particularly, I found your README.md file to be organized, succinct, and easy to navigate. Below, are some observations I made as well as some small suggestions:

General

Code coverage was at an impressive 95%
Running devtools::check() gave no errors on my local computer
devtools::test() gave no errors upon running all your test functions
Some lines were > 80 characters

Functionality/Documentation

misscat
- Might consider re-wording threshold parameter (in your docs) as a proportion instead of ratio for a simpler understanding of the function
- Could potentially add a new variable that allows user to specify whether they care about columns or rows. This would increase the flexibility of the function
suscat
- Could consider moving all those internal tests to their own separate function because it takes up quite a bit of space inside the function. This is more of a preferential thing
repwithna
- According to goodpractice::gp(), consider using vapply instead of sapply because apparently sapply isn't "type safe". In other words, it might return a vector or a list depending on the input type. Using vapply would ensure this doesn't happen.
- When running devtools::test(), I observed the following warning:
```
test-repwithna.R:24: warning: Blank strings should be replaced.
funs() is soft deprecated as of dplyr 0.8.0
Please use a list of either functions or lambdas: 

# Simple named list: 
list(mean = mean, median = median)

# Auto named with `tibble::lst()`: 
tibble::lst(mean, median)

# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once per session.
```
- To ensure compatability with future versions of dplyr, you may want to use a different function to replace funs
topcorr
- Maybe order the feature pairs in terms of the absolute correlation value. Let's say someone used the function to create a large table of feature pair correlations, and then they were interested in comparing the pairs with the highest correlations. They would have trouble comparing highly correlated positive pairs with highly correlated negative pairs since they would be at opposite ends of the table.

SpellCheck flags from spelling::spell_check_package():

avaliable
- README.md:59
- README.Rmd:50
- Rcat-vignette.Rmd:110
Defaultly
- Rcat-vignette.Rmd:56
demostration
- README.md:220
- README.Rmd:191
intger
- README.md:204
- README.Rmd:178
regluar
- Rcat-vignette.Rmd:68
Vigenette
- README.md:21
- README.Rmd:30
visualisation
- README.md:68
- README.Rmd:56
- Rcat-vignette.Rmd:116
visualisations
- README.md:59
- README.Rmd:50
- Rcat-vignette.Rmd:110

Ultimately, I couldn't find any major issues despite trying my best to break your functions. In fact, it is my opinion that the package is relatively well-polished and ready for deployment. Great work!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

First of all, I want to preface my comments by saying that for the most part, your package functioned as documented, and it was simple to understand and easy to use all of its functions. As such, most of my feedback is based on smaller technical details that I think would be reasonable to implement, and would improve the overall quality of the package.

Functions

misscat

The documentation states that the function "Drops rows or columns containing missing values if the number of the missing values exceeds a threshold". However, it seems that it is unable to drop columns whose proportion of missing values exceeds the threshold. For instance, let's take the code below:

df = data.frame("x1" = c(30,NA,NA,NA,NA), "x2" = c(1,2,3,5,6), "x3" = c(1,4,3,5,6))
misscat(df, 0.4)

Based on the documentation, I'd except column "x1" to be dropped, since 80% of its values are missing, which exceeds the 0.4 threshold. However, misscat returns a data frame identical to df, ie., without dropping column "x1". If this is not a straightforward fix, I would suggest changing the documentation to indicate that the function can only be used to drop rows based on missing values.

suscat

After experimenting with different inputs and looking at the source code, it appears that the function identifies outliers as being values that fall outside a specified interval based on ordinality only. This can lead to potentially undesirable behaviour, as shown below

df = data.frame("x1" = c(1, 200, 200, 200, 300, 200, 300, 200, 10000), 
                           "x2" = c(1, 2, 3, 5, 6, 7, 8, 9, 10000)))
suscat(df, c("x1", "x2"), n = 1, num = "number")

In this case, values 1 and 10000 are identified as outliers in both columns, whereas in reality, 1 is probably not an outlier in the "x2" column. Instead of just looking at ordinality to drop values, it would be better to use a slightly more advanced technique of identifying outliers, even something as simple as considering the number of standard deviations from the mean of each column.

repwithna

My only comment on this function is that it appears that with the rmvpunc argument set to TRUE, it is not exactly clear to me what symbols are considered to be punctuation. For instance, the string "@@@***" is replaced with NA, whereas "@@@***^^^" is not, which suggests that "@" and "*" are considered punctuation while "^" is not. In reality, I wouldn't consider any of those symbols to be punctuation. It would be useful to use the word "symbol" instead of "punctuation" in your documentation, and also list the characters which are considered to be "symbols" by the rmvpunc argument.

topcorr

My only comment on this function is that it appears it calculates Pearson correlation coefficients, and as such, ignores categorical features. It would be useful to explicitly state this in the documentation, since there are other correlation metrics for categorical features.

Other miscellaneous comments

In your README and other documentation, it would be better to use more impersonal language and avoid terms like "our" and "we", given that this is a public package. This is a comment that was given to my group by a reviewer at an earlier stage of our project, and that I tend to agree with.
All unit tests passed locally with a code coverage of 95%, but to reiterate what the previous reviewer mentioned, I also get a warning about the soft-deprecated dplyer::funs. Thus, I would also suggest looking into replacing this function to ensure compatibility with future versions.

Overall this is a nice package with straightforward, yet useful functions. Apart from a few edge cases, the functions work as documented. Good job!

@RobBlumberg

Thank you for taking the time to review our packages.

All your suggestions are valuable to our team! Due to limited time, we have addressed and implemented the following changes based on your review:

misscat: Fixed and implemented the documentation for the threshold parameter in both function and readme.
suscat: This was indeed one of the ideas for ways to implement the function, however, it has its own issues, for example heavily skewed distributions would show outliers only from one side. Ideally, this function would have had more than one method available for use (such as confidence intervals, SD distance, others). However due to time constraints, the choice was made to proceed only with the CI method, hopefully, the others will be added in the future.
repwithna: used the word "symbol" instead of "punctuation" and renamed the argument as "rmvsym".
topcorr: modified the documentation to state "Pearson correlation"

There are some other suggestions we did not implement in this milestone. We will definitely take into account your constructive feedback in the future releases.

You can find our new release here

@tejasph

Hi Tejas! Thank you for taking the time and effort in reviewing our package. And thanks for the constructive feedback. We are happy to address feedback and implement the following changes in each function:

repwithna: replace dplyr::funs() function with list function as suggested. Regarding using vapply instead of sapply, it seems unnecessary as the input type can only be a data frame due to the input test provided.
topcorr: order the feature pairs in terms of the absolute correlation value.
suscat: move the input tests to a helper function as suggested, good idea, looks cleaner.
misscat: reword the threshold parameter in the documentation is implemented; regarding adding the option of dropping the column, we see that dropping a variable should be in the user's hand and decided to continue without making it an option in the function.

The new release can be found here.

UBC-MDS / software-review

Submission: Rcat (R) #20

Scope

Technical checks

Publication options

Code of conduct

Package Review

Documentation

For packages co-submitting to JOSS

Functionality

Final approval (post-review)

Review Comments

General

Functionality/Documentation

Package Review

Documentation

For packages co-submitting to JOSS

Functionality

Final approval (post-review)

Review Comments

Functions

Other miscellaneous comments