Submission: Rcat (R) #20

Open HanyingZhang opened 4 years ago

HanyingZhang commented 4 years ago

Submitting Author: Eithar Elbasheer (@EitharAlfatih), Hanying Zhang (@HanyingZhang), Netanel Barasch (@TBarasch), Yingping (Dora) Qian (@doraqmon)
Repository: https://github.com/UBC-MDS/Rcat
Version submitted: 1.1.0
Editor: @kvarada
Reviewer 1: Tejas Phaterpekar (@tejasph)
Reviewer 2: Robert Blumberg (@RobBlumberg)
Version accepted: TBD

Package: Rcat
Title: a collection of EDA related functions
Description: Helps users to deal with missing and suspicious values, find the top correlated features during exploratory data analysis stage.
tejasph commented 4 years ago

Package Review

The package includes all the following forms of documentation:

Review Comments

Overall, I was quite impressed by the package's ability to tackle common inconveniences that coders face every day in RStudio. Every function had a narrow, focused goal and was planned out well. In particular, I found your README.md file to be organized, succinct, and easy to navigate. Below are some observations I made, as well as some small suggestions:



Ultimately, I couldn't find any major issues despite trying my best to break your functions. In fact, it is my opinion that the package is relatively well-polished and ready for deployment. Great work!

RobBlumberg commented 4 years ago

Package Review

The package includes all the following forms of documentation:

Review Comments

First of all, I want to say that, for the most part, your package functioned as documented, and all of its functions were simple to understand and easy to use. As such, most of my feedback concerns smaller technical details that I think would be reasonable to implement and would improve the overall quality of the package.


The documentation for misscat states that the function "Drops rows or columns containing missing values if the number of the missing values exceeds a threshold". However, it seems unable to drop columns whose proportion of missing values exceeds the threshold. For instance, take the code below:

df = data.frame("x1" = c(30,NA,NA,NA,NA), "x2" = c(1,2,3,5,6), "x3" = c(1,4,3,5,6))
misscat(df, 0.4)

Based on the documentation, I'd expect column "x1" to be dropped, since 80% of its values are missing, which exceeds the 0.4 threshold. However, misscat returns a data frame identical to df, i.e., without dropping column "x1". If this is not a straightforward fix, I would suggest changing the documentation to indicate that the function can only be used to drop rows based on missing values.
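As a minimal sketch of the column case, one option is to compute the share of missing values per column with colMeans(is.na(df)) and keep only the columns at or below the threshold. The helper name drop_sparse_columns is hypothetical and not part of Rcat, and this is not a description of how misscat is implemented.

```r
# Hypothetical helper (not part of Rcat): drop columns whose share of
# missing values exceeds `threshold`.
drop_sparse_columns <- function(df, threshold) {
  stopifnot(is.data.frame(df), threshold >= 0, threshold <= 1)
  missing_share <- colMeans(is.na(df))  # proportion of NA values per column
  df[, missing_share <= threshold, drop = FALSE]
}

df <- data.frame("x1" = c(30, NA, NA, NA, NA),
                 "x2" = c(1, 2, 3, 5, 6),
                 "x3" = c(1, 4, 3, 5, 6))
result <- drop_sparse_columns(df, 0.4)
names(result)
```

Here "x1" has a missing share of 0.8, so only "x2" and "x3" are kept.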

After experimenting with different inputs and looking at the source code, it appears that suscat identifies outliers as values that fall outside a specified interval based on ordinality only. This can lead to potentially undesirable behaviour, as shown below:

df = data.frame("x1" = c(1, 200, 200, 200, 300, 200, 300, 200, 10000),
                "x2" = c(1, 2, 3, 5, 6, 7, 8, 9, 10000))
suscat(df, c("x1", "x2"), n = 1, num = "number")

In this case, values 1 and 10000 are identified as outliers in both columns, whereas in reality, 1 is probably not an outlier in the "x2" column. Instead of just looking at ordinality to drop values, it would be better to use a slightly more advanced technique of identifying outliers, even something as simple as considering the number of standard deviations from the mean of each column.
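The standard-deviation idea could look roughly like the sketch below, which flags values lying more than k sample standard deviations from their column mean. The helper flag_by_sd and the cutoff k = 2 are assumptions made for illustration; neither comes from Rcat.

```r
# Hypothetical helper (not part of Rcat): flag values more than `k`
# sample standard deviations away from the column mean.
flag_by_sd <- function(x, k = 2) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  !is.na(z) & abs(z) > k
}

df <- data.frame("x1" = c(1, 200, 200, 200, 300, 200, 300, 200, 10000),
                 "x2" = c(1, 2, 3, 5, 6, 7, 8, 9, 10000))
flags <- lapply(df, flag_by_sd, k = 2)
lapply(flags, which)  # row positions flagged in each column
```

On this data only the value 10000 is flagged in each column, and 1 is left alone. The cutoff matters with small samples: for n values, no z-score can exceed (n - 1) / sqrt(n) in absolute value, which is about 2.67 for n = 9, so the common cutoff of 3 would flag nothing here. The mean and standard deviation are also pulled by the outliers themselves, so this remains a rough heuristic.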

My only comment on repwithna is that, with the rmvpunc argument set to TRUE, it is not clear to me which symbols are considered punctuation. For instance, the string "@@@***" is replaced with NA, whereas "@@@***^^^" is not, which suggests that "@" and "*" are considered punctuation while "^" is not. In reality, I wouldn't consider any of those symbols to be punctuation. It would be useful to use the word "symbol" instead of "punctuation" in your documentation, and also to list the characters that the rmvpunc argument treats as "symbols".
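One way to make the behaviour predictable, sketched under the assumption that an explicit, documented symbol set is acceptable, is to check each string character by character against that set. The helper is_only_symbols and its default symbol list are hypothetical and are not taken from Rcat.

```r
# Hypothetical helper (not part of Rcat): TRUE when a string is non-empty
# and made up only of characters from an explicitly listed symbol set.
is_only_symbols <- function(x, symbols = c("@", "*", "^", "#", "?")) {
  vapply(strsplit(x, "", fixed = TRUE), function(chars) {
    length(chars) > 0 && all(chars %in% symbols)
  }, logical(1))
}

values <- c("@@@***", "@@@***^^^", "abc@", "")
values[is_only_symbols(values)] <- NA
values
```

With the set spelled out, both "@@@***" and "@@@***^^^" are replaced with NA, while "abc@" and the empty string are left unchanged. The same list can then be copied into the documentation.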

My only comment on topcorr is that it appears to calculate Pearson correlation coefficients and, as such, ignores categorical features. It would be useful to state this explicitly in the documentation, since there are other correlation metrics for categorical features.
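To illustrate why categorical features drop out, here is a small base R sketch that keeps only numeric columns before calling cor() with method = "pearson". The data frame and its column names are made up, and whether topcorr works this way internally is an assumption.

```r
# Sketch in base R: Pearson correlation is defined for numeric columns,
# so non-numeric columns are dropped before calling cor().
df <- data.frame(height = c(150, 160, 170, 180, 190),
                 weight = c(50, 58, 69, 80, 93),
                 group  = factor(c("a", "b", "a", "b", "a")))
numeric_cols <- df[, sapply(df, is.numeric), drop = FALSE]
corr_matrix <- cor(numeric_cols, method = "pearson")
colnames(corr_matrix)
```

The factor column "group" never reaches cor(), so it cannot appear among the top correlated features.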

Other miscellaneous comments

Overall this is a nice package with straightforward, yet useful functions. Apart from a few edge cases, the functions work as documented. Good job!

doraqmon commented 4 years ago


Thank you for taking the time to review our package.

All your suggestions are valuable to our team! Due to limited time, we have addressed and implemented the following changes based on your review:

  1. misscat: Fixed the documentation for the threshold parameter in both the function and the README.
  2. suscat: The approach you suggested was indeed one of the ideas we considered for implementing the function. However, it has its own issues; for example, heavily skewed distributions would show outliers on one side only. Ideally, this function would offer more than one method (such as confidence intervals, SD distance, or others), but due to time constraints we chose to proceed with the CI method only. Hopefully the others will be added in the future.
  3. repwithna: used the word "symbol" instead of "punctuation" and renamed the argument as "rmvsym".
  4. topcorr: modified the documentation to state "Pearson correlation"

There are some other suggestions we did not implement in this milestone. We will definitely take your constructive feedback into account in future releases.

You can find our new release here

HanyingZhang commented 4 years ago


Hi Tejas! Thank you for taking the time and effort to review our package, and thanks for the constructive feedback. We are happy to address the feedback and implement the following changes in each function:

The new release can be found here.