UBC-MDS / software-review-2022

0 stars 0 forks source link

Submission Group 24: nedahelpeR (R) #33

Open hjw0703 opened 2 years ago

hjw0703 commented 2 years ago

Submitting Author Name: Michelle Wang(@michelle-wms), Jiwei Hu (@hjw0703), Anupriya Srivastava (@Anupriya-Sri), Sam Quist(@squisty)

Repository: (https://github.com/UBC-MDS/nedahelpeR) Version submitted:1.0.0 Submission type: Standard Editor: Michelle Wang(@michelle-wms), Jiwei Hu (@hjw0703), Anupriya Srivastava (@Anupriya-Sri), Sam Quist(@squisty)

Reviewer 1: Ruben De la Garza Macias Reviewer 2: Jennifer Hoang Reviewer 3: Harry Chan Reviewer 4: Junrong Zhu


Package: nedahelpeR
Title: Streamline Exploratory Data Analysis
Version: 0.0.0.9000
Authors@R: 
    c(
    person(given = "Jiwei", 
    family = "Hu",
    email = "hjw0703@student.ubc.ca", 
    role = c("aut", "cre")),
    person(given = "Anupriya",
    family = "Srivastava", 
    role = c("aut")),
    person(given = "Michelle",
    family = "Wang", 
    role = c("aut")),
    person(given = "Samuel",
    family = "Quist", 
    role = c("aut"))
    )
Description: Data scientists often spend a lot of time preprocessing data to extract useful parts for their analysis.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
Suggests: 
    knitr,
    rmarkdown,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Imports: 
    purrr,
    dplyr,
    tibble,
    data.table,
    tidyr,
    stats,
    utils
VignetteBuilder: knitr

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see [notes on categories][NotesOnCategories] of our guidebook.

The package includes functions which can complete the following tasks:

Display some useful statistics
Detect outliers
Deal with missing values
Check for correlation between features

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options - [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

harryyikhchan commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.: We are in the same cohort of the UBC MDS Program. I'm the reviewer of this package under the course DSCI524

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 1 hour


Review Comments

Keep up the great work! You made a great improvement to build up the R package for everyone. Here are my suggestions.

  1. For your function, it is recommended to use return() only for early returns and rely on the final result of the last evaluated expression. Here are the details of when to use the return() here
  2. In function get_correlated_features(), I think the input argument threshold can be assigned to a default value. It makes the function more intuitive and user-friendly
  3. In function flag_outliers, I tried a dataframe with categorical columns. It didn't return a proper error message. It would be great to return error messages such as "The input data must be numeric".
  4. in function flag_outliers, it would be better if you can include a brief description of how the threshold value works under the hood. Some readers may want to get more details.
  5. I understand that the input dataframe should be numeric. However, the real data usually consists of different data types such as categorical, binary, ordinal. It would be great if you can include other data types as well in your functions for future improvement.

Overall, I was just being picky and I really find this package very useful!

jennifer-hoang commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 1 hr


Review Comments

Great work again everyone! Please find my suggestions below:

  1. In the function documentation of missing_imputer, it would be nice to list the possible methods of imputation in the roxygen2 comments, similar to how it was shown in the python version of the package.
  2. In the function documentation of flag_outliers, it would be useful to clarify how NA values are handled by the function in the percentage calculations (i.e. are they included or excluded?).
  3. In the function correlated-features, exception handling could be added for the consider_sign argument.
  4. In the function correlated-features, an error could be raised when a data frame contains a non-numeric column to improve usability.
  5. There are a few minor tidyverse style guide inconsistencies in some of the scripts and tests, such as spacing around operators. Using the styler feature in RStudio could help address these points and improve code readability for future users and contributors.

Best, Jennifer

zjr-mds commented 2 years ago

title: "review" output: rmarkdown::md_document: pandoc_args: [ "--wrap=none" ]

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 1hr

Great package and everything worked as expected! Great job again everyone! (I had to be super picky to find any improvements you guys may need XD) please find my comments below:

  1. It’d be better to specify the data type of the input data parameter in the Roxygen2 comments of the missing_imputer function
  2. Additional tests can be added to check the output result of the overview function. So far you’ve checked the sum of the means and sum of variance, maybe you can include an additional one for the medians and stds, or just one for the entire df.
  3. It’d be nice to add some code comments for the Overview function like how you guys did in the other three functions.
  4. It’d be better to specify the data type of ‘threshold’ in the flag_outliers function.
  5. Also, like the reviewer mentioned above, there are minor tidyverse style inconsistencies that you could improve if you got the time.

Review Comments

ruben1dlg commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing:


Review Comments

Your package is impressive guys! I just have a few minor comments:

  1. In the missing_imputer function argument description, I think it would be a good idea to specify what are the possible values that 'method' can get.
  2. For the flag_outliers function (Roxygen), I would add a more detailed explanation about what the function is doing, it is not very clear to me how the function is flagging the outliers and how is it defining by the threshold which data point are actually outliers. In the vignette it is very clear how the function is actually working!
  3. As an enhancement to the overview function, I would probably think adding some quantile information could be very useful, just to have a better idea of the distribution without having to see it.
  4. While going through your functions it was kind of hard for me to read the code sometimes because of the way brackets are opened, closed and the indentation in some cases. It would be good to correct some of it, but this is something minor since the functions work as expected!
  5. I did not see a lot of code inside your functions for handling edge cases, for example what happens when the user enters a df with an ID column and a numeric column, or cases when you only have one data point or an empty df. I do not think this will be a problem, but it is cool to have those cases covered when you never know what the user can input into the function.