UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: encodeR (R) #13

Open braydentang1 opened 4 years ago

braydentang1 commented 4 years ago

name: encodeR
about: This package provides a convenient set of functions for encoding categorical features in ways that are potentially more informative than other, more standard methods. The user supplies a training and a testing dataset with categorical features, and the returned data frames have those features replaced by a specific encoding. At a high level, this package automates the preprocessing of categorical features in ways that exploit particular correlations between the different categories and the data without increasing the dimension of the dataset, unlike one-hot encoding. Through this more deliberate handling of categorical features, higher model performance can possibly be achieved.


Submitting Author: Team Maryam Mirzakhani (L02 Group 17: @bbaillie @braydentang1 @robilizando @fsywang)
Repository: https://github.com/UBC-MDS/encodeR
Version submitted: 1.2.0
Editor: @kvarada
Reviewer 1: @camadi
Reviewer 2: @singh-karanpal
Archive: TBD
Version accepted: TBD


Package: encodeR
Title: A collection of categorical encoders in R
Version: 0.0.0.9000
Authors@R:
    c(person(given = "Brayden",
             family = "Tang",
             role = c("aut"),
             email = "brayden.tang1@gmail.com"),
      person(given = "Shiying",
             family = "Wang",
             role = c("aut"),
             email = "fsywang@ucdavis.edu"),
      person(given = "Bronwyn",
             family = "Baillie",
             role = c("aut", "cre"),
             email = "Baillie.bronwyn@gmail.com"),
      person(given = "Robert",
             family = "Pimentel",
             role = c("aut"),
             email = "robilizando@yahoo.com"))
Description: Employs many modern ways to encode categorical features to hopefully produce more informative encodings
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.0.2
Imports: 
    dplyr,
    readr,
    rlang,
    tidyr,
    purrr,
    magrittr,
    tidyselect,
    fastDummies
Suggests: 
    testthat (>= 2.1.0),
    knitr,
    rmarkdown,
    pkgdown,
    covr
VignetteBuilder: knitr
URL: https://github.com/UBC-MDS/encodeR
BugReports: https://github.com/UBC-MDS/encodeR/issues

Scope

This package preprocesses categorical features by encoding them in novel ways for better modeling performance.

The target audience of this package is people who practice predictive modeling. Any data scientist or researcher who is focused on supervised or unsupervised learning tasks will find this package useful, especially if their dataset contains a large number of categorical features.

There are some packages in R that include different, more sophisticated kinds of encoding methods. The well-known framework H2O has a function for target encoding, and the recipes package can one-hot encode. The package cattonum also contains many kinds of encoding schemes, such as frequency encoding, target encoding, and one-hot encoding.

However, our package implements conjugate encoding, a very new method of encoding categorical features that was published in 2019 and does not have any R implementation. Furthermore, all of the functions in this package are in one place, rather than scattered among many different packages with different APIs.
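As a rough illustration of the kind of encoding discussed here, a smoothed target encoding can be sketched in base R. This is not the package's actual implementation: the function name `sketch_target_encode`, its arguments, and the use of `prior` as a pseudo-count are assumptions made for this example only.

```r
# Hypothetical helper, NOT part of encodeR: smoothed target (mean) encoding.
sketch_target_encode <- function(train, test, y, col, prior = 1) {
  global_mean <- mean(y)
  levels_seen <- unique(as.character(train[[col]]))

  # Per-category sum and count of the target, learned from the training data only.
  sums <- vapply(levels_seen, function(l) sum(y[train[[col]] == l]), numeric(1))
  counts <- vapply(levels_seen, function(l) sum(train[[col]] == l), numeric(1))

  # Shrink each category mean toward the global mean; `prior` acts as a pseudo-count.
  encoding <- (sums + prior * global_mean) / (counts + prior)

  encode <- function(x) {
    out <- unname(encoding[as.character(x)])
    out[is.na(out)] <- global_mean  # categories never seen in training
    as.numeric(out)
  }

  train[[col]] <- encode(train[[col]])
  test[[col]] <- encode(test[[col]])
  list(train = train, test = test)
}

train <- data.frame(colour = c("red", "red", "blue", "blue", "blue"),
                    stringsAsFactors = FALSE)
y <- c(1, 0, 1, 1, 1)
test <- data.frame(colour = c("red", "green"), stringsAsFactors = FALSE)

encoded <- sketch_target_encode(train, test, y, "colour", prior = 1)
```

In this toy example the `colour` column stays a single column: "red" becomes 0.6, "blue" becomes 0.95, and the unseen test category "green" falls back to the global target mean of 0.8. A one-hot encoding of the same column would instead add one column per category.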

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

JOSS Options
- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:
- (*Do not submit your package separately to JOSS*)

MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

camadi commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

This is a cool package capable of solving problems for the target audience.

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing:

2 hours


Review Comments

A statement of need/Usefulness: There is a clearly documented statement of need that demonstrates the usefulness of the package. This is a very useful package since it integrates different encoding types (including quite popular encoders) into one package, so users do not have to import several different packages. It also incorporates a new encoding method based on a research paper.

Installation: I tested the installation via devtools::install_github("UBC-MDS/encodeR") as per the instructions. I ran it on macOS Catalina 10.15.3 and can confirm that it completed without a hitch. All guidelines were straightforward and easy to follow.

Local installation: Local installation was also successful.

Functionality: All the individual functions worked well. However, I expected to see an error or exception when I passed a "non-categorical variable", such as hp from mtcars, in the cat_columns argument of the onehot_encoder function, for example. Instead, the process still completed successfully. Could you raise an exception in this case and add a test that checks for it?

It is worth mentioning that the conjugate encoding function raised the following very informative exception: "NA's fitted for expected variance. The variance of a single data point does not exist. Make sure columns specified are truly categorical." It was followed by the message: Joining, by = "hp".
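A check of the kind requested above could be sketched as follows. This is only an illustration: `check_categorical` is a hypothetical helper, not an existing encodeR function, and treating only factor, character, and logical columns as categorical is an assumption.

```r
# Hypothetical helper, NOT part of encodeR: reject columns that do not look categorical.
check_categorical <- function(data, cat_columns) {
  missing_cols <- setdiff(cat_columns, names(data))
  if (length(missing_cols) > 0) {
    stop("Columns not found in data: ", paste(missing_cols, collapse = ", "))
  }

  # Assumption: only factor, character, and logical columns count as categorical.
  is_cat <- vapply(
    cat_columns,
    function(col) {
      is.factor(data[[col]]) || is.character(data[[col]]) || is.logical(data[[col]])
    },
    logical(1)
  )
  if (!all(is_cat)) {
    stop("Columns are not categorical: ", paste(cat_columns[!is_cat], collapse = ", "))
  }
  invisible(TRUE)
}

# A numeric column such as mtcars$hp is rejected; capture the error message.
result <- tryCatch(
  check_categorical(mtcars, "hp"),
  error = function(e) conditionMessage(e)
)
```

With testthat, the matching test could be as short as `expect_error(check_categorical(mtcars, "hp"), "not categorical")`. Note that this rule would also reject integer-coded categories such as cyl in mtcars, so the authors may prefer a looser check.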

Performance: There were no performance issues at all.

Tests

After I installed pkgdown locally, the rcmdcheck check passed without any errors or warnings. However, pkgdown was NOT listed as a dependency. You may want to investigate further.

covr::package_coverage() call returned:

encodeR Coverage: 96.15%
R/target_encoder.R: 94.83%
R/frequency_encoder.R: 96.15%
R/onehot_encoder.R: 96.55%
R/conjugate_encoder.R: 96.84%

General Comments: I am wondering whether there is a reason why the README covers only the conjugate_encoder and target_encoder functions, whereas the vignettes cover all functions.

Overall, there is ample evidence of hard work and of a well-thought-out package. The package is functional and has the capacity to be incredibly useful.

singh-karanpal commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in the comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • [ ] A short summary describing the high-level functionality of the software
  • [ ] Authors: A list of authors with their affiliations
  • [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
  • [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 1.5 hours


Review Comments

braydentang1 commented 4 years ago

Responses:

@camadi

Thanks for the review.

Addressed for this release:

Did not address for this release:

@singh-karanpal

Thanks for the feedback! We are happy that the installation and package usage went smoothly for you.

Release Link:

All of the addressed feedback, as discussed above, can be viewed in our latest release, v1.2.1.