Open braydentang1 opened 4 years ago
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
This is a cool package capable of solving problems for the target audience.
The package includes all the following forms of documentation:
URL
, BugReports
and Maintainer
(which may be autogenerated via Authors@R
).For packages co-submitting to JOSS
- [ ] The package has an obvious research application according to JOSS's definition
The package contains a
paper.md
matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing:
2 hours
A statement of need/Usefulness There is clearly-documented statement of need which demonstrates the usefulness of the package. This is a very useful package since it integrates different encoding types into one package, (including quite popular encoders) without having to import different packages. It also incorporates new encoding method - based off of research paper.
Installations: The installation via devtools::install_github("UBC-MDS/encodeR"), was tested as per instructions. I ran the installation on macOS Catalina 10.15.3, and can confirm that the instruction was successful without hitches. All guidelines were straight-forward and easy to follow.
Local installation Local installation was also successful.
Functionality
All the individual functions worked well. However, I was expecting to see an error or exception when I pass a “non categorical variable” like hp
of mtcars
in the cat_columns
argument of onehot_encoder function, for example. But the process still completed successfully. Can you raise exception and add a test to check the exception?
It is worthy of mentioning that the following very informative exception was raised for Conjugate encoding function: NA's fitted for expected variance. The variance of a single data point does not exist. Make sure columns specified are truly categorical.Joining, by = "hp".
Performance There was no performance issue at all.
Tests
Running test()
carried out 48 tests and all passed
check()
gave no error or warning but 3 notes which is fine
rcmdcheck::rcmdcheck(args = "--no-manual", error_on = "warning", check_dir = "check")
resulted in an error message thus:
checking package dependencies ... ERROR
Package suggested but not available: ‘pkgdown’,
1 error x | 0 warnings ✓ | 0 notes ✓
Upon installing ‘pkgdown’ locally, the rcmdcheck
test passed without any error or warning. However, this pkgdown
library was NOT listed as a dependency. You may want to investigate further.
covr::package_coverage()
call returned:
encodeR Coverage: 96.15% R/target_encoder.R: 94.83% R/frequency_encoder.R: 96.15% R/onehot_encoder.R: 96.55% R/conjugate_encoder.R: 96.84%
General Comments I am wondering if there is any reason why you included only the conjugate_encoder and target-encoder functions in your readme unlike all functions in vignettes.
Overall, there are several evidences of hardwork and well-thought-of package. The package is functional, and has the capacity to be incredibly useful.
Please check off boxes as applicable, and elaborate in the comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
URL
, BugReports
and Maintainer
(which may be autogenerated via Authors@R
).For packages co-submitting to JOSS
- [ ] The package has an obvious research application according to JOSS's definition
The package contains a
paper.md
matching JOSS's requirements with:
- [ ] A short summary describing the high-level functionality of the software
- [ ] Authors: A list of authors with their affiliations
- [ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
- [ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).
Estimated hours spent reviewing: 1.5 hours
@camadi
Thanks for the review.
Addressed for this release:
Did not address for this release:
Regarding your suggestion on non-categorical columns being specified to cat_columns, we have decided to not implement this change since there are logistical problems associated with implementing this. Namely, a categorical column is not necessarily a factor or a character vector; it can be numeric (such as category 1, 2, 3, ... encoded as 1, 2, 3, ...). In the mtcars dataset, for example, the number of cylinders is a categorical variable but it is treated as a numeric variable. This is why we included cat_columns as an argument in the first place - to allow the user to decide for themselves what is and what is not a categorical variable. While indeed, we could perhaps specify some arbitrary threshold of unique values (like 5) and perhaps raise a warning, categorical features with high cardinality (like text data, for example) will naturally raise these warnings repeatedly which would make the package less user friendly and intuitive. This is especially a problem for our package because a big advantage for using many of the encoders (frequency encoding, conjugate encoding, and target encoding) is that they do not increase the dimension of the dataset by a large amount that is customary with categorical features of high cardinality. Therefore, there is greater reason to use these encoders for datasets with categorical features with a large number of categories, which would contradict the warnings that would be raised in these situations.
Regarding the usage examples in the README, we decided against including more examples of the other functions because we think this is already sufficiently covered in detail in the vignette. We feel that the usage section will get too crowded if we show all four functions, and is only meant to showcase basic functionality of the package.
@singh-karanpal
Thanks for the feedback! We are happy that the installation and package usage went smoothly for you.
All of the addressed feedback, as discussed above, can be viewed in our latest release, v1.2.1.
name: encodeR about: This package seeks to provide a convenient set of functions that allow for the encoding of categorical features in potentially more informative ways when compared to other, more standard methods. The user feeds as input a training and testing dataset with categorical features, and the resulting data frames returned are preprocessed with a specific encoding of the categorical features. At a high level, this package automates the preprocessing of categorical features in ways that exploit particular correlations between the different categories and the data without increasing the dimension of the dataset, like in one hot encoding. Thus, through the more deliberate handling of these categorical features, higher model performance can possibly be achieved.
Submitting Author: Team Maryam Mirzakhani (L02 Group 17: @bbaillie @braydentang1 @robilizando @fsywang )
Repository: https://github.com/UBC-MDS/encodeR Version submitted: 1.2.0 Editor: @kvarada
Reviewer 1: @camadi
Reviewer 2: @singh-karanpal
Archive: TBD
Version accepted: TBD
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
This package preprocesses categorical features by encoding them in novel ways for better modeling performance.
The target audience of this package are people who practice predictive modeling. Any data scientist or researcher who is focused on supervised or unsupervised learning tasks will find this package useful, especially if their dataset contains a large amount of categorical features.
There are some packages in R that include different, more sophisticated kinds of encoding methods. The well known framework
H20
has a function for target encoding, and the recipes package has the ability to one hot encode. The packagecattonum
also contains many kinds of encoding schemes such as frequency encoding, target encoding, and one hot encoding.However, our package implements conjugate encoding which is a very new method of encoding categorical features, published in 2019. This does not have any R implementation. Furthermore, all of these functions in this package are all in one place, rather than scattered amongst many different packages with different API's.
Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
JOSS Options
- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: - (*Do not submit your package separately to JOSS*)MEE Options
- [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)Code of conduct