Submission: encoderPy (Python)

Submitting Author: Team Maryam Mirzakhani (L02 Group 17: @bbaillie @braydentang1 @robilizando @fsywang )
Package Name: encoderPy
One-Line Description of Package: Creates numeric encodings for categorical variables
Repository Link: https://github.com/UBC-MDS/encoderPy
Version submitted: v1.2.0
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Alistair Clark (@alistair-clark)
Reviewer 2: Brendon Campbell (@brendoncampbell)
Archive: TBD
Version accepted: TBD

Description

This package seeks to provide a convenient set of functions that allow for the encoding of categorical features in potentially more informative ways when compared to other, more standard methods. The user will feed as input a training and testing dataset with categorical features, and the resulting data frames returned will be preprocessed with a specific encoding of the categorical features. At a high level, this package automates the preprocessing of categorical features in ways that exploit particular correlations between the different categories and the data without increasing the dimension of the dataset, like in one hot encoding. Thus, through the more deliberate handling of these categorical features, higher model performance can possibly be achieved.
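
To illustrate the idea of encoding a categorical feature without adding columns, here is a minimal sketch of mean target encoding in plain Python. It is not the package's implementation: the function name `target_mean_encode`, its arguments, and the fallback to the global mean for unseen categories are all assumptions made for this example.

```python
from collections import defaultdict


def target_mean_encode(train_categories, train_targets, test_categories=None):
    """Replace each category with the mean target observed for it in training.

    Hypothetical helper for illustration; not part of encoderPy.
    """
    if not train_categories:
        raise ValueError("train_categories must not be empty")
    if len(train_categories) != len(train_targets):
        raise ValueError("train_categories and train_targets must have equal length")

    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(train_categories, train_targets):
        sums[category] += target
        counts[category] += 1

    # Per-category mean of the target, learned from the training data only.
    means = {category: sums[category] / counts[category] for category in counts}
    # Assumption of this sketch: unseen test categories fall back to the global mean.
    global_mean = sum(train_targets) / len(train_targets)

    encoded_train = [means[c] for c in train_categories]
    if test_categories is None:
        return encoded_train, None
    encoded_test = [means.get(c, global_mean) for c in test_categories]
    return encoded_train, encoded_test


train_cyl = ["4", "6", "4", "8"]
train_y = [1.0, 0.0, 0.0, 1.0]
encoded_train, encoded_test = target_mean_encode(train_cyl, train_y, ["6", "12"])
```

The single categorical column becomes a single numeric column, whereas one hot encoding of the same feature would add one column per category.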

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [X] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [ ] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences):

This package preprocesses categorical features by encoding them in novel ways for better modelling performance.

Who is the target audience and what are scientific applications of this package?

The target audience of this package is people who practice predictive modeling. Any data scientist or researcher who is focused on supervised or unsupervised learning tasks will find this package useful, especially if their dataset contains a large number of categorical features.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

There is one notable package in Python that has a variety of different methods for more informative encodings of categorical features, aptly named Category Encoders. However, Category Encoders does not include a frequency encoder or a conjugate-prior encoder. These two encoders are useful additions: frequency encoding has become relatively popular in the past couple of years, especially in Kaggle competitions, and conjugate encoding is a new, state-of-the-art methodology that has been shown to work well on many datasets. Furthermore, this package fully supports Pandas dataframes and will not drop column names, which eliminates any ambiguity in what each column represents with respect to the original columns/features.
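
For readers unfamiliar with frequency encoding, the sketch below shows the general technique on plain Python lists. It does not reproduce the package's `frequency_encoder`: the name `frequency_encode`, its signature, and the choice to give unseen test categories a frequency of 0.0 are assumptions made for illustration.

```python
from collections import Counter


def frequency_encode(train_categories, test_categories=None):
    """Map each category to its relative frequency in the training data.

    Hypothetical helper for illustration; not part of encoderPy.
    """
    if not train_categories:
        raise ValueError("train_categories must not be empty")

    counts = Counter(train_categories)
    total = len(train_categories)
    frequencies = {category: count / total for category, count in counts.items()}

    encoded_train = [frequencies[c] for c in train_categories]
    if test_categories is None:
        return encoded_train, None
    # Assumption of this sketch: categories unseen in training get frequency 0.0.
    encoded_test = [frequencies.get(c, 0.0) for c in test_categories]
    return encoded_train, encoded_test


encoded_train, encoded_test = frequency_encode(["a", "b", "a", "a"], ["b", "c"])
```

The frequencies are learned from the training data only and then applied to the test data, so the test set never influences the encoding.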

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[X] does not violate the Terms of Service of any service it interacts with.
[X] has an OSI approved license
[X] contains a README with instructions for installing the development version.
[X] includes documentation with examples for all functions.
[X] contains a vignette with examples of its essential functions and uses.
[X] has a test suite.
[X] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

No.

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[X] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5

Review Comments

Awesome implementation of state-of-the-art techniques in package form (and attribution to the relevant research on the README)
- I would consider also adding reference to the conjugate encoding paper your implementation is based on elsewhere in your documentation as well
python-semantic-release should be added to the dependencies listed on the README
- I had to manually install python-semantic-release after an initial error, but after doing so installation completed with no issues
Vignette related observations on the README: It was not super clear whether the subsections below the 'Vignette' header (ex. 'Target Encoding') were meant to be function descriptions or vignette examples or both. Assuming those subsections are meant to be vignette content:
- The structure/flow of these subsections was a little confusing to follow on a first pass. You may want to consider adjusting the heading levels within the README so that the 'vignette' header appears to be equal to (or the parent of) the example sections that follow.
- Excellent job with the comprehensive examples. Given the length and potential for additional elaboration, you may want to consider keeping a streamlined usage example in the README and moving the full vignette content to its own file/page/URL (to avoid crowding out the other README content)
Function code was well formatted and clear, but could use some inline comments to enhance readability

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s) demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all user-facing functions
[X] Examples for all user-facing functions
[X] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[X] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[X] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[X] The package name
[X] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[X] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[X] Installation instructions
[X] Any additional setup required (authentication tokens, etc)
[X] Brief demonstration usage
[X] Direction to more detailed documentation (e.g. your documentation files or website).
[X] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[X] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

Final approval (post-review)

[x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5 hours

Review Comments

The README is extremely well documented. It lays out strong rationale for why I might want to use this package, and includes detailed examples and descriptions for each function. This is great.
I was not able to install the package. Here are the issues I ran into:
- The first error I received was related to flake8 (maybe add to your dependencies list in the README?)
- After installing/upgrading flake8, I received this error: ERROR: No matching distribution found for colorama==0.4.1 (from ndebug~=0.1->python-semantic-release<5.0.0,>=4.10.0->encoderpy). I have colorama==0.4.1 installed, so I'm not sure where this issue is coming from. I noticed that colorama is not in your pyproject.toml, so it's possible this is causing the issue. Please let me know if you've run into this error and have recommendations for how to address it.
The documentation on the README is excellent. One suggestion to make it better would be to show the output dataframe of each function. I believe this was your intention because you say, "Observe that both the cyl and the vs columns have been replaced with fully numeric columns.", but it's not appearing anywhere. I'd want to see the output examples before I'd consider downloading and using the package.
This is minor, but I noticed that the description of arguments across functions varies, even when describing the same argument. I haven't checked all arguments, but I'd recommend making them consistent if they are not. For example, the X_test argument has a different description in every function.
- conjugate_encoder: A pandas dataframe representing the test set, containing some set of categorical features/columns. This is an optional argument.
- frequency_encoder: An optional pandas dataframe representing the test set, containing some set of categorical features/columns. Default is None.
- onehot_encoder: A pandas dataframe representing the test set, containing some set of categorical features/columns.
- target_encoder: A pandas dataframe representing the test set, containing some set of categorical features/columns. default None.
Related to the above comment, I noticed that onehot_encoder is the only function where X_test is not an optional argument. Unless there's a good reason for this, I'd make this consistent with the other functions.
I noticed that your test files often call each test function right after defining it (e.g. test_output()). I could be wrong, but I don't believe you need to do this, as poetry and pytest will run all the test functions in the test file when doing their checks. For example, here are some other test file examples:
Some of your functions contain detailed input checking and exception handling, while others do not.
- Related to this, I'd consider avoiding raising bare "Exceptions" to catch errors as it can have unintended consequences (see this thread for more discussion).
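
On the point about calling test functions explicitly: pytest collects functions whose names start with `test_` from files named `test_*.py` or `*_test.py`, so a trailing call should not be needed. A minimal sketch follows; the file name and the helper `count_levels` are hypothetical and not taken from the package.

```python
# test_example.py (hypothetical file name; pytest collects files matching test_*.py)
from collections import Counter


def count_levels(values):
    """Hypothetical helper standing in for a function under test."""
    return dict(Counter(values))


def test_count_levels():
    # pytest finds and runs this function because of its test_ prefix.
    assert count_levels(["a", "b", "a"]) == {"a": 2, "b": 1}


# No trailing test_count_levels() call is needed here. A module-level call would
# also execute when pytest imports this file during collection, so the test body
# would run one extra time.
```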

Let me know if you have thoughts on how to fix the installation, and then I can check the rest of the boxes and approve the package!

@alistair-clark

Hey, thanks for the review.

I am not too sure about these issues, particularly with colorama (I don't recall ever using this package directly, to be honest).

We forgot to mention in our README, as @brendoncampbell pointed out, that one needs to have semantic-release installed. Have you run this?

pip install python-semantic-release

Otherwise, I am unsure what is causing these issues. I am able to install the package on my end after installing semantic-release.

That fixed it! I was able to install the package after running the above. It's strange because I already had python-semantic-release installed, so it must have been something to do with dependencies. I noticed that running the above uninstalled colorama 0.4.3 and then reinstalled colorama 0.4.1

I'll update the checkboxes above now that it works.

Responses:

@brendoncampbell

Thanks for the feedback.

Addressed for the latest release:

We included a more explicit reference to the source paper for the conjugate encoding function directly in the documentation. Thanks for the suggestion!
We added the Python semantic-release dependency directly in the README so users know to install it before trying to install our package.
We had already added semantic-release as an explicit dependency in the .toml file. We do not know why it is not installed automatically when installing the package, but at least the installation throws an error.
We heavily reorganized our README and vignette to be more readable. First, as you suggested, we separated the vignette from the README to make the README less cluttered, and now host the vignette as an HTML file instead. In addition, we made the headers more distinct so that sections and subsections are clearer to the reader.
Finally, we improved all inline comments in our functions and tests as you suggested.

Did not address in this release:

We addressed everything brought up by Brendon.

@alistair-clark

As previously noted, thanks for the feedback as well.

Addressed for the latest release:

As mentioned above, we added the python-semantic-release dependency to the README file so that no more users run into this issue! While it would be nice if the pip install command installed all the dependencies, it looks like this is not a trivial task.
The vignette has been redone as a Python vignette in RMarkdown so that the output is visible. Indeed, we intended to have the output there, but we were a bit unsure of how to approach this without hosting a completely separate file. Now that we host a separate HTML file, this is pretty easy to accomplish. Thanks for the suggestion!
All of our docstrings have been redone to be consistent across all of our functions.
We added X_test=None as default in onehot_encoder.py. The intention was to include this before, but it was not included by accident. Thanks!

Did not address in this release:

We decided against removing the function calls in our tests. When we try to remove them, the code coverage decreases for some odd reason. It turns out poetry treats the test function as another condition to cover, so we are keeping these calls for now.
Regarding the defensive coding improvements, note that both the target_encoder.py and conjugate_encoder.py functions involve more arguments and therefore require much more defensive exception handling. For example, the frequency and one hot encoder functions do not involve any prior specifications, response variables, or response variable types because they are not used in these encoding schemes. Otherwise, much of the other defensive tests (such as input checks) are the same across all of our functions (though in some cases, hidden throughout the function). Therefore, we did not change anything regarding this issue.
The suggestion to change from general exception raising to actual, more specific errors is really good and we agree with it. However, this would require running through all of our code to raise the correct exceptions per defensive test (and then possibly changing our tests in pytest) and we lacked the time to do this. In the future, this is a very important thing to change and we are thankful for you bringing this to our attention!
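
As a rough illustration of what raising specific exceptions could look like, here is a small sketch. The helper `check_inputs`, its arguments, and its messages are hypothetical and are not taken from the package's code; the training data is modelled as a plain dict of columns to keep the example dependency-free.

```python
def check_inputs(X_train, columns):
    """Hypothetical validation helper that raises specific exception types."""
    if not isinstance(columns, list):
        # Wrong argument type: TypeError instead of a bare Exception.
        raise TypeError("columns must be a list of column names")
    missing = [c for c in columns if c not in X_train]
    if missing:
        # Right type but invalid value: ValueError.
        raise ValueError(f"columns not found in X_train: {missing}")


X_train = {"cyl": ["4", "6"], "vs": ["0", "1"]}

try:
    check_inputs(X_train, ["cyl", "gear"])
except ValueError as err:
    # Callers can now handle this case without also swallowing unrelated errors.
    message = str(err)
```

In a pytest suite, the corresponding tests could use `pytest.raises(ValueError)` or `pytest.raises(TypeError)` to assert on the specific exception type.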

Thanks to both of you for the great feedback.

Release Link:

All of the addressed feedback, as discussed above, can be viewed in our latest release, v1.2.1.

Thanks for following up. Great job with the package. I hope people find it and use it!
