UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages
1 stars 0 forks source link

Submission: encoderPy (Python) #18

Open bbaillie opened 4 years ago

bbaillie commented 4 years ago

Submitting Author: Team Maryam Mirzakhani (L02 Group 17: @bbaillie @braydentang1 @robilizando @fsywang )
Package Name: encoderPy One-Line Description of Package: Creates numeric encodings for categorical variables Repository Link: https://github.com/UBC-MDS/encoderPy Version submitted: v1.2.0 Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Alistair Clark (@alistair-clark ) Reviewer 2: Brendon Campbell (@brendoncampbell )
Archive: TBD
Version accepted: TBD


Description

This package seeks to provide a convenient set of functions that allow for the encoding of categorical features in potentially more informative ways when compared to other, more standard methods. The user will feed as input a training and testing dataset with categorical features, and the resulting data frames returned will be preprocessed with a specific encoding of the categorical features. At a high level, this package automates the preprocessing of categorical features in ways that exploit particular correlations between the different categories and the data without increasing the dimension of the dataset, like in one hot encoding. Thus, through the more deliberate handling of these categorical features, higher model performance can possibly be achieved.

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

This package preprocesses categorical features by encoding them in novel ways for better modelling performance.

The target audience of this package are people who practice predictive modeling. Any data scientist or researcher who is focused on supervised or unsupervised learning tasks will find this package useful, especially if their dataset contains a large amount of categorical features.

There is one notable package in Python that has a variety of different methods for more informative encodings of categorical features, aptly named Category Encoders. However, Category Encoders does not include a frequency encoder or a conjugate-prior encoder. These two encoders are inherently useful since frequency encoding has become relatively popular in the past couple of years, especially in Kaggle competitions and conjugate encoding is a new, state of the art methodology that has been shown to work well on many datasets. Furthermore, this package fully supports Pandas dataframes and will not drop column names, which eliminates any ambiguity in what each column represents with respect to the original columns/features.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

No.

JOSS Checks - [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

Editor and Review Templates

Editor and review templates can be found here

brendoncampbell commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 1.5


Review Comments

alistair-clark commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 1.5 hours


Review Comments

Let me know if you have thoughts on how to fix the installation, and then I can check the rest of the boxes and approve the package!

braydentang1 commented 4 years ago

@alistair-clark

Hey, thanks for the review.

I am not too sure on these issues, particularly with colorama (I don't recall ever using this package directly too be honest).

We forgot to mention in our README, as @brendoncampbell pointed out, that one needs to have semantic-release installed. Have you run this?

pip install python-semantic-release

Otherwise, I am unsure what is causing these issues. I am able to install the package on my end after installing semantic-release.

alistair-clark commented 4 years ago

@alistair-clark

Hey, thanks for the review.

I am not too sure on these issues, particularly with colorama (I don't recall ever using this package directly too be honest).

We forgot to mention in our README, as @brendoncampbell pointed out, that one needs to have semantic-release installed. Have you run this?

pip install python-semantic-release

Otherwise, I am unsure what is causing these issues. I am able to install the package on my end after installing semantic-release.

That fixed it! I was able to install the package after running the above. It's strange because I already had python-semantic-release installed, so it must have been something to do with dependencies. I noticed that running the above uninstalled colorama 0.4.3 and then reinstalled colorama 0.4.1

I'll update the checkboxes above now that it works.

braydentang1 commented 4 years ago

Responses:

@brendoncampbell

Thanks for the feedback.

Addressed for the latest release:

Did not address in this release:

@alistair-clark

As previously noted, thanks for the feedback as well.

Addressed for the latest release:

Did not address in this release:

Thanks to both of you for the great feedback.

Release Link:

All of the addressed feedback, as discussed above, can be viewed in our latest release, v1.2.1.

alistair-clark commented 4 years ago

Thanks for following up. Great job with the package. I hope people find it and use it!