Add weight of evidence encoder

tlienart commented 4 years ago

Adds a WOEEncoder to the encoders. Weight of evidence (WOE) is described here: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
WOE is a supervised encoder for binary targets. It can encode:
- categorical data
- numerical data (after binning)

The current implementation assumes that the data has already gone through an imputer, so it assumes there are no missing values (in either X or y).
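
For readers unfamiliar with WOE, here is a minimal sketch of the idea for a single categorical column. It is not the code in this PR: the function name `woe_table`, the `regularization` smoothing term, and the toy data are assumptions made for illustration, and references differ on the sign convention (some take the log of the negative share over the positive share instead).

```python
import math
from collections import Counter


def woe_table(categories, target, regularization=1.0):
    """Map each category to ln(share of positives / share of negatives).

    Hypothetical helper for illustration; assumes no missing values.
    """
    if set(target) - {0, 1}:
        raise ValueError("target must be binary (0/1)")
    # Count positives and negatives per category.
    pos = Counter(c for c, t in zip(categories, target) if t == 1)
    neg = Counter(c for c, t in zip(categories, target) if t == 0)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    table = {}
    for cat in set(categories):
        # Additive smoothing (an assumption) keeps both shares strictly positive.
        p = (pos[cat] + regularization) / (n_pos + 2 * regularization)
        q = (neg[cat] + regularization) / (n_neg + 2 * regularization)
        table[cat] = math.log(p / q)
    return table


# Toy data, made up for this example.
embarked = ["S", "C", "S", "Q", "C", "S"]
survived = [0, 1, 0, 1, 1, 1]
table = woe_table(embarked, survived)
encoded = [table[c] for c in embarked]
```

A numerical column would go through the same function after binning, with the bin label used as the category.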

Notes:

- Adds a titanic.csv dataset for testing; the tests compare against sklearn-contrib's category_encoders (https://contrib.scikit-learn.org/category_encoders/woe.html).
- It's a supervised encoder expecting a binary target; this doesn't work well with scikit-learn's automatic check_estimator, which tries to fit the estimator with Boston (3 classes). Please advise.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

cc @jeanfad

Edits

f30f650 fixes
- a line too long issue (reference to sklearn)
- an except with an error variable that was unused
- a naming issue (a variable named iter, which pylint doesn't like)

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: 69e6b2f9008aaea04bfe7533787004aba4814e90
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: 165902d1497e5df89d739c4518852d7794e71ca8
Result: FAILED

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: f30f6500fe7910b48fd67bb976481988e6680788
Result: FAILED

tlienart commented 4 years ago

Thanks a lot for the review. I'll fix everything I can ASAP.

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: 19c6c4e4d7920748b647eda42274cf8562ff96e5
Result: FAILED

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: 6f8ac9739768701a6576c344ddd48c4082c241c1
Result: FAILED

tlienart commented 4 years ago

Right, I think I've addressed all your comments and extended the spirit of some (e.g. removed all attributes unused in transform). Pandas is removed.

The only remaining point, as mentioned in the comments, is that you seem to want check_estimator to be applied to WoE. I may be missing something, but check_estimator uses Boston, which has three classes, and will therefore always fail with WoE. My suggestion is to ignore it and, if there is concern that we may be missing something, to add explicit extra tests for WoE covering dimensions etc.

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: 6c11ab6f98d134714612c87287e990ea82c26e34
Result: FAILED

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: 3c951a5d5fda81b154ca5aabbcd702258483c75e
Result: FAILED

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: 269b4b6cdb6a057ad6c3cd9d8de3a95e088dddaa
Result: FAILED

tlienart commented 4 years ago

gentle bump, thanks!

wiltonwu commented 4 years ago

> Right, I think I've addressed all your comments and extended the spirit of some (e.g. removed all attributes unused in transform). Pandas is removed.
>
> The only remaining point, as mentioned in the comments, is that you seem to want check_estimator to be applied to WoE. I may be missing something, but check_estimator uses Boston, which has three classes, and will therefore always fail with WoE. My suggestion is to ignore it and, if there is concern that we may be missing something, to add explicit extra tests for WoE covering dimensions etc.

I think the build logs still show some flake8 and black linting checks failing. Those should be straightforward to fix; please let me know if you need help.

For the check_estimator test, I think you can add a tag to your new estimator to mark it as compatible only with binary classification datasets. That involves overriding the _get_tags() or _more_tags() method. We do something similar for RobustImputer. See https://scikit-learn.org/stable/developers/develop.html#estimator-tags. check_estimator should know which specific tests to run based on those tags (it might even skip the entire test). Also, if you haven't seen that page, I'd recommend reading through it; it's a good guideline for writing scikit-learn compatible estimators.
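
A rough sketch of what such a tag override could look like, assuming a scikit-learn version that still reads `_more_tags()` (newer releases use a different tag mechanism). The class name `WOEEncoderSketch` is a hypothetical stand-in, not the class in this PR, and fit/transform are left out. Whether check_estimator actually honors each tag depends on the scikit-learn version.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class WOEEncoderSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the encoder; fit/transform omitted."""

    def _more_tags(self):
        # "binary_only" declares that only binary targets are supported;
        # "requires_y" declares that fit needs a target.
        return {"binary_only": True, "requires_y": True}


# Calling the method directly just shows the declared tags.
tags = WOEEncoderSketch()._more_tags()
```

The tags returned here are merged with the defaults from the base classes, so only the values that differ from those defaults need to be listed.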

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: 4034f8373fa7c4f2cc07fe442ea74bda15081acf
Result: FAILED

tlienart commented 4 years ago

Fixed what flake8 requested; it's unclear to me what the black linting wants (concrete help on specific lines to adjust would be appreciated).
I ended up implicitly turning off check_estimator by setting 'X_types': ['categorical'], because by default it tests with continuous values, which causes issues in the later checks at the transform stage (when testing whether the data to encode contains new categories). I did fix a few minor things based on the earlier tests, though.

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: aad5e404980b0e3ef9c832d38b505203cec6570b
Result: FAILED

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: 4e04fada960188315973e2c4d54c17cb7ffc2ec0
Result: SUCCEEDED

wiltonwu commented 4 years ago

AWS CodeBuild CI Report

CodeBuild project: sagemaker-sklearn-extension-pr
Commit ID: c3988864e44ab9d38173565b735e7d5419da9922
Result: SUCCEEDED
