UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: feature-selection (Python) #36

Open techrah opened 4 years ago

techrah commented 4 years ago

Submitting Author: Ryan Homer (@ryanhomer), Jacky Ho (@jackyho112), Derek Kruszewski (@dkruszew), Victor Cuspinera (@vcuspinera)
Package Name: feature-selection
One-Line Description of Package: Feature Selection for Machine Learning Models
Repository Link: https://github.com/UBC-MDS/feature-selection-python
Version submitted: 1.1.2
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Lise Braaten (@lisebraaten)
Reviewer 2: Tao Huang (@taohuang-ubc)
Archive: TBD
Version accepted: TBD


Description

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a more dense, text-based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

lisebraaten commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1.5 hours


Review Comments

Hey! Awesome work everyone - I definitely see the usefulness of your package and enjoyed taking a look through it.

A few suggestions on things to update/look into:

Thanks for giving me the opportunity to review your package! Please address the above suggestions before final approval. Awesome work :)

taohuang-ubc commented 4 years ago

Review Template

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1


Review Comments

Hi guys, excellent work overall. feature_selection is definitely a good package to create. The documentation is well written. The following are my suggestions:

(screenshot: recursive_feature_elimination output showing extra values `5 10`)

It would be nice if you could modify the function so it produces only the indices, not the `5 10`.

(screenshot: simulated_annealing example raising an error)

The error message was `NotFittedError: This LinearRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.`

Thank you for allowing me to review the package. Hope my suggestions help.

vcuspinera commented 4 years ago

Thanks for your comments @lisebraaten and @taohuang-ubc, we really appreciate your time in sharing your thoughts to improve our Python package. And @lisebraaten, you are more than welcome to steal the idea to link both R and Python packages 😁

Additionally, @lisebraaten, I just want to clarify a comment regarding the final part of your second suggestion in the Review Comments section. You are getting different output from forward_selection and recursive_feature_elimination because the Friedman function returns a random dataset in which the first 5 features are related to the dependent variable 'y' and all additional features are independent of 'y'. So you should expect different results, but they should always contain some of the first five features. I believe we can address your comment by using a "random state" when calling the Friedman function in our examples, which would make them 100% reproducible.
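For illustration, a minimal sketch of that suggestion, assuming the examples use scikit-learn's `make_friedman1` (the Friedman function): fixing `random_state` makes the generated dataset, and hence the selected features, reproducible.

```python
from sklearn.datasets import make_friedman1

# make_friedman1 generates data where only the first 5 of the
# n_features columns influence y; the rest are independent noise.
# A fixed random_state makes the example fully reproducible.
X, y = make_friedman1(n_samples=200, n_features=10, random_state=42)
print(X.shape, y.shape)  # (200, 10) (200,)
```

With the same `random_state`, every reader running the README example would see the same output.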

lisebraaten commented 4 years ago

Hi @vcuspinera! Makes sense, I thought randomness was potentially the reason, but it might be useful to clarify this in the README for users who aren't familiar with the Friedman function and how it generates a random dataset. You explained it really well above if you want to add a small note saying something similar (or a random state, as you suggested, would also do the trick to make it fully reproducible).

jackyho112 commented 4 years ago

@taohuang-ubc, thanks for your feedback! Regarding variance_thresholding,

  1. Yes, we are calculating the variance of each feature.

  2. Setting the threshold at 0 means we only get rid of features that have zero variation. This value is the most unopinionated and hence most appropriate default.

  3. If an argument has a default, it is, by definition, optional. Conversely, an argument defined without a default is, by design, required. That is how Python works.

I hope my response clarified some of the confusion you had!
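To illustrate the point about the default threshold of 0 (a NumPy sketch, not the package's actual implementation): only constant columns have zero variance, so a threshold of 0 removes nothing except features that carry no information at all.

```python
import numpy as np

# Sketch: a variance threshold of 0 drops only constant columns.
X = np.array([[1, 0, 7],
              [2, 0, 7],
              [3, 0, 7]])     # columns 1 and 2 are constant
variances = X.var(axis=0)     # column 0 varies; columns 1 and 2 do not
kept = np.where(variances > 0)[0]
print(kept)  # [0]
```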

taohuang-ubc commented 4 years ago

Hi @jackyho112, yeah, it does, thanks.

techrah commented 4 years ago

Thanks @lisebraaten and @taohuang-ubc for your reviews. Here is a summary of all the changes we have made.

Please refer to release 1.1.8 for the changes summarized below.

Summary Responses

  • I believe scikit-learn v.0.22.2 should be listed as a dependency in the README (I was prompted to update my scikit-learn upon installing your package).

scikit-learn is only required for the tests. We have made the appropriate correction in the pyproject.toml file.

  • When I ran the usage scenario for the recursive_feature_elimination function and forward_selection function I got the following error in both cases: NameError: name 'y' is not defined. Maybe the input to the function should be a capital Y instead of y? However, when I tried it with this change, I got a different output than the one in the README for both functions. Not entirely sure what is going on here but definitely something to look into.

We have made the appropriate corrections in the tests for consistent usage of Y in the README.md file.

  • Not sure if scorer() needs to be defined twice in the README in the exact same way (for forward_selection and simulated_annealing). Might make more sense to name the one used in the recursive_feature_elimination example something different so you do not need to redefine the function again.

We decided to keep all the code for the four examples separate so that each example stands on its own.

  • For the variance_thresholding usage scenario, the "from" is missing in the line where the function gets imported (should read from feature_selection.variance_thresholding import variance_thresholding).

This has been corrected.

  • simulated_annealing function documentation on Read the Docs appears to be missing subsections (ie. parameters, returns, return type, examples). I see these are present in the docstring for the function but do not appear to be rendering.

The docstrings were fixed and the generated documentation now correctly shows the missing subsections.

  • recursive_feature_elimination produces output other than the index of selected features.
(screenshot: recursive_feature_elimination output showing extra values `5 10`)

It would be nice if you could modify the function so it produces only the indices, not the `5 10`.

We have removed the debug code that was causing this.

  • I found `variance_thresholding` is a bit hard to comprehend; maybe elaborate on the docstring? Are we calculating the variance of each feature?

Docstrings were updated for clarity.

  • Also, for `variance_thresholding`, I do not understand why 0 was set as the default threshold. I think, as long as the numbers are different, e.g. [1,0,0], the variance is always bigger than 0?

Setting the threshold at 0 means we only get rid of features that have zero variation. This value is the most unopinionated and hence most appropriate default.

  • Also, I guess you intended to set 0 as the default value for threshold, but in the docstring, you wrote optional for that argument. Please fix :)

If an argument has a default, it is, by definition, optional. Conversely, an argument defined without a default is, by design, required.
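As a quick illustration of that point (the signature below is hypothetical, not the package's actual one): giving a parameter a default value is exactly what makes it optional in Python.

```python
import inspect

# A parameter with a default value is optional by definition;
# a parameter without one is required.
def variance_thresholding(X, threshold=0):  # hypothetical signature
    return threshold

print(variance_thresholding([[1, 0]]))        # omitted -> default 0
print(variance_thresholding([[1, 0]], 0.5))   # explicitly overridden
print(inspect.signature(variance_thresholding))  # (X, threshold=0)
```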

  • For simulated_annealing, I follow the exact same steps in the docstring and encounter an error:
(screenshot: simulated_annealing example raising `NotFittedError`)
  • The error message was `NotFittedError: This LinearRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.`

This was fixed in the example in the docstring.
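The likely shape of that fix, as a sketch rather than the package's actual docstring example: the estimator must be fitted before it is used for scoring or prediction, otherwise scikit-learn raises `NotFittedError`.

```python
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression

X, y = make_friedman1(random_state=0)
model = LinearRegression()
# Calling model.predict(X) or model.score(X, y) here would raise
# NotFittedError: the estimator has no learned coefficients yet.
model.fit(X, y)   # fitting first avoids the error
print(round(model.score(X, y), 2))
```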