UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: feature-selection (Python) #36

Open techrah opened 4 years ago

techrah commented 4 years ago

Submitting Author: Ryan Homer (@ryanhomer), Jacky Ho (@jackyho112), Derek Kruszewski (@dkruszew), Victor Cuspinera (@vcuspinera)
Package Name: feature-selection
One-Line Description of Package: Feature Selection for Machine Learning Models
Repository Link: https://github.com/UBC-MDS/feature-selection-python
Version submitted: 1.1.2
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Lise Braaten (@lisebraaten)
Reviewer 2: Tao Huang (@taohuang-ubc)
Archive: TBD
Version accepted: TBD


Description

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a more dense, text-based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

lisebraaten commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1.5 hours


Review Comments

Hey! Awesome work everyone - I definitely see the usefulness of your package and enjoyed taking a look through it.

A few suggestions on things to update/look into:

Thanks for giving me the opportunity to review your package! Please address the above suggestions before final approval. Awesome work :)

taohuang-ubc commented 4 years ago

Review Template

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1


Review Comments

Hi guys, excellent work overall. feature_selection is definitely a good package to create. The documentation is well written. The following are my suggestions:

(screenshot: recursive_feature_elimination output showing extra values `5 10`)

It would be nice if you could modify the function so it produces only the indices, not the `5 10`.

(screenshot: simulated_annealing example raising an error)

The error message was `NotFittedError: This LinearRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.`

Thank you for allowing me to review the package. Hope my suggestions help.

vcuspinera commented 4 years ago

Thanks for your comments @lisebraaten and @taohuang-ubc, we really appreciate your time in sharing your thoughts to improve our Python package. And @lisebraaten, you are more than welcome to steal the idea to link both R and Python packages 😁

Additionally, @lisebraaten, I just want to clarify a comment regarding the final part of your second suggestion in the Review Comments section. You are getting different output from forward_selection and recursive_feature_elimination because the Friedman function returns a random dataset in which the first 5 features are related to the dependent variable 'y' and all additional features are independent of 'y'. So you should expect different results, but they should always contain some of the first five features. I believe we can address your comment by using a "random state" when calling the Friedman function in our examples, which would make them 100% reproducible.
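For illustration, a minimal sketch of that suggestion, assuming the examples use scikit-learn's `make_friedman1` (the Friedman function): fixing `random_state` makes the generated dataset, and hence the selected features, reproducible.

```python
from sklearn.datasets import make_friedman1

# make_friedman1 generates data where only the first 5 of the
# n_features columns influence y; the rest are independent noise.
# A fixed random_state makes the example fully reproducible.
X, y = make_friedman1(n_samples=200, n_features=10, random_state=42)
print(X.shape, y.shape)  # (200, 10) (200,)
```

With the same `random_state`, every reader running the README example would see the same output.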

lisebraaten commented 4 years ago

Hi @vcuspinera! Makes sense, I thought randomness was potentially the reason, but it might be useful to clarify this in the README for users who aren't familiar with the Friedman function and how it generates a random dataset. You explained it really well above if you want to add a small note saying something similar (or a random state, as you suggested, would also do the trick to make it fully reproducible).

jackyho112 commented 4 years ago

@taohuang-ubc, thanks for your feedback! Regarding variance_thresholding,

  1. Yes, we are calculating the variance of each feature.

  2. Setting the threshold at 0 means we only get rid of features that have zero variation. This value is the most unopinionated and hence most appropriate default.

  3. If an argument has a default, it is, by definition, optional. Conversely, an argument defined without a default is, by design, required. That is how Python works.

I hope my response clarified some of the confusion you had!
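To illustrate the point about the default threshold of 0 (a NumPy sketch, not the package's actual implementation): only constant columns have zero variance, so a threshold of 0 removes nothing except features that carry no information at all.

```python
import numpy as np

# Sketch: a variance threshold of 0 drops only constant columns.
X = np.array([[1, 0, 7],
              [2, 0, 7],
              [3, 0, 7]])     # columns 1 and 2 are constant
variances = X.var(axis=0)     # column 0 varies; columns 1 and 2 do not
kept = np.where(variances > 0)[0]
print(kept)  # [0]
```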

taohuang-ubc commented 4 years ago

Hi @jackyho112, yeah, it does, thanks.

techrah commented 4 years ago

Thanks @lisebraaten and @taohuang-ubc for your reviews. Here is a summary of all the changes we have made.

Please refer to release 1.1.8 for the changes summarized below.

Summary Responses

  • I believe scikit-learn v.0.22.2 should be listed as a dependency in the README (I was prompted to update my scikit-learn upon installing your package).

scikit-learn is only required for the tests. We have made the appropriate correction in the pyproject.toml file.

  • When I ran the usage scenario for the recursive_feature_elimination function and forward_selection function I got the following error in both cases: NameError: name 'y' is not defined. Maybe the input to the function should be a capital Y instead of y? However, when I tried it with this change, I got a different output than the one in the README for both functions. Not entirely sure what is going on here but definitely something to look into.

We have made the appropriate corrections in the tests for consistent usage of Y in the README.md file.

  • Not sure if scorer() needs to be defined twice in the README in the exact same way (for forward_selection and simulated_annealing). Might make more sense to name the one used in the recursive_feature_elimination example something different so you do not need to redefine the function again.

We decided to keep all the code for the four examples separate so that each example stands on its own.

  • For the variance_thresholding usage scenario, the "from" is missing in the line where the function gets imported (should read from feature_selection.variance_thresholding import variance_thresholding).

This has been corrected.

  • simulated_annealing function documentation on Read the Docs appears to be missing subsections (ie. parameters, returns, return type, examples). I see these are present in the docstring for the function but do not appear to be rendering.

The docstrings were fixed and the generated documentation now correctly shows the missing subsections.

  • recursive_feature_elimination produces output other than the index of selected features.
(screenshot: recursive_feature_elimination output showing extra values `5 10`)

It would be nice if you could modify the function so it produces only the indices, not the `5 10`.

We have removed the debug code that was causing this.

  • I found `variance_thresholding` is a bit hard to comprehend; maybe elaborate on the docstring? Are we calculating the variance of each feature?

Docstrings were updated for clarity.

  • Also, for `variance_thresholding`, I do not understand why 0 was set as the default threshold. I think, as long as the numbers are different, e.g. [1,0,0], the variance is always bigger than 0?

Setting the threshold at 0 means we only get rid of features that have zero variation. This value is the most unopinionated and hence most appropriate default.

  • Also, I guess you intended to set 0 as the default value for threshold, but in the docstring, you wrote optional for that argument. Please fix :)

If an argument has a default, it is, by definition, optional. Conversely, an argument defined without a default is, by design, required.
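As a quick illustration of that point (the signature below is hypothetical, not the package's actual one): giving a parameter a default value is exactly what makes it optional in Python.

```python
import inspect

# A parameter with a default value is optional by definition;
# a parameter without one is required.
def variance_thresholding(X, threshold=0):  # hypothetical signature
    return threshold

print(variance_thresholding([[1, 0]]))        # omitted -> default 0
print(variance_thresholding([[1, 0]], 0.5))   # explicitly overridden
print(inspect.signature(variance_thresholding))  # (X, threshold=0)
```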

  • For simulated_annealing, I follow the exact same steps in the docstring and encounter an error:
(screenshot: simulated_annealing example raising `NotFittedError`)
  • The error message was `NotFittedError: This LinearRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.`

This was fixed in the example in the docstring.
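The likely shape of that fix, as a sketch rather than the package's actual docstring example: the estimator must be fitted before it is used for scoring or prediction, otherwise scikit-learn raises `NotFittedError`.

```python
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression

X, y = make_friedman1(random_state=0)
model = LinearRegression()
# Calling model.predict(X) or model.score(X, y) here would raise
# NotFittedError: the estimator has no learned coefficients yet.
model.fit(X, y)   # fitting first avoids the error
print(round(model.score(X, y), 2))
```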