techrah opened this issue 4 years ago
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation, including metadata in a `setup.py` file or elsewhere.

Readme requirements: the package meets the readme requirements below. The README should include all required sections, from top to bottom.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a `paper.md` matching JOSS's requirements, with a high-level description in the package root or in `inst/`.
Estimated hours spent reviewing: 1.5 hours
Hey! Awesome work everyone - I definitely see the usefulness of your package and enjoyed taking a look through it.
A few suggestions on things to update/look into:

- I believe scikit-learn v.0.22.2 should be listed as a dependency in the README (I was prompted to update my scikit-learn upon installing your package).
- When I ran the usage scenario for the `recursive_feature_elimination` function and the `forward_selection` function, I got the following error in both cases: `NameError: name 'y' is not defined`. Maybe the input to the function should be a capital `Y` instead of `y`? However, when I tried it with this change, I got a different output than the one in the README for both functions. Not entirely sure what is going on here, but definitely something to look into.
- Not sure if `scorer()` needs to be defined twice in the README in the exact same way (for `forward_selection` and `simulated_annealing`). It might make more sense to name the one used in the `recursive_feature_elimination` example something different so you do not need to redefine the function again.
- For the `variance_thresholding` usage scenario, the "from" is missing in the line where the function gets imported (it should read `from feature_selection.variance_thresholding import variance_thresholding`).
- The `simulated_annealing` function documentation on Read the Docs appears to be missing subsections (i.e., parameters, returns, return type, examples). I see these are present in the docstring for the function, but they do not appear to be rendering.

Thanks for giving me the opportunity to review your package! Please address the above suggestions before final approval. Awesome work :)
Review Template
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation, including metadata in a `setup.py` file or elsewhere.

Readme requirements: the package meets the readme requirements below. The README should include all required sections, from top to bottom.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a `paper.md` matching JOSS's requirements, with a high-level description in the package root or in `inst/`.
Estimated hours spent reviewing: 1
Hi guys, excellent work overall. `feature_selection` is definitely a good package to create, and the documentation is well written. The following are my suggestions:

- `recursive_feature_elimination` produces output other than the indices of the selected features. It would be nice if you could modify the function so that it only produces the indices, not the stray `5 10`.
- I found `variance_thresholding` a bit hard to comprehend; maybe elaborate in the docstring? Are we calculating the variance of each feature?
- Also, for `variance_thresholding`, I do not understand why 0 was set as the default threshold. As long as the numbers are not all identical, e.g. `[1, 0, 0]`, the variance is always bigger than 0?
- Also, I guess you intended to set 0 as the default value for `threshold`, but in the docstring you wrote `optional` for that argument. Please fix :)
- For `simulated_annealing`, I followed the exact same steps in the docstring and encountered an error. The error message was `NotFittedError: This LinearRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.`

Thank you for allowing me to review the package. I hope my suggestions help.
Thanks for your comments @lisebraaten and @taohuang-ubc, we really appreciate your time in sharing your thoughts to improve our Python package. And @lisebraaten, you are more than welcome to steal the idea to link both R and Python packages 😁
Additionally, @lisebraaten, I just want to clarify a comment regarding the final part of your second suggestion in the Review Comments section. You are getting different outputs in `forward_selection` and `recursive_feature_elimination` because the Friedman function returns a random dataset where only the first five features are related to the dependent variable `y`, and all additional variables in the dataset are independent of `y`; so you should expect different results, but they should always contain some of the first five features. I believe we can address your comment by using a "random state" when calling the Friedman function in our examples, which would make them 100% reproducible.
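For context, a minimal sketch of that fix using scikit-learn's `make_friedman1` (the exact call in the README may differ; the parameter values below are illustrative):

```python
from sklearn.datasets import make_friedman1

# Only the first 5 of the 10 generated features enter the Friedman function;
# pinning random_state makes the dataset, and hence the selected features,
# reproducible across runs.
X, Y = make_friedman1(n_samples=200, n_features=10, random_state=42)
```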
Hi @vcuspinera! Makes sense, I thought randomness was potentially why but maybe it would be useful to clarify this in the README for users who aren't familiar with the Friedman function and how it generates a random dataset. You explained it really well above if you want to put a small note stating something similar (or a random state as you suggested above would also do the trick to make it fully reproducible).
@taohuang-ubc, thanks for your feedback! Regarding `variance_thresholding`:
Yes, we are calculating the variance of each feature.
Setting the threshold at 0 means we only get rid of features that have zero variation. This value is the most unopinionated and hence most appropriate default.
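As a minimal sketch (not the package's actual implementation), thresholding at 0 with a strict comparison keeps every feature that varies at all:

```python
import numpy as np

def keep_varying_features(X, threshold=0):
    """Hypothetical helper: return indices of columns whose variance
    exceeds `threshold`; at the default of 0, only constant columns drop."""
    return np.where(np.var(X, axis=0) > threshold)[0]

X = np.array([[1, 0, 2],
              [1, 1, 3],
              [1, 0, 4]])
print(keep_varying_features(X))  # [1 2] -- column 0 is constant, so it is dropped
```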
If an argument has a default, it is, by definition, optional. And vice versa: if you define an argument without a default, it is, by design, required. That is how Python works.
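A tiny, self-contained illustration of that rule (the names here are arbitrary):

```python
def greet(name, punctuation="!"):       # `punctuation` has a default, so it is optional
    return f"Hello, {name}{punctuation}"

print(greet("world"))                    # OK: falls back to the default "!"
print(greet("world", punctuation="?"))   # OK: the default is overridden
# greet() would raise TypeError: `name` has no default, so it is required
```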
I hope my response clarified some of the confusion you had!
Hi @jackyho112, yea, it does, thanks.
Thanks @lisebraaten and @taohuang-ubc for your reviews. Here is a summary of all the changes we have made.
Please refer to release 1.1.8 for the changes summarized below.
- I believe scikit-learn v.0.22.2 should be listed as a dependency in the README (I was prompted to update my scikit-learn upon installing your package).
`scikit-learn` is only required for the tests. We have made the appropriate correction in the `pyproject.toml` file.
- When I ran the usage scenario for the `recursive_feature_elimination` function and the `forward_selection` function, I got the following error in both cases: `NameError: name 'y' is not defined`. Maybe the input to the function should be a capital `Y` instead of `y`? However, when I tried it with this change, I got a different output than the one in the README for both functions. Not entirely sure what is going on here, but definitely something to look into.
We have made the appropriate corrections in the tests for consistent usage of `Y` in the `README.md` file.
- Not sure if `scorer()` needs to be defined twice in the README in the exact same way (for `forward_selection` and `simulated_annealing`). It might make more sense to name the one used in the `recursive_feature_elimination` example something different so you do not need to redefine the function again.
We decided to keep all the code for the four examples separate so that each example stands on its own.
- For the `variance_thresholding` usage scenario, the "from" is missing in the line where the function gets imported (it should read `from feature_selection.variance_thresholding import variance_thresholding`).
This has been corrected.
- The `simulated_annealing` function documentation on Read the Docs appears to be missing subsections (i.e., parameters, returns, return type, examples). I see these are present in the docstring for the function, but they do not appear to be rendering.
The docstrings were fixed and the generated documentation now correctly shows the missing subsections.
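For reference, a hedged sketch of the numpydoc-style sections that Sphinx needs to find (correctly spelled and indented) in order to render these subsections; the signature below is hypothetical:

```python
def simulated_annealing(scorer, X, y):
    """Select features using simulated annealing.

    Parameters
    ----------
    scorer : callable
        Fits a model on a candidate feature subset and returns its score.
    X : array-like of shape (n_samples, n_features)
        Feature matrix.
    y : array-like of shape (n_samples,)
        Target values.

    Returns
    -------
    numpy.ndarray
        Indices of the selected features.

    Examples
    --------
    >>> simulated_annealing(scorer, X, y)  # doctest: +SKIP
    """
```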
- `recursive_feature_elimination` produces output other than the indices of the selected features. It would be nice if you could modify the function so that it only produces the indices, not the stray `5 10`.
We have removed the debug code that was causing this.
- I found `variance_thresholding` a bit hard to comprehend; maybe elaborate in the docstring? Are we calculating the variance of each feature?
Docstrings were updated for clarity.
- Also, for `variance_thresholding`, I do not understand why 0 was set as the default threshold. As long as the numbers are not all identical, e.g. `[1, 0, 0]`, the variance is always bigger than 0?
Setting the threshold at 0 means we only get rid of features that have zero variation. This value is the most unopinionated and hence most appropriate default.
- Also, I guess you intended to set 0 as the default value for `threshold`, but in the docstring you wrote `optional` for that argument. Please fix :)
If an argument has a default, it is, by definition, optional. And vice versa: if you define an argument without a default, it is, by design, required.
- For `simulated_annealing`, I followed the exact same steps in the docstring and encountered an error.
- The error message was `NotFittedError: This LinearRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.`
This was fixed in the example in the docstring.
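For anyone hitting the same error: the estimator must be fitted before it can score or predict. A minimal sketch of the pattern, assuming a scorer along these lines (the docstring's actual code may differ):

```python
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression

def scorer(X, y):
    """Hypothetical scorer: fit first, then score."""
    model = LinearRegression()
    model.fit(X, y)           # omitting this call is what raises NotFittedError
    return model.score(X, y)

X, y = make_friedman1(random_state=42)
print(scorer(X, y))
```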
Submitting Author: Ryan Homer (@ryanhomer), Jacky Ho (@jackyho112), Derek Kruszewski (@dkruszew), Victor Cuspinera (@vcuspinera)
Package Name: feature-selection
One-Line Description of Package: Feature Selection for Machine Learning Models
Repository Link: https://github.com/UBC-MDS/feature-selection-python
Version submitted: 1.1.2
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Lise Braaten (@lisebraaten)
Reviewer 2: Tao Huang (@taohuang-ubc)
Archive: TBD
Version accepted: TBD
Description
Include a brief paragraph describing what your package does:
Feature selection in machine learning reduces complexity, reduces the time it takes to train an algorithm, and can improve the accuracy of your model (if the features are selected wisely). However, this is not a trivial task. The feature-selection Python package implements feature selection algorithms that can work with existing machine learning models, such as those from scikit-learn.
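For a flavor of the intended usage, here is a hedged sketch based on the import path shown in the reviews below; the exact signature and return value of `variance_thresholding` are assumptions:

```python
import numpy as np
from feature_selection.variance_thresholding import variance_thresholding

X = np.array([[1, 0, 2],
              [1, 1, 3],
              [1, 0, 4]])
# Assumed behavior: drop features whose variance does not exceed the
# default threshold of 0, returning the indices of the kept features.
selected = variance_thresholding(X)
print(selected)
```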
Scope
* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
This package implements feature selection algorithms. It is expected that the user will make use of packages such as scikit-learn to do the actual model fitting and scoring; this package then makes use of those results to carry out feature selection.
Who is the target audience and what are scientific applications of this package?
It is expected that this package will be helpful to mainly data scientists, students of data science, and in general, anyone involved in machine learning.
Are there other Python packages that accomplish the same thing? If so, how does yours differ?
Some of these feature selection algorithms already exist on the Python platform. Others either do not have an implementation, or their implementation is fairly complex. This package aims to provide easy-to-use implementations of feature selection algorithms, and to provide a more seamless experience between the R and Python platforms through a companion R edition that works in a very similar way.
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted: n/a
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PRs, rather than submitting a more dense, text-based review. It will also allow you to demonstrate addressing each issue via PR links.
Code of conduct
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
Editor and review templates can be found here