Closed cgostic closed 2 years ago
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
setup.py
file or elsewhere.Readme requirements The package meets the readme requirements below:
The README should include, from top to bottom:
Estimated hours spent reviewing:
Overall I think you have done an excellent job with this packages. I think it addresses a specific need, and your documentation makes it very easy to understand how the package works. All of the feedback I have is pretty minor. Once addressed, I would feel confident approving this package.
Issue 1: Examples in README do not work as shown. For example:
>>> from pylaundry import categorize
>>> import pandas
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
... 'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
... 'c':['A','B','C','D','E','F','G','H','I','J']})
>>> # categorize with default max_cat (10)
>>> pylaundry.categorize(df)
NameError: name 'pylaundry' is not defined
I think the issue is regarding how you are importing. If I do the following, it works:
>>> from pylaundry.categorize import categorize
>>> import pandas
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
... 'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
... 'c':['A','B','C','D','E','F','G','H','I','J']})
>>> # categorize with default max_cat (10)
>>> pylaundry.categorize(df)
{'numeric': ['b'], 'categorical': ['c', 'a']}
I think this will hold true with all of your examples.
Issue 2: Typos
There are several typos in the REAMDE. I quick spell check should fix this. For example training
has been spelt incorrectly in several places as trianing
. Please run spell check on the entire document.
Issue 3: Missing examples in doc strings
fill_missing()
has no example in the doc stringselect_features
has no example in the doc stringtransform_columns
has no example in the doc stringIssue 4: Other options for checking column types
In your README you mention you are not aware of any other packages for categorizing columns. You may want to mention pandas.DataFrame.dtypes
which does this. However, I think your function is still useful, as it returns a dictionary, where as the built in pandas method returns a pandas.Series
.
Issue 5: Excessive functions exposed to the user in documentation.
Looking at readthedocs there are extra functions here exposed to the user. I would expect to only see four. Is there a way to hide the ones that should not be user facing? Also some functions appear twice, making it a bit more confusing to read.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
setup.py
file or elsewhere.Readme requirements The package meets the readme requirements below:
The README should include, from top to bottom:
Estimated hours spent reviewing:
I was very impressed to see how much work and effort you have put into this package. It looks really professional. The documentation covers a lot of details about how it works. I especially like that you have included and workflow of how all the functions can be used in an integrated form.
Issue 1: Dependency Sklearn==0.22 caused installation problems:
I actually had 0.22.1 and when you try to install you get an error message (see below) because you need scikit-learn==0.22.2. Here is the error I got:
ERROR: Could not find a version that satisfies the requirement scikit-learn<0.23.0,>=0.22.2 (from pylaundry) (from versions: 0.20.3, 0.21.0, 0.22rc2.post1, 0.22)
ERROR: No matching distribution found for scikit-learn<0.23.0,>=0.22.2 (from pylaundry)
And here I solved it:
1) pip install "sklearn==0.22.2"
2) pip install python-semantic-release
My recommendation is to add a section under installation guidelines for installation troubleshooting with the error and the solution.
Action:
Issue 2: Examples in README with output or input problems:
>>> from pylaundry.categorize import categorize
>>> import pandas
>>>
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
... 'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
... 'c':['A','B','C','D','E','F','G','H','I','J']})
>>>
>>> # categorize with default max_cat (10)
>>> categorize(df)
README OUTPUT:
{'numeric': ['b'],
'categorical': ['a', 'b']}
ACTUAL OUTPUT:
{'numeric': ['b'],
'categorical': ['c', 'a']}
>>>from pylaundry.fill_missing import fill_missing
>>>import pandas
>>>import numpy as np
>>>df_train = pandas.DataFrame({'a':[1, 2, np.NaN, 4, 4],
>>> 'b':[1.2, 3.4, 3.0, 4.9, np.NaN]})
>>>df_test = pandas.DataFrame({'a':[6, np.NaN, 0],
>>> 'b':[0.5, 9.2, np.NaN]})
>>>fill_missing(df_train, df_test, {'numeric':['b'], 'categorical':['a']},
>>> num_imp = 'median')
There is a cat_imp missing argument that throws an error:
TypeError: fill_missing() missing 1 required positional argument: 'cat_imp'
Action:
Issue 3: Besides the missing examples in docstrings for some functions as mentioned by SAM (the other reviewer), the example for the categorize function did not work properly:
categorize(pd.DataFrame({‘numeric’:[1,2,3,4], ‘cat’:[‘a’,’b’,’c’]}))
Here is the error message: "SyntaxError: invalid character in identifier"
Action:
Issue 4: Proposed end-to-end workflow:
Snaphot of first error:
Snapshot of second error:
Action:
Issue 5: n_features in select_features fucntion:
The Select features function actually provides features when number of features (n_features) argument is greater than true number of features. This is not mentioned in the docstrings or no error message is output from the function:
Here is a snapshot of the issue:
Action:
Hi @robilizando and @SamEdwardes, First of all, thank you very much for spending the time to review our package so thoroughly! Both of you gave us a lot of specific, actionable and helpful feedback which we have attempted to address, as I demonstrate below.
Sam's issues:
pandas.DataFrame.dtypes
as a similar function. Robert's Issues:
scikit-learn
and not sklearn
as we had previously, so installing works now. Thanks for also noting a potential fix!
Thanks again for taking the effort to review pylaundry
, our package is certainly much sleeker now as a result of your suggestions!. The link of the package with latest release is here
Submitting Author: Name (@cgostic, @zanderhinton, @amank90, @arunmarria)
Package Name: pylaundry One-Line Description of Package: Perform standard preprocessing of a dataframe to be used in machine learning algorithms in a streamlined workflow. Repository Link: https://github.com/UBC-MDS/pylaundry Version submitted: 1.0.8 Editor: @kvarada Reviewer 1: @SamEdwardes Reviewer 2: @robilizando Archive: TBD
Version accepted: TBD
Description
The
pylaundry
package performs many standard preprocessing techniques for Pandas dataframes, before use in statistical analysis and machine learning. The package functionality includes categorizing column types, handling missing data and imputation, transforming/standardizing columns and feature selection. Thepylaundry
package aims to remove much of the grunt work in the typical data science workflow, allowing the analyst maximum time and energy to devote to modelling!Scope
* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
All functions in this package take a Pandas DataFrame as an input. Two of the functions return transformed DataFrames (filled missing values, encoded and scaled), and the other two functions return information about the data gleaned from the DataFrame itself (column types, and most important features).
Pylaundry is made for data scientists, or anyone applying statistical methods or machine learning algorithms to their data. It transforms a dataset into a format that is ready to be passed into a
.fit
method, with all NAs imputed, categorical columns encoded, numerical columns scaled, and important features identified.sklearn.Pipeline offers similar functionality for the fill_missing and transform_columns functions, where similar functions can be wrapped in a Pipeline and carried out sequentially. However, creating a pipeline is a several step process depending on how many types of transformations need to be performed. pylaundry can accomplish numerical and categorical transformations in one step.
There are many feature selection packages and functions, for instance sklearn.feature_selection, which carry out similar functionality to our
feature_selector
function. Pylaundry simplifies feature selection by linear and logistic regression into a single function.As far as we know, there are no similar packages for Categorizing Columns.
pylaundry
is the first package we are aware of to abstract away the full dataframe pre-processing workflow with a unified and simple API.@tag
the editor you contacted:Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Code of conduct