Submission: pylaundry (Python)

Submitting Author: Name (@cgostic, @zanderhinton, @amank90, @arunmarria)
Package Name: pylaundry One-Line Description of Package: Perform standard preprocessing of a dataframe to be used in machine learning algorithms in a streamlined workflow. Repository Link: https://github.com/UBC-MDS/pylaundry Version submitted: 1.0.8 Editor: @kvarada Reviewer 1: @SamEdwardes Reviewer 2: @robilizando Archive: TBD
Version accepted: TBD

Description

The pylaundry package performs many standard preprocessing techniques for Pandas dataframes, before use in statistical analysis and machine learning. The package functionality includes categorizing column types, handling missing data and imputation, transforming/standardizing columns and feature selection. The pylaundry package aims to remove much of the grunt work in the typical data science workflow, allowing the analyst maximum time and energy to devote to modelling!

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [ ] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Explain how the and why the package falls under these categories (briefly, 1-2 sentences):

All functions in this package take a Pandas DataFrame as an input. Two of the functions return transformed DataFrames (filled missing values, encoded and scaled), and the other two functions return information about the data gleaned from the DataFrame itself (column types, and most important features).

Who is the target audience and what are scientific applications of this package?

Pylaundry is made for data scientists, or anyone applying statistical methods or machine learning algorithms to their data. It transforms a dataset into a format that is ready to be passed into a .fit method, with all NAs imputed, categorical columns encoded, numerical columns scaled, and important features identified.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

sklearn.Pipeline offers similar functionality for the fill_missing and transform_columns functions, where similar functions can be wrapped in a Pipeline and carried out sequentially. However, creating a pipeline is a several step process depending on how many types of transformations need to be performed. pylaundry can accomplish numerical and categorical transformations in one step.

There are many feature selection packages and functions, for instance sklearn.feature_selection, which carry out similar functionality to our feature_selector function. Pylaundry simplifies feature selection by linear and logistic regression into a single function.

As far as we know, there are no similar packages for Categorizing Columns. pylaundry is the first package we are aware of to abstract away the full dataframe pre-processing workflow with a unified and simple API.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

30 minutes

Review Comments

General Comments

Overall I think you have done an excellent job with this packages. I think it addresses a specific need, and your documentation makes it very easy to understand how the package works. All of the feedback I have is pretty minor. Once addressed, I would feel confident approving this package.

Specific Issues

Issue 1: Examples in README do not work as shown. For example:

>>> from pylaundry import categorize
>>> import pandas
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
...                        'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
...                        'c':['A','B','C','D','E','F','G','H','I','J']})
>>> # categorize with default max_cat (10)
>>> pylaundry.categorize(df)
NameError: name 'pylaundry' is not defined

I think the issue is regarding how you are importing. If I do the following, it works:

>>> from pylaundry.categorize import categorize
>>> import pandas
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
...                        'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
...                        'c':['A','B','C','D','E','F','G','H','I','J']})
>>> # categorize with default max_cat (10)
>>> pylaundry.categorize(df)
{'numeric': ['b'], 'categorical': ['c', 'a']}

I think this will hold true with all of your examples.

Issue 2: Typos

There are several typos in the REAMDE. I quick spell check should fix this. For example training has been spelt incorrectly in several places as trianing. Please run spell check on the entire document.

Issue 3: Missing examples in doc strings

fill_missing() has no example in the doc string
select_features has no example in the doc string
transform_columnshas no example in the doc string

Issue 4: Other options for checking column types

In your README you mention you are not aware of any other packages for categorizing columns. You may want to mention pandas.DataFrame.dtypes which does this. However, I think your function is still useful, as it returns a dictionary, where as the built in pandas method returns a pandas.Series.

Issue 5: Excessive functions exposed to the user in documentation.

Looking at readthedocs there are extra functions here exposed to the user. I would expect to only see four. Is there a way to hide the ones that should not be user facing? Also some functions appear twice, making it a bit more confusing to read.

Screen Shot 2020-03-16 at 3 09 12 PM

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

30 minutes

Review Comments

General Overview Comments

I was very impressed to see how much work and effort you have put into this package. It looks really professional. The documentation covers a lot of details about how it works. I especially like that you have included and workflow of how all the functions can be used in an integrated form.

Specific Issues

Issue 1: Dependency Sklearn==0.22 caused installation problems:

I actually had 0.22.1 and when you try to install you get an error message (see below) because you need scikit-learn==0.22.2. Here is the error I got:

ERROR: Could not find a version that satisfies the requirement scikit-learn<0.23.0,>=0.22.2 (from pylaundry) (from versions: 0.20.3, 0.21.0, 0.22rc2.post1, 0.22)
ERROR: No matching distribution found for scikit-learn<0.23.0,>=0.22.2 (from pylaundry)

And here I solved it:

1) pip install "sklearn==0.22.2"
2) pip install python-semantic-release

My recommendation is to add a section under installation guidelines for installation troubleshooting with the error and the solution.

Action:

[] Update installation documentation to reflect error and solution.

Issue 2: Examples in README with output or input problems:

Example 1 categorize function:

>>> from pylaundry.categorize import categorize
>>> import pandas
>>>
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
...                        'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
...                        'c':['A','B','C','D','E','F','G','H','I','J']})
>>>
>>> # categorize with default max_cat (10)
>>> categorize(df)
README OUTPUT:
{'numeric': ['b'], 
'categorical': ['a', 'b']}

ACTUAL OUTPUT:
{'numeric': ['b'], 
'categorical': ['c', 'a']}

Example 1 fill_missing function:

>>>from pylaundry.fill_missing import fill_missing
>>>import pandas
>>>import numpy as np

>>>df_train = pandas.DataFrame({'a':[1, 2, np.NaN, 4, 4],
>>>                             'b':[1.2, 3.4, 3.0, 4.9, np.NaN]})

>>>df_test = pandas.DataFrame({'a':[6, np.NaN, 0],
>>>                           'b':[0.5, 9.2, np.NaN]})

>>>fill_missing(df_train, df_test, {'numeric':['b'], 'categorical':['a']}, 
>>>             num_imp = 'median')

There is a cat_imp missing argument that throws an error:

TypeError: fill_missing() missing 1 required positional argument: 'cat_imp'

Action:

[] Update README file with corrected examples inputs or out puts

Issue 3: Besides the missing examples in docstrings for some functions as mentioned by SAM (the other reviewer), the example for the categorize function did not work properly:

categorize(pd.DataFrame({‘numeric’:[1,2,3,4], ‘cat’:[‘a’,’b’,’c’]}))

Here is the error message: "SyntaxError: invalid character in identifier"

Action:

[] Update examples for functions in docstrings with proper working coding

Issue 4: Proposed end-to-end workflow:

Shouldn't this be named begin-to-end workflow?
This workflow is great to give a full overview of how the package is supposed to be implemented. However, there are still errors in that need to be corrected so users can copy and paste and adjust. Right now if you run it like that it gives you variable no define errors, arguments missing etc.

Snaphot of first error:

Snapshot of second error:

Action:

[] Optional Change name to end-to-end workflow
[] Update coding so the workflow works properly.

Issue 5: n_features in select_features fucntion:

The Select features function actually provides features when number of features (n_features) argument is greater than true number of features. This is not mentioned in the docstrings or no error message is output from the function:

Here is a snapshot of the issue:

Action:

[] Update docstring to reflect this condition if this the desired output by designed OR
[] (Optional) Update function to throw an error when user input more features than there are.

Hi @robilizando and @SamEdwardes, First of all, thank you very much for spending the time to review our package so thoroughly! Both of you gave us a lot of specific, actionable and helpful feedback which we have attempted to address, as I demonstrate below.

Sam's issues:

Examples in README do not work as shown.
- We have taken much more care in designing our README examples, and they now all work as shown.
Typos.
- We have passed the package through spell-check, and the typos have been addressed.
Missing examples.
- We have added examples to docstrings in all functions.
Other options for checking column types
- We have included pandas.DataFrame.dtypes as a similar function.
Excessive functions exposed to the user in documentation.
- We have removed the non-user facing functions (WrongData and WrongDataType) from the documentation, which can be seen here Docs

Robert's Issues:

Dependency Sklearn==0.22 caused installation problems.
- We have updated our dependencies to include scikit-learn and not sklearn as we had previously, so installing works now. Thanks for also noting a potential fix!
  1. Examples in README with output or input problems
- We have taken much more care in designing our README examples, which now work as shown.
  1. The example for the categorize function did not work properly.
- This example has been update and now runs properly.
  1. Issue 4: Proposed end-to-end workflow:
- We have updated the workflow so the example runs properly.
  1. Issue 5: n_features in select_features function
- There has been additional line added to the docstring of this function mentioning this.

Thanks again for taking the effort to review pylaundry, our package is certainly much sleeker now as a result of your suggestions!. The link of the package with latest release is here

UBC-MDS / software-review