UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages
1 stars 0 forks source link

Submission: pylaundry (Python) #14

Closed cgostic closed 2 years ago

cgostic commented 4 years ago

Submitting Author: Name (@cgostic, @zanderhinton, @amank90, @arunmarria)
Package Name: pylaundry One-Line Description of Package: Perform standard preprocessing of a dataframe to be used in machine learning algorithms in a streamlined workflow. Repository Link: https://github.com/UBC-MDS/pylaundry Version submitted: 1.0.8 Editor: @kvarada Reviewer 1: @SamEdwardes Reviewer 2: @robilizando Archive: TBD
Version accepted: TBD


Description

The pylaundry package performs many standard preprocessing techniques for Pandas dataframes, before use in statistical analysis and machine learning. The package functionality includes categorizing column types, handling missing data and imputation, transforming/standardizing columns and feature selection. The pylaundry package aims to remove much of the grunt work in the typical data science workflow, allowing the analyst maximum time and energy to devote to modelling!

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

All functions in this package take a Pandas DataFrame as an input. Two of the functions return transformed DataFrames (filled missing values, encoded and scaled), and the other two functions return information about the data gleaned from the DataFrame itself (column types, and most important features).

Pylaundry is made for data scientists, or anyone applying statistical methods or machine learning algorithms to their data. It transforms a dataset into a format that is ready to be passed into a .fit method, with all NAs imputed, categorical columns encoded, numerical columns scaled, and important features identified.

sklearn.Pipeline offers similar functionality for the fill_missing and transform_columns functions, where similar functions can be wrapped in a Pipeline and carried out sequentially. However, creating a pipeline is a several step process depending on how many types of transformations need to be performed. pylaundry can accomplish numerical and categorical transformations in one step.

There are many feature selection packages and functions, for instance sklearn.feature_selection, which carry out similar functionality to our feature_selector function. Pylaundry simplifies feature selection by linear and logistic regression into a single function.

As far as we know, there are no similar packages for Categorizing Columns. pylaundry is the first package we are aware of to abstract away the full dataframe pre-processing workflow with a unified and simple API.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks - [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

SamEdwardes commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

General Comments

Overall I think you have done an excellent job with this packages. I think it addresses a specific need, and your documentation makes it very easy to understand how the package works. All of the feedback I have is pretty minor. Once addressed, I would feel confident approving this package.

Specific Issues

Issue 1: Examples in README do not work as shown. For example:

>>> from pylaundry import categorize
>>> import pandas
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
...                        'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
...                        'c':['A','B','C','D','E','F','G','H','I','J']})
>>> # categorize with default max_cat (10)
>>> pylaundry.categorize(df)
NameError: name 'pylaundry' is not defined

I think the issue is regarding how you are importing. If I do the following, it works:

>>> from pylaundry.categorize import categorize
>>> import pandas
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
...                        'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
...                        'c':['A','B','C','D','E','F','G','H','I','J']})
>>> # categorize with default max_cat (10)
>>> pylaundry.categorize(df)
{'numeric': ['b'], 'categorical': ['c', 'a']}

I think this will hold true with all of your examples.

Issue 2: Typos

There are several typos in the REAMDE. I quick spell check should fix this. For example training has been spelt incorrectly in several places as trianing. Please run spell check on the entire document.

Issue 3: Missing examples in doc strings

Issue 4: Other options for checking column types

In your README you mention you are not aware of any other packages for categorizing columns. You may want to mention pandas.DataFrame.dtypes which does this. However, I think your function is still useful, as it returns a dictionary, where as the built in pandas method returns a pandas.Series.

Issue 5: Excessive functions exposed to the user in documentation.

Looking at readthedocs there are extra functions here exposed to the user. I would expect to only see four. Is there a way to hide the ones that should not be user facing? Also some functions appear twice, making it a bit more confusing to read.

Screen Shot 2020-03-16 at 3 09 12 PM

robilizando commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

General Overview Comments

I was very impressed to see how much work and effort you have put into this package. It looks really professional. The documentation covers a lot of details about how it works. I especially like that you have included and workflow of how all the functions can be used in an integrated form.

Specific Issues

Issue 1: Dependency Sklearn==0.22 caused installation problems:

I actually had 0.22.1 and when you try to install you get an error message (see below) because you need scikit-learn==0.22.2. Here is the error I got:

ERROR: Could not find a version that satisfies the requirement scikit-learn<0.23.0,>=0.22.2 (from pylaundry) (from versions: 0.20.3, 0.21.0, 0.22rc2.post1, 0.22)
ERROR: No matching distribution found for scikit-learn<0.23.0,>=0.22.2 (from pylaundry)

And here I solved it:

1) pip install "sklearn==0.22.2"
2) pip install python-semantic-release

My recommendation is to add a section under installation guidelines for installation troubleshooting with the error and the solution.

Action:

Issue 2: Examples in README with output or input problems:

>>> from pylaundry.categorize import categorize
>>> import pandas
>>>
>>> df = pandas.DataFrame({'a':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
...                        'b':[1.2, 3.4, 3.0, 4.9, 5.3, 6.1, 8.8, 9.4, 10.4, 1.2],
...                        'c':['A','B','C','D','E','F','G','H','I','J']})
>>>
>>> # categorize with default max_cat (10)
>>> categorize(df)
README OUTPUT:
{'numeric': ['b'], 
'categorical': ['a', 'b']}

ACTUAL OUTPUT:
{'numeric': ['b'], 
'categorical': ['c', 'a']}
>>>from pylaundry.fill_missing import fill_missing
>>>import pandas
>>>import numpy as np

>>>df_train = pandas.DataFrame({'a':[1, 2, np.NaN, 4, 4],
>>>                             'b':[1.2, 3.4, 3.0, 4.9, np.NaN]})

>>>df_test = pandas.DataFrame({'a':[6, np.NaN, 0],
>>>                           'b':[0.5, 9.2, np.NaN]})

>>>fill_missing(df_train, df_test, {'numeric':['b'], 'categorical':['a']}, 
>>>             num_imp = 'median')

There is a cat_imp missing argument that throws an error:

TypeError: fill_missing() missing 1 required positional argument: 'cat_imp'

Action:

Issue 3: Besides the missing examples in docstrings for some functions as mentioned by SAM (the other reviewer), the example for the categorize function did not work properly:

categorize(pd.DataFrame({‘numeric’:[1,2,3,4], ‘cat’:[‘a’,’b’,’c’]}))

Here is the error message: "SyntaxError: invalid character in identifier"

Action:

Issue 4: Proposed end-to-end workflow:

Snaphot of first error: image

Snapshot of second error: image

Action:

Issue 5: n_features in select_features fucntion:

The Select features function actually provides features when number of features (n_features) argument is greater than true number of features. This is not mentioned in the docstrings or no error message is output from the function:

Here is a snapshot of the issue: image

Action:

zanderhinton commented 4 years ago

Hi @robilizando and @SamEdwardes, First of all, thank you very much for spending the time to review our package so thoroughly! Both of you gave us a lot of specific, actionable and helpful feedback which we have attempted to address, as I demonstrate below.

Sam's issues:

  1. Examples in README do not work as shown.
    • We have taken much more care in designing our README examples, and they now all work as shown.
  2. Typos.
    • We have passed the package through spell-check, and the typos have been addressed.
  3. Missing examples.
    • We have added examples to docstrings in all functions.
  4. Other options for checking column types
    • We have included pandas.DataFrame.dtypes as a similar function.
  5. Excessive functions exposed to the user in documentation.
    • We have removed the non-user facing functions (WrongData and WrongDataType) from the documentation, which can be seen here Docs

Robert's Issues:

  1. Dependency Sklearn==0.22 caused installation problems.
    • We have updated our dependencies to include scikit-learn and not sklearn as we had previously, so installing works now. Thanks for also noting a potential fix!
      1. Examples in README with output or input problems
    • We have taken much more care in designing our README examples, which now work as shown.
      1. The example for the categorize function did not work properly.
    • This example has been update and now runs properly.
      1. Issue 4: Proposed end-to-end workflow:
    • We have updated the workflow so the example runs properly.
      1. Issue 5: n_features in select_features function
    • There has been additional line added to the docstring of this function mentioning this.

Thanks again for taking the effort to review pylaundry, our package is certainly much sleeker now as a result of your suggestions!. The link of the package with latest release is here