Submission: sktidy (Python)

Submitting Author: Jacob McFarlane (@JacobMcFarlane), Asma Al-Odaini (@anodaini), Xudong Yang (@xudongyang2), Heidi Ye (@heidi-ye)
Package Name: sktidy
One-Line Description of Package: Tidy model output for sklearn's LinearRegression and KMeans
Repository Link: sktidy
Version submitted: 0.1.1
Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD

Description

sktidy is a Python package that returns a tidy summary of sklearn LinearRegression and KMeans models through the functions tidy_lr() and tidy_kmeans(). It also returns the model's predictions for the original data through the functions augment_lr() and augment_kmeans(), for LinearRegression and KMeans respectively.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [x] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [ ] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package falls under data extraction because it retrieves data from the models and presents it in a structured way. In addition, it falls under data munging because it transforms the data into a more accessible and appropriate form.
Who is the target audience and what are scientific applications of this package?

The target audience is people who use sklearn models and want a single package that outputs model results in a tidy format, similar to the R package broom.
Are there other Python packages that accomplish the same thing? If so, how does yours differ?

There are currently no packages that implement the same thing. pybroom implements similar functionality, but not for sklearn models.
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[ ] has an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally. Please see the note at the end regarding tidy_kmeans: the vignette and README example seem to throw an error.
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements

The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented. Yes, but please see note below.
[ ] Functionality: Any functional claims of the software have been confirmed. Please see the details below regarding a question on the implementation of tidy_kmeans().
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5 hours

Review Comments

Since the package is currently hosted on test.pypi.org, the installation instructions should include the --extra-index-url argument so that dependencies can be installed from PyPI:
```
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple sktidy
```
Based on the package folder structure, I believe the import statements should say

```
import sktidy.sktidy
```

as opposed to

```
import sktidy
```
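One possible explanation for the import behaviour described above is that the functions live in a submodule and the package's `__init__.py` does not re-export them. The sketch below illustrates that general Python mechanism with two throwaway packages; `plain_pkg`, `reexport_pkg`, `core.py` and `tidy()` are hypothetical names, and whether sktidy is actually laid out this way is an assumption, not something confirmed here.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_package(root: Path, name: str, init_source: str) -> None:
    """Write a tiny package containing a single submodule, core.py."""
    pkg_dir = root / name
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text(init_source)
    (pkg_dir / "core.py").write_text("def tidy():\n    return 'tidy output'\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical package whose __init__.py is empty.
    make_package(root, "plain_pkg", "")
    # Hypothetical package whose __init__.py re-exports the function.
    make_package(root, "reexport_pkg", "from .core import tidy\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        plain_pkg = importlib.import_module("plain_pkg")
        reexport_pkg = importlib.import_module("reexport_pkg")

        # With an empty __init__.py, the top-level import does not expose tidy().
        plain_has_tidy = hasattr(plain_pkg, "tidy")
        # With the re-export, the top-level import is enough.
        reexport_has_tidy = hasattr(reexport_pkg, "tidy")
        reexport_result = reexport_pkg.tidy()

        # Without the re-export, the submodule has to be imported explicitly.
        plain_core = importlib.import_module("plain_pkg.core")
        submodule_result = plain_core.tidy()
    finally:
        sys.path.remove(tmp)

print(plain_has_tidy, reexport_has_tidy)
```

If the package follows the first layout, either the README import could be changed or a re-export could be added to `__init__.py`; which one fits better is the authors' call.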

The current implementation of tidy_kmeans seems to work only when the number of clusters in KMeans is 2. If this is intentional, please update the documentation to reflect this. The following example from the README uses the default value of n_clusters, which is 8.

```
# Importing packages
from sklearn.cluster import DBSCAN, KMeans
from sklearn import datasets
import pandas as pd
import sktidy
# Extracting data and training the clustering algorithm
df = datasets.load_iris(return_X_y = True, as_frame = True)[0]
kmeans_clusterer = KMeans()
kmeans_clusterer.fit(df)
# Getting the tidy df of cluster information
tidy_kmeans(model = kmeans_clusterer, X = df)
```

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-eaa8c2d1fa7b> in <module>
      9 kmeans_clusterer.fit(df)
     10 # Getting the tidy df of cluster information
---> 11 tidy_kmeans(model = kmeans_clusterer, X = df)

~/miniconda3/envs/563/lib/python3.8/site-packages/sktidy/sktidy.py in tidy_kmeans(model, X)
    155         # Getting the cluster center for the given each cluster, reshaping it \
    156         # so pandas behaves itself later
--> 157         cluster_center = model.cluster_centers_[cluster].reshape(
    158             1, cluster_labels.shape[0]
    159         )

ValueError: cannot reshape array of size 4 into shape (1,8)
```
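Reading the traceback above, the failing call appears to reshape a single cluster center (4 values, one per iris feature) into a row whose length is the number of clusters (8). The sketch below reproduces that mismatch with NumPy alone; the zero array is only a stand-in for `model.cluster_centers_`, and `reshape(1, -1)` is shown as one possible alternative, not necessarily the fix the authors would choose.

```python
import numpy as np

# Assumed shapes: KMeans default n_clusters is 8, and iris has 4 features.
n_clusters, n_features = 8, 4

# Stand-in for model.cluster_centers_: one row per cluster, one column per feature.
cluster_centers = np.zeros((n_clusters, n_features))
center = cluster_centers[0]  # a single cluster center, shape (4,)

# Reshaping by the number of clusters fails unless it equals the number of features.
try:
    center.reshape(1, n_clusters)
    reshape_error = None
except ValueError as exc:
    reshape_error = str(exc)

# Letting NumPy infer the row length works for any number of clusters.
row = center.reshape(1, -1)

print(reshape_error)
print(row.shape)
```

Under these assumptions, the original reshape would succeed only when the number of clusters happens to equal the number of features.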

The GitHub repository does not seem to be connected to Codecov, and the badge in the README is not showing coverage.
This one might be out of your control, but the patsy version that your package uses has a deprecated import, which raises a warning during pytest. I would check whether a newer version of patsy could be used instead.

```
====================================================== warnings summary ====================================================
../../../../../.cache/pypoetry/virtualenvs/sktidy-ENUwfNFi-py3.8/lib/python3.8/site-packages/patsy/constraint.py:13
  /home/yazan/.cache/pypoetry/virtualenvs/sktidy-ENUwfNFi-py3.8/lib/python3.8/site-packages/patsy/constraint.py:13: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
    from collections import Mapping

-- Docs: https://docs.pytest.org/en/stable/warnings.html
==================================================== 4 passed, 1 warning in 0.95s =========================================
```
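For context on the warning above, the abstract base classes such as `Mapping` are importable from `collections.abc`, which is the location the deprecation message points to. The snippet below is a small illustration of that import and of checking whether the old alias still exists in the running interpreter; it does not modify patsy and makes no claim about which patsy release contains a fix.

```python
import collections
import sys
from collections.abc import Mapping  # non-deprecated location of the ABC

# The ABC works as usual for isinstance checks.
is_mapping = isinstance({"a": 1}, Mapping)

# The old alias `collections.Mapping` was removed in Python 3.10,
# so this is False on 3.10+ and True (with a warning) on older versions.
old_alias_available = hasattr(collections, "Mapping")

print(sys.version_info[:2], is_mapping, old_alias_available)
```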

I believe your choice of the MIT license falls under the Open Source Initiative approved licenses, so you can check off the box "has an OSI approved license" in your template above.

Great work on the package overall. Could potentially see myself using it one day!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements

The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software have been confirmed.
The usage example of tidy_kmeans() is not working properly; please see the comment section for details.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2-3 hrs

Review Comments

Hi Asma, Heidi, Jacob, and Peter:

Great job on this package. It's a really good tool for preparing analyses of linear regression and KMeans models. I would definitely use this tool.

I just had a few comments on the package.

Summary, Feature/Function Description and Function Docstring:

There seem to be some mismatches among the summary, the feature description, the function docstrings and the actual output. For example, the inertia and silhouette scores for KMeans clustering no longer seem to be output by the augment function for KMeans, but they still show up in the summary, feature description and docstring of tidy_kmeans(). The team may want to update the summary, feature description and docstrings of the functions.

Badges

This is a really minor issue. The Codecov badge in the GitHub repo currently shows a status of unknown. It might be due to a typo in the hyperlink.

Usage Instruction & Example (on README & Read the Docs)

It would be great to have sample output for each function. Consider breaking the 4 functions into 4 separate code blocks and showing their respective output, to help users visualize how each function is used.
To import the package, the instructed import sktidy doesn't seem to work. The following command does work.

```
from sktidy import sktidy
```

The function tidy_kmeans() does not run properly with the sample code. It raises the following error, which may relate to the shape of the dataframe or to the number-of-clusters argument of KMeans(). It would be great if the team could look into it.

```
ValueError                                Traceback (most recent call last)
<ipython-input-75-28772663a5ae> in <module>
      1 kmeans_clusterer = KMeans()
      2 kmeans_clusterer.fit(df)
----> 3 sk.tidy_kmeans(model = kmeans_clusterer, X = df)

~/Documents/mds/block5/524/DSCI_524_collab-sw-dev_students/sktidy/sktidy/sktidy.py in tidy_kmeans(model, X)
    155         # Getting the cluster center for the given each cluster, reshaping it \
    156         # so pandas behaves itself later
--> 157         cluster_center = model.cluster_centers_[cluster].reshape(
    158             1, cluster_labels.shape[0]
    159         )

ValueError: cannot reshape array of size 4 into shape (1,8)
```

Test Script

It would be great to have a brief docstring for each test function, since the test script contains dummy data and 4 test functions.

Potential Future Improvement

I love how tidy_lr() removes the hassle of extracting feature names and their coefficient values for feature importance analysis. It would be great if the function could also support other types of regression models, as well as pipeline objects with steps such as pre-processing (e.g. CountVectorizer) or feature transformation/selection. The latter has been a very tiresome task.
For augment_kmeans, it would be even more convenient if we could add the predicted cluster labels from multiple KMeans models with different hyperparameters (e.g. different n_clusters) to a single dataset, so that we don't end up with a separate augmented dataset for every hyperparameter setting we try.
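To make the augment_kmeans suggestion above more concrete, here is a rough sketch of what such a feature might look like. `augment_many_kmeans` and the `cluster_k{k}` column names are hypothetical and not part of sktidy; the example uses random data and only standard scikit-learn and pandas calls.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def augment_many_kmeans(X, n_clusters_options, random_state=0):
    """Hypothetical helper: add one cluster-label column per n_clusters value."""
    augmented = X.copy()
    for k in n_clusters_options:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        # Fit on the original features only, so earlier label columns are not used.
        augmented[f"cluster_k{k}"] = model.fit_predict(X)
    return augmented


# Small synthetic dataset standing in for the user's data.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 2)), columns=["x1", "x2"])

augmented = augment_many_kmeans(X, [2, 3, 4])
print(augmented.head())
```

Each model is fitted inside the helper here for brevity; a version that accepts already fitted models, as augment_kmeans presumably does, would be an equally reasonable design.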

Overall, the package works well. I am looking forward to the future version.

Let me know if there's anything unclear.

UBC-MDS / software-review-2021