Submitting Author:

Yihong (Hazel) Jiang(@HazelJJJ)
Mike Lynch (@mikelynch416)
Trevor Kinsey (@trevorki)
Sasha Babicki (@sbabicki)

Package Name: Kmeaningful One-Line Description of Package: Python package that contains functions to help with data preprocessing, hyperparameter tuning and visualizing clusters using the k-means algorithm. Repository Link: Kmeaningful Version submitted: 0.3.1 Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD

Description

Have you ever encountered a dataset that seems to have different patterns in it? Have you ever tried to group similar things together in a dataset and to assign a new sample based on your findings? Kmeaningful is a python package that uses the k-means algorithm to find clusters and assign new data points to them.

The function preprocess() automatic dataset preprocess.
The function fit_assign() finds centroid location for all of the clusters and assigns each example to a cluster
The function find_elbow() automatic hyperparameter tuning to select optional number of cluster
The function show_clusters() visualize clusters and centroid according to 2d PCA representation

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [x] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Explain how the and why the package falls under these categories (briefly, 1-2 sentences):

Kmeaningful falls into data munging category because it intends to take raw data and perform some preprocessing to the data as well as help with finding cluster and the centroid of the cluster. In addition, it also returns a plot visualizing the results, so it falls into the data visualization category as well.
Who is the target audience and what are scientific applications of this package?

Target audience would be people who want to perform simple unsupervised learning to dataset using k-means method.
Are there other Python packages that accomplish the same thing? If so, how does yours differ?

Python packages like the Scikit-learn sklearn.cluster.KMeans and SciPy scipy.cluster.vq.kmeans are doing similar things. We are not trying to break new ground with kmeaningful, but rather to build a simple and lightweight implementation from scratch.
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. *Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x ] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x ] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x ] Installation instructions: for the development version of package and any non-standard dependencies in README
[x ] Vignette(s) demonstrating major functionality that runs successfully locally
[x ] Function Documentation: for all user-facing functions
[x ] Examples for all user-facing functions
[x ] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x ] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x ] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x ] The package name
[x ] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x ] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x ] Installation instructions
[x ] Any additional setup required (authentication tokens, etc)
[x ] Brief demonstration usage
[x ] Direction to more detailed documentation (e.g. your documentation files or website).
[x ] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x ] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

[x ] The documentation is easy to find and understand
[x ] The need for the package is clear
[x ] All functions have documentation and associated examples for use

Functionality

[x ] Installation: Installation succeeds as documented.
[x ] Functionality: Any functional claims of the software been confirmed.
[x ] Performance: Any performance claims of the software been confirmed.
[x ] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- test raised 2 warnings (see below)
[x ] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5

Review Comments

Here are some of the points that I wanted to just point out. Hopefully it is of help to you guys!

Installation: I copied the pip3 install code into my terminal however it came ran into an error. ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/Library/Python/3.8' Consider using the--useroption or check the permissions.
- Upon adding --user I was able to successfully install the package. I’m not too familiar with software end of things, so not sure if this is just a personal computer issue or not. If not, perhaps adding --user to the end of the installation code chunk would help avoid this issue. I also noticed it seemed to be linked to the 0.2.0 version, which may not be the latest release.
Pytest: After cloning the repo and running poetry install and poetry run pytest, I had 9 pass and 2 warnings. These are just warnings, but I wanted to bring them to your attention just in case. The warnings messages as follows:

-RuntimeWarning: Mean of empty slice. return _methods._mean(a, axis=axis, dtype=dtype,

-RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount)
README Examples: I think the example in the README would benefit from showing the output of what each function does. I think seeing it laid out in such a nice short code chunk is nice, however it is harder at a glance for a user to figure what the output might be. For example, showing the output array on preprocess() or the output plot on show_clusters() would be a nice little touch.
Readthedocs: The readthedocs is good and I was able to follow the installation instructions for cloning the repo. The module index section is good too, but the parameters and returns section seem to have been messed up a bit in the formatting.
fit_assign():
- I would consider hiding the helper functions for fit_assign(). It is really neat and impressive function that has a lot of components, but it may be better to not include the helper functions into the readthedocs as the user probably doesn’t need to know about those. However, if those functions are intended to be made known to the public I think the formatting as it stands is a bit cluttered (most likely due to the automatic rendering of readthedocs).
- To expand a bit on this, I thought the fit() and assign() functions within fit_assign() were simply helper functions, but in the usage section those functions are called separately so it seems to indicate these are not simply helper functions. Personally, I think for simplicity purposes it would be easier and in-line with your package to only allow users to call fit_assign(). However, I understand why you may choose otherwise.

Overall Comments

Great job team! I think your package is great. The README does a great job in giving a good overview of what the package does and the functions are tested, written defensively and run well. I really appreciated the thorough comments explaining the code. Overall, I think the package accomplishes the task you set out to do as it is simple to use and lightweight.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements

The package meets the readme requirements below:

[x] Package has a README.md file in the root directory. The README should include, from top to bottom:
[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine. test raised 2 warnings (see below)
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS The package has an obvious research application according to JOSS's definition in their submission requirements. Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

A short summary describing the high-level functionality of the software Authors: A list of authors with their affiliations A statement of need clearly stating problems the software is designed to solve and its target audience. References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5

Review Comments

Hey folks! Excellent work putting this package together. I was able to install the package using the pip command without any issues. Please find some feedback on the package below:

README Examples: I would echo @micahkwok comments about including the output of the each function in the README vignette. It would really help to be able to visualize your package's functionality through photos or gifs. This is especially true for the output of the show_clusters function.
Examples on readthedocs.io : The examples from readthedocs.io do not seem to be rendering properly on my end (i.e. the function from your package is on a separate line from the rest fo the examples). I am attaching a screenshot for your review.
find_elbow function: I wonder if it would be helpful to have the find_elbow function return an elbow/scree plot as well as the optimal K. This would be a functionality similar to the yellowbrick KElbowVisualizer. Sometimes detecting the elbow is subjective and the user may be in a better position to decide on the elbow based on the problem at hand. I do appreciate that the user can pass the optimal_K value of their choosing onto the fit_assign function; it makes the package more flexible.
Visualization: I realized that the cross marks denoting cluster centers hide circular data points. I wonder if it may be useful to be able to visualize both the data point and the cluster centers when the two are superimposed through changing the opacity of the crosses.
Code coverage badge: Just thought I would flag that your code coverage badge is not dynamically updating.

Overall Comments  Excellent work building a simple and lightweight implementation of k-means clustering from scratch. I think your package accomplishes the goals you had set for it. I thought the code was very well-written and the functions worked seamlessly. I especially appreciated that the preprocess function could handle multiple outputs(e.g. Pandas DataFrame or numpy array). Great work overall.

UBC-MDS / software-review-2021

Submission: Kmeaningful(Python) #33

Description

Scope

Technical checks

Publication options

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

Code of conduct

Editor and Review Templates

Package Review

Documentation

Usability

Functionality

For packages co-submitting to JOSS

Final approval (post-review)

Review Comments

Package Review

Documentation

Readme requirements

Usability

Functionality

Final approval (post-review)

Review Comments