Submission Group 15 - snapedautility(Python)

name: snapedautility
about: Python package for peer review
title: 'snapedautility'
labels: 1/editor-checks, New Submission!
assignees: ''

Submitting Author: Name (@AraiYuno)
Package Name: snapedautility
One-Line Description of Package: snapedautility is an open-source library that generates useful functions to kickstart EDA (Exploratory Data Analysis) with just a few lines of code.
Repository Link: https://github.com/UBC-MDS/snapedautility
Version submitted: v2.0.0

Editors

@AraiYuno
@dol23asuka
@harryyikhchan

Reviewers

Reviewer 1: @iamMoid
Reviewer 2: @artanzand
Reviewer 3: @gfairbro
Reviewer 4: @rezam747

Archive: TBD
Version accepted: v2.0.0

Description

Include a brief paragraph describing what your package does: Description: snapedautility is an open-source library that generates useful functions to kickstart EDA (Exploratory Data Analysis) with just a few lines of code. The system is built around quickly analyzing the whole dataset and providing a detailed report with visualizations. Its goal is to support quick analysis of feature characteristics, detection of outliers among the observations, and other such data characterization tasks.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [ ] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [x] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences): This package offers utility functions that provide basic EDA visualizations, including histograms, outlier detection, and correlation plots.
Who is the target audience and what are the scientific applications of this package?
Any data professionals at the entry-level who would like to conduct a quick exploratory data analysis.
Are there other Python packages that accomplish the same thing? If so, how does yours differ? Python packages such as pandas-profiling, sweetviz, and ExploriPy provide a broader range of utility functions for data extraction, transformation, and modeling in addition to exploratory data analysis. snapedautility concentrates only on EDA, which keeps the package lightweight and reduces the risk of dependency conflicts.
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[x] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[ ] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2 hours

Review Comments

Well done team! Snapedautility is a valuable addition to the Python graphics stack. Its main functions are easy to understand and useful, and the whole workflow feels natural and straightforward. I had almost no issues using it with my data, and everything works as promised.

Here are my general comments for improvement:

1 - I am sure a lot of thought went into selecting this name for the package, but it took me a while to figure out how to read it. I understand that it combines snap + eda + utility, but from a user's perspective, typing snapedautility has a high potential for error. You want users to remember the package name and type it quickly when importing, so I suggest something shorter or something that uses a hyphen. If all the good names are already taken, you could even go bold and pick a name that is not directly related to your functions. For example, the name Pandas has nothing to do with pandas, but everyone knows what the package does!

2 - Although names and GitHub usernames were provided, I was not able to find your email addresses to enable contact. It would be nice to add them.

3 - In the "Features" section of the README you mention that detect_outliers generates a violin plot that indicates the outliers that deviate from other observations on data. The example in ReadTheDocs and your script both suggest that this should be a boxplot. This should be an easy fix in the text. A general suggestion is to create a table with one column for function names, one for the arguments each takes, and a last one with a short description of what each function does. This could be a mini-map for your package.

4 - I noticed in your ci-cd.yml that you are not testing whether your package can be installed on Windows or macOS machines. It would be worth adding these to the Continuous Integration section to make sure your software is compatible with all major platforms. Note: you don't need to run on different machines for the CD section.

5 - I received many warnings when installing the package, and I am not sure whether they can be silenced. I saw more than 10 pairs of the warnings below.

WARNING: Ignoring invalid distribution -ywin32 (c:\users\artan\miniconda3\envs\pycounts\lib\site-packages)
WARNING: Ignoring invalid distribution -yrsistent (c:\users\artan\miniconda3\envs\pycounts\lib\site-packages)

6 - In your "Usage" section you create a dataframe called penguins_data, but the user cannot replicate this because it assumes familiarity with palmerpenguins. I recommend adding the import (from palmerpenguins import load_penguins) and then the code that loads the dataframe (something like what you have done in your test scripts).

7 - In CONTRIBUTING.md under "Get Started" you are cloning from a repo that does not exist. You will need to replace it with git clone https://github.com/UBC-MDS/snapedautility.git.
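
Comment 6 could be addressed with a short, copy-pasteable preamble in the "Usage" section. The sketch below is one way to write it. `load_penguins` is the loader named in the comment; the `load_example_data` wrapper and its error message are hypothetical and exist only to give a clear message when palmerpenguins is not installed.

```python
# Sketch of a reproducible "Usage" preamble (assumes `pip install palmerpenguins`).
import importlib.util


def load_example_data():
    """Return the Palmer penguins data as a pandas DataFrame.

    Hypothetical wrapper: it only adds a readable error when the
    palmerpenguins package is missing.
    """
    if importlib.util.find_spec("palmerpenguins") is None:
        raise ImportError(
            "palmerpenguins is not installed; run `pip install palmerpenguins`"
        )
    # Imported lazily so the guard above can report a missing package first.
    from palmerpenguins import load_penguins

    return load_penguins()


# Typical use in a README or notebook:
#     penguins_data = load_example_data()
#     penguins_data.head()
```

In the README itself, the two essential lines are the import and `penguins_data = load_penguins()`; the guard is optional.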

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[ ] Citation information

Usability

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software have been confirmed.
[X] Performance: Any performance claims of the software have been confirmed.
[X] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[X] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5

Review Comments

Well done team! This package is lightweight, functional and focused. The source is easy to understand and the functionality is straightforward but of great utility. Nice work! Between your excellent work and Artan's meticulous review above, it was hard to find flaws, but I do have a few improvements to suggest:

There is no detailed citation of the other packages you have used in your utility, such as altair and pandas. It would be considerate to credit those packages.
You could consider including the input feature names in the titles of your plots so that the titles are less generic.
The same idea applies to axis labels: you could apply title case and string replacement to the inputs (it wouldn't be perfect, of course, but could be pretty flexible).
The documentation of detect_outliers indicates a violin plot, but a box plot is generated.
The examples from the docstrings don't really work as written: plot corr is not quoting the right names from penguins, plot histograms makes a slightly vague reference to the palmer penguins data, and the detect_outliers example could better demonstrate how to capture its output. The example.ipynb is excellent, however, so this is a minor complaint.
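
The title and axis-label suggestions above could be prototyped with a small helper like the one below. This is only a sketch: `prettify_label` and `histogram_title` are hypothetical names, not functions from the package, and the title wording is an assumption.

```python
def prettify_label(column_name: str) -> str:
    """Turn a raw column name into a display label.

    Underscores become spaces and each word is capitalized, e.g.
    'bill_length_mm' -> 'Bill Length Mm'. Units such as 'mm' are
    capitalized too, which is the "not perfect" part of the approach.
    """
    return column_name.replace("_", " ").strip().title()


def histogram_title(column_name: str) -> str:
    """Build a plot title that names the feature being plotted."""
    return f"Histogram of {prettify_label(column_name)}"


label = prettify_label("bill_length_mm")
title = histogram_title("body_mass_g")
```

The resulting strings could then be passed to whatever title and axis-title arguments the plotting code already uses.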

Overall I consider these issues minor and would like to congratulate you again on a well executed development project.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[ ] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5

Review Comments

Great job team! The main functions of snapedautility are easy to understand, and the package was easy to install. The project is done very well overall, but I do have a few suggestions for improvement:

1- In the README.md, the doc badge link is not working: it only shows the picture of the badge. It should direct the user to the files deployed by readthedocs.

2- Since this is an EDA package, it would help to add a few images of your functions' output to the usage section of the README.md file so users can see what the output looks like.

3- The name of the detect_outliers function does not exactly represent what the function does. In the README.md documentation you mention that it generates a violin plot; however, it generates a boxplot. Also, the function does not detect anything, it only plots, so the name is somewhat misleading.

4- In the detect_outliers documentation, the lower bound shown on the plot is above 2000, but the value printed on top of the chart gives 1750 as the lower bound. The plot and the numbers do not match, and this needs to be fixed.

5- In the documentation, badges such as ci-cd are not shown on the main page.

6- In the Usage section of the README.md file, you create a dataframe called penguins_data, but this is not reproducible because the line that imports the function needed to create that dataframe is missing: "from palmerpenguins import load_penguins".
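
Comments 3 and 4 both concern how outlier bounds are computed and reported. As a minimal sketch, assuming the conventional 1.5 × IQR rule (the package's actual rule is not stated in this review and may differ), the bounds can be computed once and reused for both the printed numbers and the plot. All function names here are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier fences using the k * IQR rule.

    Assumption: k = 1.5 (Tukey's convention); the package may use another rule.
    """
    clean = sorted(v for v in values if v is not None)
    q1, _median, q3 = quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the observations that fall outside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v is not None and (v < lower or v > upper)]


def whisker_ends(values, k=1.5):
    """Return the most extreme observations that are still inside the fences."""
    lower, upper = iqr_bounds(values, k)
    inside = [v for v in values if v is not None and lower <= v <= upper]
    return min(inside), max(inside)


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper = iqr_bounds(sample)   # fences
outliers = find_outliers(sample)    # points beyond the fences
whiskers = whisker_ends(sample)     # where a Tukey-style whisker would stop
```

For this sample the fences are -3.5 and 14.5 while the whiskers would end at 1 and 9, so a drawn whisker need not coincide with the computed bound. Whether that explains the mismatch in comment 4 is something only the authors can confirm.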

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[ ] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5

Review Comments

Amazing job team. The package works as intended and described. I like how the name conveys the benefit of being able to generate plots swiftly. My fellow reviewers above have provided detailed feedback which I agree with. Additionally, I noticed a few minor edits as listed below:

In the README file, the badge for docs passing appears in square brackets [ ]. I believe this was not intended; it should be an easy fix.
I noticed that the scripts import the numpy library; however, I do not see it referenced anywhere in the code, which makes sense as the plotting functions rely primarily on altair. I see the import as unnecessary and suggest removing it.
The tests are very comprehensive and do a good job of testing required areas. I did notice in the test_plot_histograms.py script that a couple of functions do not have comments/docstrings while others do. I would suggest including brief docstrings to help understand the functions.
The USAGE section on the README file directly dives into the code blocks. I believe it will be useful to consolidate the Features and Usage sections such that the feature description is followed by the corresponding usage code block for better readability.
The examples in the function docstrings do not work as intended because the sample penguins dataset is never loaded.
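
One way to keep docstring examples from drifting is to make each example self-contained and run it with the standard-library doctest module. The sketch below uses a hypothetical helper, `count_missing`, which is not part of the package, purely to show the pattern: the example builds its own data, so it does not depend on a dataset that was never loaded.

```python
import doctest


def count_missing(rows, column):
    """Count the rows in which `column` is missing (None).

    Hypothetical helper, used only to illustrate a self-contained docstring.

    Examples
    --------
    >>> rows = [
    ...     {"species": "Adelie", "sex": None},
    ...     {"species": "Gentoo", "sex": "male"},
    ... ]
    >>> count_missing(rows, "sex")
    1
    """
    return sum(1 for row in rows if row.get(column) is None)


def run_docstring_examples(func):
    """Run the examples in func's docstring and return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    globs = {func.__name__: func}
    for test in finder.find(func, name=func.__name__, globs=globs):
        runner.run(test)
    return runner.failures, runner.tries


failed, attempted = run_docstring_examples(count_missing)
```

A check like this can run in the existing test suite, so a docstring example that stops working shows up as a failing test.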

Overall great effort team and kudos for 100% code coverage!

UBC-MDS / software-review-2022