UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: pyedahelper (Python) #26

Open scao1 opened 4 years ago

scao1 commented 4 years ago

Submitting Author: Ofer Mansour (@ofer-m), Suvarna Moharir (@suvarna-m), Subing Cao (@scao1), Manuel Maldonado (@manu2856)
Package Name: pyedahelper
One-Line Description of Package: A Python package that simplifies the main EDA procedures, such as outlier identification, data visualization, correlation, and missing data imputation.
Repository Link: https://github.com/UBC-MDS/pyedahelper
Version submitted: 0.1.13
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Sarah Weber (@sweber15)
Reviewer 2: Jarvis Nederlof (@jnederlo)
Archive: TBD
Version accepted: TBD


Description

Data understanding and cleaning represent 60% of a data scientist's time in any given project. The goal of this package is to simplify this process and make more efficient use of that time while working on some of the main procedures done in EDA (outlier identification, data visualization, correlation, missing data imputation).

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

The goal of this package is to simplify the process of data understanding and cleaning, and to make more efficient use of time while working on some of the main procedures done in EDA (outlier identification, data visualization, correlation, missing data imputation).

Our target audience is anyone who wants a quick understanding of their data through data cleaning and visualization. Our software provides an efficient and user-friendly solution for EDA.

At this time, there are several Python packages with similar functionality that are used during EDA. Nevertheless, most of these existing packages require multiple steps or produce results that could be simplified. In the pyedahelper package, the focus is to minimize the code a user needs to write to generate significant conclusions about outliers, missing data treatment, data visualization, and correlation computation and visualization. The usage sketch below illustrates the intended workflow.
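
To make the intended workflow concrete, here is a minimal usage sketch. The function names (fast_outlier_id, fast_missing_impute, fast_corr, fast_plot) come from the package, but the exact signatures shown are assumptions for illustration and may differ from the released API.

```python
import pandas as pd
import pyedahelper

# Toy dataframe with one missing value and one extreme value (illustrative only).
df = pd.DataFrame({
    "col_a": [1.0, 2.0, 3.0, 250.0],    # 250.0 stands in for an outlier
    "col_b": [10.0, None, 12.0, 13.0],  # one missing value
})

# Flag potential outliers in the selected columns (signature assumed).
outlier_summary = pyedahelper.fast_outlier_id(df, cols=["col_a"])

# Impute missing values with a chosen method (arguments taken from the review comments below).
imputed = pyedahelper.fast_missing_impute(df=df, method="mean", cols=["col_b"])

# Compute and visualize correlations between columns (signature assumed).
corr_plot = pyedahelper.fast_corr(df, ["col_a", "col_b"])

# Produce a quick plot of one column against another (signature assumed).
plot = pyedahelper.fast_plot(df, x="col_a", y="col_b", plot_type="scatter")
```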

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser, text-based review. It will also allow you to demonstrate addressing the issues via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

jnederlo commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 4


Review Comments

Feedback Suggestions

sweber15 commented 4 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 3


Review Comments

Feedback Suggestions:

  1. In your README you state: "Data understanding and cleaning represents 60% of data scientist's time in any given project." It would be great if you could cite this, as it is a powerful statement of why your package is needed.

  2. Spell check the README because there are some minor grammar mistakes:

  3. Most error messages I got were helpful, like throwing an error when the input is not a dataframe. An error message that could use improvement is the one for the fast_missing_impute function. I put in a random method (`pyedahelper.fast_missing_impute(df=sample_data, method="none", cols=["col_a", "col_b"])`) and got `AssertionError: Not a valid method!`. The error does tell me what is wrong, but I wasn't sure what to put in instead. An error message like `Not a valid method! Change method to "mean", "median" or "mode"` (or whichever other methods you accept) would be clearer. You already do this for the fast_plot function when plot_type is not one of your accepted types. (A sketch of such a message follows this list.)

  4. The test function test_response_fast_outliers_id in test_fast_outlier_id.py is missing a docstring or any documentation. (A docstring sketch follows this list.)

  5. Your fast_corr function with a single column name produced an appropriate error message; however, testing it with one numeric and one non-numeric column produced a blank plot. You should include an additional test for the case where one numeric and one non-numeric column name are provided. (A test sketch follows this list.)

  6. The fast_plot function should add a title to the output plot. You should be able to build one with an f-string (e.g. f"{plot_type}"). You can read more on Real Python. (A title sketch follows this list.)

  7. The Read the Docs documentation for the fast_corr and fast_outlier_id functions is not being rendered correctly from your docstrings. The fast_corr issue may be due to using "Arguments" instead of "Parameters". It renders fine using ?pyedahelper.fast_corr in Python. (A docstring-format sketch follows this list.)
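
Regarding item 3, a minimal sketch of a more informative check inside fast_missing_impute; the accepted-method set and internal names below are assumptions, not the package's actual implementation.

```python
# Sketch only: the real fast_missing_impute internals are not shown in this thread.
VALID_METHODS = {"mean", "median", "mode"}  # assumed set of accepted methods


def fast_missing_impute(df, method, cols):
    # Listing the accepted values in the message tells the caller how to fix the call.
    assert method in VALID_METHODS, (
        f"Not a valid method! Choose one of: {', '.join(sorted(VALID_METHODS))}"
    )
    # ... imputation logic unchanged ...
```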
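
For item 4, a docstring like the one below would be enough; the wording is only a suggestion, since the actual test body is not reproduced in this thread.

```python
def test_response_fast_outliers_id():
    """Check that fast_outlier_id returns the expected summary for a known dataframe.

    This is a documentation sketch only; the existing test body in
    test_fast_outlier_id.py stays as it is.
    """
    ...  # existing assertions
```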
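
For item 5, a hedged pytest sketch of the suggested additional case. Whether fast_corr should raise, warn, or drop the non-numeric column is up to the authors, so the expected exception types here are only one possible choice.

```python
import pandas as pd
import pytest

import pyedahelper


def test_fast_corr_mixed_column_types():
    """fast_corr should not silently return a blank plot for mixed column types."""
    df = pd.DataFrame({"num": [1, 2, 3], "text": ["a", "b", "c"]})
    # One reasonable behaviour: raise a clear error instead of plotting nothing.
    with pytest.raises((AssertionError, TypeError, ValueError)):
        pyedahelper.fast_corr(df, ["num", "text"])
```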
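
For item 6, the title itself is a one-liner; how it is attached to the figure depends on the plotting backend fast_plot uses, which is not stated in this thread, so the final comment only lists common options.

```python
# Inside fast_plot (sketch): derive a readable title from the arguments.
plot_type = "scatter"            # whatever value the caller passed in
x_col, y_col = "col_a", "col_b"  # placeholder column names
title = f"{plot_type.capitalize()} plot of {y_col} vs {x_col}"
# Attach it with whichever backend fast_plot uses, e.g.
# alt.Chart(df, title=title) in Altair or ax.set_title(title) in matplotlib.
```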
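
For item 7, if the docstrings follow the NumPy convention that Sphinx is configured for, the section header needs to be Parameters rather than Arguments. A minimal docstring sketch, with parameter names assumed from the usage seen in this review:

```python
def fast_corr(df, cols):
    """Compute and plot correlations between the selected columns.

    Parameters
    ----------
    df : pandas.DataFrame
        The data to analyze.
    cols : list of str
        Names of the (numeric) columns to correlate.

    Returns
    -------
    altair.Chart or similar
        The correlation plot; the exact return type depends on the backend.
    """
    ...
```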

scao1 commented 4 years ago

Hi @jnederlo, thank you very much for your review. We have addressed the fourth point in your suggestions. Please feel free to track the changes at https://github.com/UBC-MDS/pyedahelper/pull/73. You can find the new release of our package here. Thank you for helping us improve our package!

scao1 commented 4 years ago

Hi @sweber15, thank you very much for your review. We have addressed the first three points in your suggestions. Please feel free to track the changes at https://github.com/UBC-MDS/pyedahelper/pull/72. You can find the new release of our package here. Thank you for helping us improve our package!