Submission: datascience_eda (Python)

Submitting Author: Mai Le (@lephanthuymai)
Other Authors: Aditya Bhatraju (@adibns), Charles Suresh (@charlessuresh), Rahul Kuriyedath (@rahulkuriyedath)
Package Name: datascience_eda
One-Line Description of Package: This package includes functions handling various common tasks during the exploratory data analysis stage of a data science project.
Repository Link: https://github.com/UBC-MDS/datascience_eda
Version submitted: https://github.com/UBC-MDS/datascience_eda/releases/tag/milestone4
Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD

Description

This package includes functions assisting data scientists with common tasks during the exploratory data analysis stage of a data science project. Its functions will help the data scientist to do preliminary analysis on common column types like numeric columns, categorical columns, and text columns; it will also conduct several experimental clusterings on the dataset.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [x] Data extraction
- [ ] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [x] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The datascience_eda package provides functions that automate most preliminary exploratory data analysis tasks, extract useful insights from the dataset, and generate plots to visualize the findings.

Who is the target audience and what are the scientific applications of this package?

The target audience of this package is data scientists. It will help to improve the efficiency of the EDA process.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

There are various Python packages providing functions for EDA. Most of them focus on identifying anomalies in numeric columns, and their exact functionality differs from ours. Furthermore, to our knowledge there is no EDA-related package that provides functions to handle text columns and data clustering.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Assigning @zmerpez & @YikiSu as reviewers.

Package Review

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3 hours

Review Comments

Installation

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacymoji 2.0.0 requires spacy<3.0.0,>=2.1.3, but you have spacy 3.0.5 which is incompatible.
shap 0.37.0 requires slicer==0.0.3, but you have slicer 0.0.7 which is incompatible.
Successfully installed catalogue-2.0.1 datascience-eda-0.1.6 numpy-1.19.5 pathy-0.4.0 pydantic-1.7.3 seaborn-0.11.1 sklearn-0.0 smart-open-3.0.0 spacy-3.0.5 spacy-legacy-3.0.1 srsly-2.4.0 textblob-0.15.3 thinc-8.0.2 typer-0.3.2 wasabi-0.8.2 wordcloud-1.8.1 yellowbrick-1.3.post1

This should be solvable by changing the package versions specified in the toml file.
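
Before changing any version constraints, it can help to confirm which versions are actually installed in the environment. The following is a minimal sketch using only the standard library's `importlib.metadata` (Python 3.8+). The helper name `installed_version` is hypothetical and not part of datascience_eda; the package names are simply the ones mentioned in the pip output above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages named in the dependency-conflict message above.
for name in ("spacy", "spacymoji", "slicer", "shap"):
    print(f"{name}: {installed_version(name)}")
```

Comparing these values with the constraints reported by pip shows which entries in the toml file would need to be loosened or tightened.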

Usage

When I tried to check your package usage by following the usage section on your README page, I was not able to load the data and run the function calls. I would highly recommend including a link to the test file "/data/menu.csv" in your README. It would also be great if the following lines were added to your usage section:

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer

I was not able to import your package as suggested in the usage section, but it worked after I tried: import datascience_eda.datascience_eda as eda. I googled and found out that it is probably because the functions are not exposed in `__init__.py`. I guess my group would have the same problem as well. After updating this import line, I could access the functions in your package.
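
To illustrate the `__init__.py` point, here is a minimal, self-contained sketch. It builds a throwaway package in a temporary directory and checks whether a function is reachable from the top-level import, with and without a re-export line in `__init__.py`. The package name `demo_eda` and the function `explore` are hypothetical stand-ins, not the actual layout of datascience_eda.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root, reexport):
    """Create a tiny package whose function lives in an inner module."""
    pkg = root / "demo_eda"
    pkg.mkdir()
    (pkg / "demo_eda.py").write_text("def explore():\n    return 'explored'\n")
    # The re-export line is what makes `import demo_eda as eda; eda.explore()` work.
    init_body = "from .demo_eda import explore\n" if reexport else ""
    (pkg / "__init__.py").write_text(init_body)


def forget_demo_modules():
    """Drop cached copies so each call imports the freshly written package."""
    for name in [m for m in sys.modules if m == "demo_eda" or m.startswith("demo_eda.")]:
        del sys.modules[name]


def top_level_has_function(reexport):
    """Return True if `explore` is available directly on the package."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp), reexport)
        sys.path.insert(0, tmp)
        try:
            forget_demo_modules()
            importlib.invalidate_caches()
            pkg = importlib.import_module("demo_eda")
            return hasattr(pkg, "explore")
        finally:
            sys.path.remove(tmp)
            forget_demo_modules()


print(top_level_has_function(reexport=False))
print(top_level_has_function(reexport=True))
```

With an empty `__init__.py`, users have to import the inner module explicitly, which matches the workaround described above. With the re-export line, the shorter import from the usage section would be enough.
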
Since some of your functions generate a lot of plots in a single call, my JupyterLab initially refused to show the plots when I ran them. It worked after I added the following:

import matplotlib as plt
plt.rcParams.update({'figure.max_open_warning': 0})

It would be nice if you could include this in your usage section.

General Comments

Overall, it is a very cool package! Very well done, group datascience_eda. You made a great package that performs thorough EDA on datasets and provides insightful information. It will come in very handy for some of my projects! Good job!

Package Review

[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s) demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all user-facing functions
[X] Examples for all user-facing functions
[X] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[X] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[X] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[X] The package name
[X] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[X] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[X] Installation instructions
[X] Any additional setup required (authentication tokens, etc)
[X] Brief demonstration usage
[X] Direction to more detailed documentation (e.g. your documentation files or website).
[X] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[X] Citation information

Usability

[X] The documentation is easy to find and understand
[X] The need for the package is clear
[X] All functions have documentation and associated examples for use

Functionality

[X] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software have been confirmed.
[X] Performance: Any performance claims of the software have been confirmed.
[X] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[X] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[X] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2 Hours

Review Comments

Install

The line provided in the README worked well for me. I just needed to import the package as Yiki suggested above, using import datascience_eda.datascience_eda as eda.

Usage

I think your main functions do not need imputation, and hence do not need the sklearn functions at the start. I tried from palmerpenguins import load_penguins and penguin_df = load_penguins(), and all functions worked fine on this data without imputation. That might decrease the number of functions in your package.

Note

In the usage example, the data frame that results from the transformation only has the numerical columns, so calling the other 3 functions on it would give an error. You can either add "passthrough" or pass the original data frame as the parameter to your functions.
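
The effect described in this note can be sketched as follows. This assumes the usage example builds a scikit-learn column transformer over the numeric columns only, and that "passthrough" refers to scikit-learn's `remainder="passthrough"` option. The tiny data frame and its column names are made up for illustration and are not taken from menu.csv.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one numeric column with a missing value, one categorical column.
df = pd.DataFrame(
    {
        "calories": [250.0, None, 540.0],
        "category": ["Breakfast", "Snacks", "Beef & Pork"],
    }
)
numeric_cols = ["calories"]

numeric_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Default behaviour: columns not listed are dropped from the output.
drop_rest = make_column_transformer((numeric_pipeline, numeric_cols))

# remainder="passthrough" keeps the unlisted columns, appended after the transformed ones.
keep_rest = make_column_transformer(
    (numeric_pipeline, numeric_cols), remainder="passthrough"
)

dropped = drop_rest.fit_transform(df)
kept = keep_rest.fit_transform(df)

print(dropped.shape)  # (3, 1): only the numeric column survives
print(kept.shape)     # (3, 2): the categorical column is carried through
```

Either approach should avoid the error: keep the remaining columns with `remainder="passthrough"`, or call the categorical and text functions on the original `df` instead of the transformed output.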

Overall, I liked the visualizations. I might prefer the output of explore_categorical_columns to be something a bit easier to read, like a data frame. I saw quite a lot of functions while expecting only 4, which is a really good amount of work =) I wish the team had a better balance in the workload, given that you have created such a big package.

UBC-MDS / software-review-2021