Submission: eda_utils_py (Python)

Package Info

Submitting Author:

Chuang(Frank) Wang (@chuangw46)
Fatime Selimi (@fatse)
Jiacheng Wang (@wangjc640)
Micah Kwok (@micahkwok)

Package Name: eda_utils_py
One-Line Description of Package: Fast way of dealing with outlier and missing values, scaling, and correlation visualization.
Repository Link: eda_utils_py
Version submitted: 0.1.29
Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD

DESCRIPTION

Because data rarely comes ready to be analyzed or used for machine learning right away, this package aims to speed up data cleaning and initial exploratory data analysis (EDA). It focuses on handling outliers and missing values, scaling, and correlation visualization.

The four functions contained in this package are as follows:

imputer(): A function to impute missing values
outlier_identifier(): A function to identify and deal with outliers
cor_map(): A function to plot a correlation matrix of numeric columns in the dataframe
scale(): A function to scale numerical values in the dataset
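
As a rough illustration of string-selected imputation, here is a minimal, dependency-free sketch that works on a single column held as a Python list. It is not the package's implementation: the package operates on pandas dataframes, and the helper name `impute_column` and the `"constant"` strategy are assumptions made for this example. The `strategy` and `fill_value` parameter names mirror the `imputer(df, strategy="mean", fill_value=None)` signature quoted later in the review.

```python
import math
from statistics import mean, median


def impute_column(values, strategy="mean", fill_value=None):
    """Return a copy of `values` with missing entries (None or NaN) filled in.

    Hypothetical helper for illustration only; not part of eda_utils_py.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]

    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "constant":  # assumed strategy name
        if fill_value is None:
            raise ValueError("fill_value is required when strategy='constant'")
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    return [fill if is_missing(v) else v for v in values]


column = [1.0, None, 3.0, float("nan"), 8.0]
filled_mean = impute_column(column)                       # fills with 4.0
filled_median = impute_column(column, strategy="median")  # fills with 3.0
filled_const = impute_column(column, strategy="constant", fill_value=0.0)
```

Passing the strategy as a string keeps the call site readable, at the cost of needing an explicit error for unrecognised names.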

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data munging
- [ ] Data deposition
- [x] Data visualization
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [x] Exploratory data analysis

*Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package falls under Data visualization and Exploratory data analysis because it plots a correlation matrix for the given data. It also falls under Data munging because it rescales raw numeric data and handles outliers and missing values.

Who is the target audience and what are the scientific applications of this package?

The target audience is anyone who needs to quickly handle outliers and missing data in a raw data set, as well as anyone who needs to scale a dataframe or plot a correlation matrix in preparation for a machine learning task in Python.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

Several existing packages, such as scikit-learn and pandas, implement similar functionality, for example sklearn.preprocessing.StandardScaler() and sklearn.impute.SimpleImputer(). However, our scaler and imputer take a string argument that selects the method, so the user has a clearer picture of which specific method is being applied. Our package also provides a correlation map plotting function that lets users explore the correlation between columns in the dataframe without importing an extra package. In addition to removing outliers or replacing them with more sensible values, the outlier detection function can tell the user where the outliers are located in the dataframe.
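
The two ideas in this answer, choosing a scaling method by name and reporting where outliers sit, can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the package's code: the helper names `scale_column` and `outlier_positions`, the method names `"standard"` and `"minmax"`, and the z-score rule for outliers are all assumptions, and the package itself works on pandas dataframes rather than lists.

```python
from statistics import mean, pstdev


def scale_column(values, method="standard"):
    """Scale a list of numbers with the method named by `method`.

    Hypothetical helper; the method names are assumed for illustration.
    """
    if method == "standard":
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            raise ValueError("cannot standardise a constant column")
        return [(v - mu) / sigma for v in values]
    if method == "minmax":
        lo, hi = min(values), max(values)
        if hi == lo:
            raise ValueError("cannot min-max scale a constant column")
        return [(v - lo) / (hi - lo) for v in values]
    raise ValueError(f"unknown method: {method!r}")


def outlier_positions(values, threshold=3.0):
    """Return the indices whose absolute z-score exceeds `threshold`.

    The z-score rule is an assumption; other rules (e.g. IQR) are common too.
    """
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


standardised = scale_column([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
rescaled = scale_column([0.0, 5.0, 10.0], method="minmax")

# With only eight points a single extreme value cannot reach a z-score of 3,
# so a lower threshold is passed explicitly here.
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 95.0]
flagged = outlier_positions(readings, threshold=2.0)
```

Returning positions rather than a cleaned column leaves the decision to drop or replace the flagged values with the caller.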

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[ ] has an OSI-approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
- [ ] The package is deposited in a long-term repository with the DOI:

Note: Do not submit your package separately to JOSS

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[ ] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[ ] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5hrs

Review Comments

Hey Team!

Great job on this package. It looks like you put a lot of work in, and I like how the package as a whole has a very consistent theme. Here are a few comments from me below, divided into functionality and documentation sections.

Functionality:

Installation: I couldn't seem to install the package based on your installation instructions. I think you may have to update your installation instructions to this: $ pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple eda_utils_py. I tested that command and it now works.
Testing: It's really great to see that you have 100% test coverage. I tested some edge cases, such as entering the wrong data type or unsuitable arguments, and everything I tested seems to work fine and return a helpful error message.
Plotting: I like the color scheme you used. It suits a correlation plot well and ensures that a zero correlation shows up neutrally.
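
The kind of edge-case check described under Testing can be sketched as below. This is only an illustration of the pattern (reject bad input early with a message that names the problem); the validator `check_inputs`, the allowed strategy names, and the message wording are assumptions and are not taken from the package.

```python
ALLOWED_STRATEGIES = ("mean", "median")  # assumed names, for illustration


def check_inputs(values, strategy):
    """Raise a descriptive error for a wrong data type or unknown strategy."""
    if not isinstance(values, list):
        raise TypeError(
            f"values must be a list, got {type(values).__name__}"
        )
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(
            f"strategy must be one of {ALLOWED_STRATEGIES}, got {strategy!r}"
        )


def error_message(func, *args):
    """Call `func` and return its error as text, or None if it succeeded."""
    try:
        func(*args)
    except (TypeError, ValueError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


wrong_type = error_message(check_inputs, "not a list", "mean")
wrong_strategy = error_message(check_inputs, [1.0, 2.0], "mode")
valid = error_message(check_inputs, [1.0, 2.0], "median")
```

A test suite would typically assert on both the exception type and a fragment of the message, so that unhelpful wording is caught as a regression.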

Documentation:

It would have been nice to see all the authors of this package in the pyproject.toml file. It currently only has Chuang Wang listed as the author.
In the usage section of the README, it's probably more user-friendly to explicitly list all the arguments of each function (i.e. def imputer(df, strategy="mean", fill_value=None): instead of imputer(data_with_NA)). This helps users understand that the imputation function uses the mean strategy by default. It also would have been nice to list all the values a user can pass for each argument (i.e. mean, median, etc.). I didn't realize the function was so comprehensive until I looked into the function parameters.
In your function documentation in eda_utils_py.py, it looks like you switch between naming the function parameters df and dataframe (as one example). A more consistent naming convention across the functions would help your package feel more cohesive. I've noticed this also in your code comments: some areas are more closely documented than others.
A more in-depth discussion of the ecosystem may help users better understand exactly how this package differs from the existing ones and why they should use yours. I think you actually did this in the submission template and can probably copy and paste from there.

Let me know your thoughts and if anything needs clarification!

Heidi

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[ ] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc) - N/A
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[ ] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5 hours

Review Comments

Hi Team,

Great job on building the package. The hard work shows in putting up such a clean and consistent package in such a short amount of time. Below are some minor points of feedback that I hope help improve the package further.

The current installation command in the README gives an error. I tried the following command and that seemed to work: pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple eda_utils_py
Under the Usage section of the README, since most of the outputs were dataframe transformations, I found myself scrolling back and forth to the example dataframes to see what exactly had changed. It might be worth highlighting the change either through a visual or in a short explanation. The exact transformations became clear to me once I read the function documentation, which were written very well.
I was unable to find author contact details. You could add these in the Contributing section of the README by hyperlinking your names to your GitHub accounts and/or email addresses.
The links in the CONTRIBUTING.rst are broken. I think you want to replace those links with this: https://github.com/UBC-MDS/eda_utils_py/issues
It would be good to add all authors in the pyproject.toml file, like you have in the License file.
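
One way to highlight what a transformation changed, as suggested for the Usage section above, is to list only the cells that differ between the input and the output. The sketch below is a hypothetical, dependency-free illustration using dicts of columns instead of dataframes; the helper name `changed_cells` and the sample data are made up for this example.

```python
def changed_cells(before, after):
    """Return (row_index, column, old, new) for every cell that differs.

    Hypothetical helper; both tables are dicts mapping column name to a list.
    Missing values are represented as None so that equality checks behave.
    """
    changes = []
    for column in before:
        pairs = zip(before[column], after[column])
        for row, (old, new) in enumerate(pairs):
            if old != new:
                changes.append((row, column, old, new))
    return changes


# Made-up example: two missing cells filled in by some imputation step.
raw = {"age": [25, None, 40], "income": [50.0, 60.0, None]}
imputed = {"age": [25, 32.5, 40], "income": [50.0, 60.0, 55.0]}

diff = changed_cells(raw, imputed)
for row, column, old, new in diff:
    print(f"row {row}, {column}: {old} -> {new}")
```

Printing just these rows next to each README example would save readers from scrolling back to the original dataframe.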

I looked through the code and ran all the functions, which all worked as advertised. I thought the code was well written, with good docstrings and inline code comments. Testing was thorough, as shown by the 100% code coverage. I couldn't find anything code-related to give feedback on. Great job :)

Let me know if anything needs clarification.

Cheers, Nash
