UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: pyedahelper (Python) #26

Open scao1 opened 4 years ago

scao1 commented 4 years ago

Submitting Author: Ofer Mansour (@ofer-m), Suvarna Moharir (@suvarna-m), Subing Cao (@scao1), Manuel Maldonado (@manu2856)
Package Name: pyedahelper
One-Line Description of Package: A Python package that simplifies the main EDA procedures, such as outlier identification, data visualization, correlation, and missing data imputation.
Repository Link: https://github.com/UBC-MDS/pyedahelper
Version submitted: 0.1.13
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Sarah Weber (@sweber15)
Reviewer 2: Jarvis Nederlof (@jnederlo)
Archive: TBD
Version accepted: TBD


Description

Data understanding and cleaning represent 60% of a data scientist's time in any given project. The goal of this package is to simplify this process and make more efficient use of that time while working on some of the main procedures done in EDA (outlier identification, data visualization, correlation, missing data imputation).

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

The goal of this package is to simplify the process of data understanding and cleaning, and to make more efficient use of time while working on some of the main procedures done in EDA (outlier identification, data visualization, correlation, missing data imputation).

Our target audience is anyone who wants a quick understanding of their data through data cleaning and visualization. Our software provides an efficient and user-friendly solution for EDA.

At this time, there are several Python packages with similar functionality that are used during EDA. Nevertheless, most of these existing packages require multiple steps or produce results that could be simplified. In the pyedahelper package, the focus is to minimize the code a user needs to write to generate significant conclusions about outliers, missing data treatment, data visualization, and correlation computation and visualization. The usage sketch below illustrates the intended workflow.
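
To make the intended workflow concrete, here is a minimal usage sketch. The function names (fast_outlier_id, fast_missing_impute, fast_corr, fast_plot) come from the package, but the exact signatures shown are assumptions for illustration and may differ from the released API.

```python
import pandas as pd
import pyedahelper

# Toy dataframe with one missing value and one extreme value (illustrative only).
df = pd.DataFrame({
    "col_a": [1.0, 2.0, 3.0, 250.0],    # 250.0 stands in for an outlier
    "col_b": [10.0, None, 12.0, 13.0],  # one missing value
})

# Flag potential outliers in the selected columns (signature assumed).
outlier_summary = pyedahelper.fast_outlier_id(df, cols=["col_a"])

# Impute missing values with a chosen method (arguments taken from the review comments below).
imputed = pyedahelper.fast_missing_impute(df=df, method="mean", cols=["col_b"])

# Compute and visualize correlations between columns (signature assumed).
corr_plot = pyedahelper.fast_corr(df, ["col_a", "col_b"])

# Produce a quick plot of one column against another (signature assumed).
plot = pyedahelper.fast_plot(df, x="col_a", y="col_b", plot_type="scatter")
```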

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser, text-based review. It will also allow you to demonstrate addressing the issues via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

jnederlo commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 4


Review Comments

Feedback Suggestions

sweber15 commented 4 years ago

Package Review

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 3


Review Comments

Feedback Suggestions:

  1. In your README you state: "Data understanding and cleaning represents 60% of data scientist's time in any given project." It would be great if you could cite this, as it is a powerful statement of why your package is needed.

  2. Spell check the README because there are some minor grammar mistakes:

  3. Most error messages I got were helpful, like throwing an error when the input is not a dataframe. An error message that could use improvement is the one for the fast_missing_impute function. I put in a random method (`pyedahelper.fast_missing_impute(df=sample_data, method="none", cols=["col_a", "col_b"])`) and got `AssertionError: Not a valid method!`. The error does tell me what is wrong, but I wasn't sure what to put in instead. An error message like `Not a valid method! Change method to "mean", "median" or "mode"` (or whichever other methods you accept) would be clearer. You already do this for the fast_plot function when plot_type is not one of your accepted types. (A sketch of such a message follows this list.)

  4. The test function test_response_fast_outliers_id in test_fast_outlier_id.py is missing a docstring or any documentation. (A docstring sketch follows this list.)

  5. Your fast_corr function with a single column name produced an appropriate error message; however, testing it with one numeric and one non-numeric column produced a blank plot. You should include an additional test for the case where one numeric and one non-numeric column name are provided. (A test sketch follows this list.)

  6. The fast_plot function should add a title to the output plot. You should be able to build one with an f-string (e.g. f"{plot_type}"). You can read more on Real Python. (A title sketch follows this list.)

  7. The Read the Docs documentation for the fast_corr and fast_outlier_id functions is not being rendered correctly from your docstrings. The fast_corr issue may be due to using "Arguments" instead of "Parameters". It renders fine using ?pyedahelper.fast_corr in Python. (A docstring-format sketch follows this list.)
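
Regarding item 3, a minimal sketch of a more informative check inside fast_missing_impute; the accepted-method set and internal names below are assumptions, not the package's actual implementation.

```python
# Sketch only: the real fast_missing_impute internals are not shown in this thread.
VALID_METHODS = {"mean", "median", "mode"}  # assumed set of accepted methods


def fast_missing_impute(df, method, cols):
    # Listing the accepted values in the message tells the caller how to fix the call.
    assert method in VALID_METHODS, (
        f"Not a valid method! Choose one of: {', '.join(sorted(VALID_METHODS))}"
    )
    # ... imputation logic unchanged ...
```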
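
For item 4, a docstring like the one below would be enough; the wording is only a suggestion, since the actual test body is not reproduced in this thread.

```python
def test_response_fast_outliers_id():
    """Check that fast_outlier_id returns the expected summary for a known dataframe.

    This is a documentation sketch only; the existing test body in
    test_fast_outlier_id.py stays as it is.
    """
    ...  # existing assertions
```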
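
For item 5, a hedged pytest sketch of the suggested additional case. Whether fast_corr should raise, warn, or drop the non-numeric column is up to the authors, so the expected exception types here are only one possible choice.

```python
import pandas as pd
import pytest

import pyedahelper


def test_fast_corr_mixed_column_types():
    """fast_corr should not silently return a blank plot for mixed column types."""
    df = pd.DataFrame({"num": [1, 2, 3], "text": ["a", "b", "c"]})
    # One reasonable behaviour: raise a clear error instead of plotting nothing.
    with pytest.raises((AssertionError, TypeError, ValueError)):
        pyedahelper.fast_corr(df, ["num", "text"])
```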
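
For item 6, the title itself is a one-liner; how it is attached to the figure depends on the plotting backend fast_plot uses, which is not stated in this thread, so the final comment only lists common options.

```python
# Inside fast_plot (sketch): derive a readable title from the arguments.
plot_type = "scatter"            # whatever value the caller passed in
x_col, y_col = "col_a", "col_b"  # placeholder column names
title = f"{plot_type.capitalize()} plot of {y_col} vs {x_col}"
# Attach it with whichever backend fast_plot uses, e.g.
# alt.Chart(df, title=title) in Altair or ax.set_title(title) in matplotlib.
```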
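
For item 7, if the docstrings follow the NumPy convention that Sphinx is configured for, the section header needs to be Parameters rather than Arguments. A minimal docstring sketch, with parameter names assumed from the usage seen in this review:

```python
def fast_corr(df, cols):
    """Compute and plot correlations between the selected columns.

    Parameters
    ----------
    df : pandas.DataFrame
        The data to analyze.
    cols : list of str
        Names of the (numeric) columns to correlate.

    Returns
    -------
    altair.Chart or similar
        The correlation plot; the exact return type depends on the backend.
    """
    ...
```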

scao1 commented 4 years ago

Hi @jnederlo, thank you very much for your review. We have addressed the fourth point in your suggestions. Please feel free to track the changes at https://github.com/UBC-MDS/pyedahelper/pull/73. You can find the new release of our package here. Thank you for helping us improve our package!

scao1 commented 4 years ago

Hi @sweber15, thank you very much for your review. We have addressed the first three points in your suggestions. Please feel free to track the changes at https://github.com/UBC-MDS/pyedahelper/pull/72. You can find the new release of our package here. Thank you for helping us improve our package!