UBC-MDS / software-review-2021


Submission: eda_utils_py (Python) #31

Open wangjc640 opened 3 years ago

wangjc640 commented 3 years ago

Package Info

Submitting Author:

Package Name: eda_utils_py

One-Line Description of Package: Fast way of dealing with outlier and missing values, scaling, and correlation visualization.

Repository Link: eda_utils_py

Version submitted: 0.1.29

Editor: TBD

Reviewer 1: TBD

Reviewer 2: TBD

Archive: TBD

Version accepted: TBD


DESCRIPTION

Because data rarely comes ready to be analyzed or used for machine learning right away, this package aims to speed up data cleaning and initial exploratory data analysis (EDA). The package focuses on handling outliers and missing values, scaling, and correlation visualization.

The four functions contained in this package are as follows:

Scope

*Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

The package falls under Data visualization and Exploratory data analysis because it plots the correlation matrix of the given data. It also falls under Data munging because it rescales raw numeric data and removes or replaces outliers and missing data.

The target audience is anyone who needs to quickly handle outliers and missing data in a raw data set, as well as anyone who needs to scale a data frame or plot a correlation matrix to prepare for a machine learning task in Python.

Several existing packages, such as scikit-learn and pandas, implement similar functionality, for example sklearn.preprocessing.StandardScaler() and sklearn.preprocessing.Imputer(). However, our scaler and imputer differ by taking a string as the input for choosing the method, so the user gets a clearer picture of which specific method is used for the task. Our package also includes a correlation map plotting function that helps users explore the correlation between the columns of a dataframe without importing an extra package. In addition to removing outliers or replacing them with more sensible values, the outlier detection function can tell the user where the outliers are in the dataframe.
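To illustrate the string-based method selection described above, here is a minimal sketch. It is not the package's implementation: the function name `scale_values`, the method names `"standard"` and `"minmax"`, and the use of plain lists instead of a dataframe are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def scale_values(values, method="standard"):
    """Hypothetical helper: scale a list of numbers by a named method."""
    if method == "standard":
        # Center on the mean and divide by the population standard deviation.
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            raise ValueError("cannot standardize constant values")
        return [(v - mu) / sigma for v in values]
    if method == "minmax":
        # Map the smallest value to 0 and the largest to 1.
        lo, hi = min(values), max(values)
        if hi == lo:
            raise ValueError("cannot min-max scale constant values")
        return [(v - lo) / (hi - lo) for v in values]
    raise ValueError(f"unknown method: {method!r}; expected 'standard' or 'minmax'")


print(scale_values([1.0, 2.0, 3.0], method="minmax"))  # [0.0, 0.5, 1.0]
```

Passing the method as a string keeps the choice visible at the call site, at the cost of needing an explicit error for unrecognized names.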

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an obvious research application according to JOSS's definition in their [submission requirements][submit]. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][submit]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a paper.md matching [JOSS's requirements][jr] with a high-level description in the package root or in inst/.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

P.S. *Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

heidi-ye commented 3 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1.5hrs


Review Comments

Hey Team!

Great job on this package. It looks like you put a lot of work in, and I like how the package as a whole has a very consistent theme. Here are a few comments from me below, divided into a functionality section and a documentation section.

Functionality:

  1. Installation: I couldn't install the package using your installation instructions. I think you may need to update them to: $ pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple eda_utils_py (I tested that command and it works).

  2. Testing: It's really great to see that you have 100% test coverage. I tested some edge cases, such as entering the wrong data type or unsuitable arguments, and everything I tested works fine and returns a helpful error message.

  3. Plotting: I like the color scheme you used. It suits a correlation plot well and ensures that a zero correlation shows up neutrally.
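The kind of edge-case check mentioned in the Testing point can be sketched as follows. This is only an illustration: `check_numeric_column` and `raises` are hypothetical helpers written for this example, not functions from the package, and they work on plain lists rather than dataframes.

```python
def check_numeric_column(values):
    """Hypothetical validator: reject non-list input and non-numeric entries."""
    if not isinstance(values, list):
        raise TypeError(f"expected a list of numbers, got {type(values).__name__}")
    for v in values:
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError(f"non-numeric entry: {v!r}")
    return values


def raises(exc_type, func, *args):
    """Return the error message if func(*args) raises exc_type, else None."""
    try:
        func(*args)
    except exc_type as err:
        return str(err)
    return None


# Wrong container type and wrong element type should both fail with a clear message.
print(raises(TypeError, check_numeric_column, "not a list"))
print(raises(TypeError, check_numeric_column, [1, "two", 3]))
```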

Documentation:

  1. It would have been nice to see all the authors of this package in the pyproject.toml file. It currently only has Chuang Wang listed as the author.

  2. In the usage section of the README, it would probably be more user friendly to explicitly list all the arguments of each function (i.e. def imputer(df, strategy="mean", fill_value=None): instead of imputer(data_with_NA)). This helps users understand that the imputation function uses the mean strategy by default. It may also have been nice to list all the values a user can pass for each argument (e.g. mean, median, etc.). I didn't realize the function was so comprehensive until I looked into the function parameters.

  3. In your function documentation in eda_utils_py.py, it looks like you switch between naming the same parameter df and dataframe (as one example). More consistent naming conventions across the functions would help your package feel more cohesive. I noticed the same unevenness in your code comments: some areas are more closely documented than others.

  4. A more in-depth discussion of the ecosystem may help users understand exactly how this package differs from the existing ones and why they should use yours. I think you actually did this in the submission template and can probably copy and paste from there.
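To make the README suggestion in point 2 concrete, here is a small sketch of a usage example that spells out every argument. The parameter names and the "mean" default mirror the signature quoted above; everything else is an assumption for illustration, including the name `impute_values`, the `"constant"` strategy, and the use of a plain list with `None` for missing values instead of a dataframe.

```python
from statistics import mean, median


def impute_values(values, strategy="mean", fill_value=None):
    """Hypothetical stand-in: replace None entries in a list of numbers.

    strategy: "mean" (default), "median", or "constant" (uses fill_value).
    """
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "constant":
        if fill_value is None:
            raise ValueError("fill_value is required when strategy='constant'")
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


data_with_na = [1.0, None, 3.0, 8.0]
print(impute_values(data_with_na))                     # default strategy is "mean"
print(impute_values(data_with_na, strategy="median"))
print(impute_values(data_with_na, strategy="constant", fill_value=0.0))
```

Showing the explicit calls next to the bare default call lets a reader see both the default behavior and the available options without opening the source.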

Let me know your thoughts and if anything needs clarification!

Heidi

nashmakh commented 3 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1.5 hours


Review Comments

Hi Team,

Great job on building the package. The hard work shows in putting up such a clean and consistent package in such a short amount of time. Below are some minor points of feedback that I hope help improve the package further.

I looked through the code and ran all the functions, which all worked as advertised. I thought the code was well written, with good docstrings and inline code comments. Testing was thorough, as shown by the 100% code coverage. I couldn't find anything code-related to give feedback on. Great job :)

Let me know if anything needs clarification.

Cheers, Nash