UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages
1 stars 0 forks source link

Submission: pyxplr (Python) #34

Open lori94 opened 4 years ago

lori94 commented 4 years ago

Submitting Author: Serhiy Pokrovskyy(@pokrovskyy ), Braden Tam(@bradentam ), Furqan Khan(@fkhan72 ), Yu Fang(@lori94) Package Name: pyxplr One-Line Description of Package: pyxplr is a Python package to perform explanatory data analysis (EDA) simple and seamless. Repository Link: https://github.com/UBC-MDS/pyxplr Version submitted: v0.3.8 Editor: Varada Kolhatkar (@kvarada ) Reviewer 1: Haoyu Su (@clsu22)
Reviewer 2: Sirine Chahma (@sirine-chahma)
Archive: TBD
Version accepted: TBD


Description

pyxplris a python package to make exploratory data analysis (EDA) simple and seamless. EDA is a crucial phase of the data science workflow as it allows us get a fist glimpse of the data. It is important to identify statistical characteristics of the data so that researchers can properly set up the rest of the analysis. This package will provide the tools required to conduct a thorough EDA.

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

The functions in our package can help generate a visualization for the data and give essential information about the data for EDA.

Researchers that are conducting the EDA.

Not sure yet.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks - [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

sirine-chahma commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing:

2 hours


Review Comments

Great package! The README is very easy to understand, the examples you used show clearly the purpose of your functions. Your package could help a lot of people!

Here are a few ideas to maybe improve your package :

I hope you will find my review useful! Please, let me know if you have any questions/comment on my feedbacks

clsu22 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing:

2h

Review Comments

Overall, well done. This package meets all the expectations listed in the project's objectives. All functions are useful to reduce time in EDA!

  1. explore_summary It is convenient and intuitive to directly distinguish and return the types of variables. Something should be improved is the name of descriptive statistics in your returned table. I think using percentile percentage as the name would be more straightforward such as 25% and 75% instead of '1st' and '3rd'. Also, it would be better to return values count for each category for categorical variables. Moreover, should raise an error like "input data type must be pandas.DataFrame" when input is not a data frame.
  2. explore_outliers Nice function to detect outliers but it would be better if available value ranges could be indicated and the exact values of outliers could be listed at the same time, not just the count. In addition, I think you should clarify that your function is only for numeric variables instead of all variables and you drop NA in the process of calculation in your docstrings and function description.
  3. explore_missing Also a useful function. I think it would be better if an empty data frame returned when there are no missing values instead of raising an error. Also, this function can't identify empty strings (i.e. "") which are also missing values for categorical variables. For your num_row argument, there is no error raised when the value is out of scope or negative and when the input value is not an integer. For the data argument, it would be better if it could be changed to df to keep consistent with the argument in other functions.
  4. explore_feature_map Nice and clear plots! This function is useful for numeric variables but seems to do nothing with categorical variables. I think you should clarify this in your function description. What's my suggestion is that you could also include categorical variables and use ANOVA test statistics or p-value to show the correlation between numeric variables and categorical variables. Also could do the chi-square test to find the correlation between two categorical variables. In addition, wrong input type for the dataframe argument should be a TypeError instead of a ValueError.

I hope my suggestions would be helpful for your future modification.

pokrovskyy commented 4 years ago

@clsu22 & @sirine-chahma, thanks for the time you spent reviewing our package!

@clsu22, most of your feedback has been addressed. The only item that is still being discussed is regarding the categorical values handling for the explore_feature_map function. We left it as an open-ended issue here https://github.com/UBC-MDS/pyxplr/issues/67

@sirine-chahma, your feedback has been addressed except the last two items that are still in progress. You can find the issue with current progress here https://github.com/UBC-MDS/pyxplr/issues/57

Thank you!