UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages
1 stars 0 forks source link

Submission: eda_analysis(Python) #41

Open sgauravm opened 4 years ago

sgauravm commented 4 years ago

Submitting Author: Gaurav Sinha (@sgauravm ), Cheng (Marvin) Min (@marvinmin ), Yi (James) Liu (@v5y8 ), Sarah Weber (@sweber15 )
Package Name: eda_analysis One-Line Description of Package: The package automates exploratory data analysis process by providing function that generates a general exploratory data analysis report Repository Link: https://github.com/UBC-MDS/edapython Version submitted: v1.2.0 Editor: Varada Kolhatkar (@kvarada ) Reviewer 1: Mohammedalmojtaba Mohammed (@dataubc )
Reviewer 2: Tao Guo (@tguo9 )
Archive: TBD
Version accepted: TBD


Description

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

This package provides a single function which generates a general exploratory data analysis report. The package simplifies EDA tasks that require a lot of coding effort like describing the data, knowing NA values and plotting the distributions of the variables which are needed to understand the data well.

The target audience for our package includes anyone that wants to understand a data set specifically including data scientists and data analysts. The scientific applications of this package are that users can perform a simplified EDA on a dataframe without the intense coding.

There are other similar packages which can be used for EDA analysis. A package which does a similar thing is pandas profiling. Pandas profiling creates an HTML report, but our package will give output in the ongoing code.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks - [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

tguo9 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 10 hours


Review Comments

Reviews based on: v1.2.0

Screen Shot 2020-03-22 at 10 10 39 AM

2. `describe_cat_var`

    Pros: 
    > - Great histogram. Great labelling, clear format.
    > - I like the idea of making the histogram for a categorical variable since I always forgot to scale the axis.

    Suggestions:
    > - For the count, the y-axis, should be discrete instead of continuous. 
    > - Please fix the example in the docstring, it cannot run, the function name is different.

3. `describe_na_values`

    Pros: 
    > - Great ideas of getting the NA values from a dataframe. I remembered from R and I think this function is really useful for figuring out the missing data pattern.
    > - Nice function styles, well documented. Love it.

    Suggestions:
    > - The docstring example won't run, the function name changed. Also, the last two examples are passed in the same dataframe but got different results.
    > - For the result dataframe, only showing 0 and 1 as col name probably not very clear. I didn't try it, but I think if it is a very large dataframe, looking at 0s and 1s in the result probably isn't very clear. It will be great to have some summary result numbers.

Screen Shot 2020-03-22 at 11 35 04 AM

4. `describe_num_var`

    Pros: 
    > - Great detailed summaries. I really like this type of table, clear and clean report is definitely a plus.
    > - Nice plot, great visualizations, perfectly grid together.

    Suggestions:
    > - The docstring example won't run. Also, maybe add a bin adjustment argument, so users can adjust the bin themselves.
    > - For the count, the y-axis, should be discrete instead of continuous. 

5. `generate_report`

    Pros: 
    > - Great idea of putting everything together. I love this. Easy to use and really clear steps by steps.
    > - Nice separations between functions, so users can know what these steps did.

    Suggestions:
    > - The docstring example won't run. Also, the example is not for this function.
    > - Return a boolean is not really clear for the user, maybe a message will be better.
dataubc commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

I was pleased to have the opportunity to review this package. Overall, I would rank this package among the top 5, I will be using from now on for my data exploratory analysis, with a little effort it could be all I need!. All functions are well-documented, the installation went smoothly, and the examples were very helpful. The readme covers all the needed information about the package.

Here are a few suggestions that you might find useful for future improvements.

Order of EDA: The EDA implemented by eda_analysisis initiated by giving statistic of the NA in the data frame, which is a useful introductory step. However, I would prefer the EDA to start with looking at my data, for example, the head or the tail. Next, I would put the summary of the data and not the histograms.

Histograms: To check the functionality of the package I used credit-default-data that could be publically accessed. The produced histograms look nice and are correct. One minor thing is that need to be adjusted is the bins used. For example, in the credit-default-data, a couple of plots had bars that tend to go outside the plot or merge together as you could see below.

Histograms

Binning is also problematic for the numerical variables, all plots were forced to have the same axis which is not always appropriate, for example, age and Limit Balance in the credit-default-data don’t have the same range.

Numerical variable summary: Looks nice, but could be better formatted.

Correlation plot for numerical variable This is a very useful plot but looked very jammed in credit-default-data maybe this could be improved by reducing the font size of the correlation value since a colour label has already been provided.

plot_correlations

Edge cases: So I tried to replace values in credit-default-data with ‘NA’ and ‘nan’, but for some reason, these were not picked by the table which shows the number of ‘NA’.

Minor suggestions: 1- Returning True at the end of the analysis is not very informative for the user, I had to go and read the docs of the function to know what it means, consider removing it. 2- Would be nice if the fonts used in this report are consistent.

Nice to have: Things that could be added in the future : 1- Exploring relationships between categorical and numeric variables 2- Including Boxplots

In summary, this package great but could be awesome, by applying minor modifications it will be the package to go to make reproducible EDA and publishable reports. Nice work.

sgauravm commented 4 years ago

Thanks a lot for reviewing the package.There were a lot of useful comments provided by the reviewers. Some of them we selected and addressed are :

  1. Fixing inconsistency in the docstring and docstring example
  2. Allowing the plots to have different axes range for the function describe_num_var
  3. Decreasing the font size label of the correlation matrix

This is the link of the latest release in which all the above issues has been fixed: eda_analysis-v1.3.0