Submission: eda_analysis (Python)

Submitting Author: Gaurav Sinha (@sgauravm), Cheng (Marvin) Min (@marvinmin), Yi (James) Liu (@v5y8), Sarah Weber (@sweber15)
Package Name: eda_analysis
One-Line Description of Package: The package automates the exploratory data analysis process by providing a function that generates a general exploratory data analysis report.
Repository Link: https://github.com/UBC-MDS/edapython
Version submitted: v1.2.0
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Mohammedalmojtaba Mohammed (@dataubc)
Reviewer 2: Tao Guo (@tguo9)
Archive: TBD
Version accepted: TBD

Description

Exploratory data analysis (EDA) is an important step in any data analysis. General steps such as describing the data, identifying NA values, and plotting the distributions of the variables are routinely performed to understand the data, and they require a lot of coding effort. The package addresses this by providing a single function that generates a general exploratory data analysis report. The report contains distribution plots of the categorical and numerical variables, a correlation matrix, and a numerical summary for identifying NA values.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [ ] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences):

This package provides a single function that generates a general exploratory data analysis report. It simplifies EDA tasks that otherwise require a lot of coding effort, such as describing the data, identifying NA values, and plotting the distributions of the variables.

Who is the target audience and what are scientific applications of this package?

The target audience for our package is anyone who wants to understand a data set, in particular data scientists and data analysts. The scientific application of this package is that users can perform a simplified EDA on a dataframe without extensive coding.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

There are other packages that can be used for EDA. One that does something similar is pandas-profiling. pandas-profiling creates an HTML report, whereas our package returns its output directly in the running code session.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussions, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[ ] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements

The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 10 hours

Review Comments

Reviews based on: v1.2.0

Documentation (readthedocs)

Pros:
- Nice job on the documentation; the readthedocs pages are great and I had no trouble reading them.

Suggestions:
- There are a couple of inconsistencies in `calc_cor` and `generate_report`: the `>>>` prompt is not highlighted correctly.
- `describe_na_values` uses `Pandas.DataFrame`, while the rest of the functions use `pandas.DataFrame`. Consider changing it for consistency.

Functionalities

1. `calc_cor`

    Pros:
    > - Great idea to include this correlation matrix. I have seen something similar in R, but not in Python. I think it is a really cool feature.
    > - Nice, clearly labelled graph with great color choices.

    Suggestions:
    > - If the values are large, the numbers overlap each other. It would be great if you could fix this.
    > - Please fix the example in the docstring; it does not run.

[Screenshot: Screen Shot 2020-03-22 at 10 10 39 AM]

2. `describe_cat_var`

    Pros: 
    > - Great histogram. Great labelling and a clear format.
    > - I like the idea of making the histogram for a categorical variable, since I always forget to scale the axis.

    Suggestions:
    > - The y-axis shows counts, so it should be discrete rather than continuous.
    > - Please fix the example in the docstring; it does not run because the function name is different.

3. `describe_na_values`

    Pros: 
    > - Great idea to extract the NA values from a dataframe. I remember something similar from R, and I think this function is really useful for figuring out the pattern of missing data.
    > - Nice function style and well documented. Love it.

    Suggestions:
    > - The docstring example does not run because the function name changed. Also, the last two examples pass in the same dataframe but show different results.
    > - In the result dataframe, showing only 0 and 1 as column names is probably not very clear. I didn't try it, but for a very large dataframe, looking at 0s and 1s in the result is probably hard to read. It would be great to have some summary numbers.

[Screenshot: Screen Shot 2020-03-22 at 11 35 04 AM]
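
As a rough illustration of the summary numbers suggested for `describe_na_values`, the sketch below computes per-column missing-value counts and percentages with pandas. The helper name `summarize_na`, its output column names, and the example data are hypothetical and not part of eda_analysis; this is one possible shape for such a summary, not the package's implementation.

```python
import numpy as np
import pandas as pd


def summarize_na(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column missing-value counts and percentages."""
    na_count = df.isna().sum()
    summary = pd.DataFrame(
        {
            "na_count": na_count,
            "na_percent": (na_count / len(df) * 100).round(1),
        }
    )
    # Columns with the most missing values come first.
    return summary.sort_values("na_count", ascending=False)


# Made-up example data, for illustration only.
example = pd.DataFrame(
    {
        "age": [25, np.nan, 31, np.nan],
        "city": ["Vancouver", "Toronto", None, "Calgary"],
        "score": [1.0, 2.0, 3.0, 4.0],
    }
)
print(summarize_na(example))
```

A table like this stays readable for wide or long dataframes, where a cell-by-cell grid of 0s and 1s does not.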

4. `describe_num_var`

    Pros: 
    > - Great detailed summaries. I really like this type of table; a clear and clean report is definitely a plus.
    > - Nice plots and great visualizations, arranged neatly in a grid.

    Suggestions:
    > - The docstring example does not run. Also, consider adding a bin adjustment argument so users can adjust the bins themselves.
    > - The y-axis shows counts, so it should be discrete rather than continuous.

5. `generate_report`

    Pros: 
    > - Great idea to put everything together. I love this. It is easy to use and really clear step by step.
    > - Nice separation between functions, so users can see what each step does.

    Suggestions:
    > - The docstring example does not run. Also, the example is not for this function.
    > - Returning a boolean is not very clear to the user; a message might be better.
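
Two of the plotting suggestions above (a discrete count axis for `describe_cat_var` and `describe_num_var`, and a user-adjustable bin argument for `describe_num_var`) could look roughly like the sketch below. It assumes matplotlib purely for illustration; the package may use a different plotting library, and `plot_count_histogram` and its `bins` parameter are hypothetical names.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator


def plot_count_histogram(values, bins=10):
    """Hypothetical helper: histogram with a user-set bin count and integer count ticks."""
    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    # Counts are whole numbers, so restrict the y-axis ticks to integers.
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    ax.set_xlabel("value")
    ax.set_ylabel("count")
    return fig, ax


# Made-up data; the caller chooses the number of bins.
fig, ax = plot_count_histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5)
```

Exposing `bins` as a keyword argument with a default keeps the existing call signature working while letting users tune the histogram.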

General Comments
1. Great job! Everything works fine for me.
2. Clear and detailed documentation.
3. Nice highlights of self-merge PR issues.
4. It would be great to include all contributors in the README.
5. Consider keeping the GitHub Actions versioning and the latest version in sync.
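
Several of the comments above concern docstring examples that no longer run after a function was renamed. One way to catch this automatically is to execute the docstring examples with the standard-library `doctest` module, for example as part of the test suite. The sketch below uses two hypothetical stand-in functions (`count_missing` and `describe_values`), not functions from eda_analysis.

```python
import doctest


def count_missing(values):
    """Count the entries in ``values`` that are ``None``.

    Examples
    --------
    >>> count_missing([1, None, 3, None])
    2
    >>> count_missing([])
    0
    """
    return sum(value is None for value in values)


def describe_values(values):
    """Return the number of values.

    The example below deliberately calls a stale name to mimic a rename
    that was not propagated to the docstring.

    >>> describe_vals([1, 2])
    2
    """
    return len(values)


def run_docstring_examples(func):
    """Run the ``>>>`` examples in ``func``'s docstring; return (failed, attempted)."""
    runner = doctest.DocTestRunner(verbose=False)
    finder = doctest.DocTestFinder()
    # Only the function itself is made visible to the examples.
    for test in finder.find(func, globs={func.__name__: func}):
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted


print(run_docstring_examples(count_missing))    # all examples pass
print(run_docstring_examples(describe_values))  # the stale name is reported as a failure
```

Running a check like this in continuous integration would flag a broken example as soon as a function is renamed.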

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).