Submission: pyxplr (Python)

Submitting Author: Serhiy Pokrovskyy (@pokrovskyy), Braden Tam (@bradentam), Furqan Khan (@fkhan72), Yu Fang (@lori94)
Package Name: pyxplr
One-Line Description of Package: pyxplr is a Python package to make exploratory data analysis (EDA) simple and seamless.
Repository Link: https://github.com/UBC-MDS/pyxplr
Version submitted: v0.3.8
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Haoyu Su (@clsu22)
Reviewer 2: Sirine Chahma (@sirine-chahma)
Archive: TBD
Version accepted: TBD

Description

Include a brief paragraph describing what your package does:

pyxplr is a Python package that makes exploratory data analysis (EDA) simple and seamless. EDA is a crucial phase of the data science workflow because it gives us a first glimpse of the data. It is important to identify the statistical characteristics of the data so that researchers can properly set up the rest of the analysis. This package provides the tools required to conduct a thorough EDA.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [x] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [x] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The functions in our package can help generate a visualization for the data and give essential information about the data for EDA.

Who is the target audience and what are scientific applications of this package?

Researchers who are conducting EDA.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

Not sure yet.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

2 hours

Review Comments

Great package! The README is very easy to understand, the examples you used show clearly the purpose of your functions. Your package could help a lot of people!

Here are a few ideas that might improve your package:

In the Example section of your docstrings, I don't think you are supposed to include the output of the function. I would delete the example output in the explore_missing and explore_summary functions.
Some of your examples don't render well in the documentation on the readthedocs website. Try adding >>> at the beginning of each line of example code.
You should add docstrings to your test_explore_summary function. Moreover, I am not sure whether comments work as docstrings, so you may also want proper docstrings for the functions in the test_explore_feature_map.py file.
You should modify the explore_summary function so that it raises an error when the input is not a data frame. Then add a test in the test_explore_summary function asserting that your function throws an error when the input is not a data frame.
There is a typo in the explore_summary.py file on line 53: you wrote radom = random = ...
Still in the same function (explore_summary), you could use the map function together with a for loop to avoid repeating the same line of code several times when the only thing that changes is the function being applied. See here for an example of what I mean by using the map function.
Extra idea: it would be great if your explore_summary function could take a function as an argument (for example an anonymous function) and apply it to every column of the data frame. For example, if I wanted to count the number of negative values in each column of my data frame, it would be really cool if I could pass lambda x: np.sum(x < 0) or something similar to your function, and the output of explore_summary would then include a new row showing the number of negative values in each column.
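To make the explore_summary suggestions above concrete, here is a minimal sketch of how the type check, the loop over functions with map, and a user-supplied function could fit together. It assumes pandas and NumPy; the name summarize_sketch, the extra_funcs parameter, and the chosen statistics are hypothetical and are not taken from pyxplr.

```python
import numpy as np
import pandas as pd


def summarize_sketch(df, extra_funcs=None):
    """Hypothetical stand-in for explore_summary (not the package's code).

    Examples
    --------
    >>> df = pd.DataFrame({"a": [1, -2, 3], "b": [-4.0, -5.0, 6.0]})
    >>> summary = summarize_sketch(df, {"n_negative": lambda x: np.sum(x < 0)})
    """
    # Reject anything that is not a data frame.
    if not isinstance(df, pd.DataFrame):
        raise TypeError("input data type must be pandas.DataFrame")

    # One table of named statistics replaces one repeated line per statistic.
    funcs = {
        "mean": pd.Series.mean,
        "median": pd.Series.median,
        "min": pd.Series.min,
        "max": pd.Series.max,
    }
    # User-supplied functions (assumed to take a Series) become extra rows.
    funcs.update(extra_funcs or {})

    numeric = df.select_dtypes(include="number")
    columns = list(numeric.columns)
    rows = {}
    for name, func in funcs.items():
        # map applies the same function to every numeric column in turn.
        rows[name] = list(map(func, (numeric[col] for col in columns)))
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)


def test_rejects_non_dataframe():
    """The sketch raises TypeError when the input is not a data frame."""
    try:
        summarize_sketch([1, 2, 3])
    except TypeError:
        return
    raise AssertionError("expected a TypeError")
```

Keeping the statistics in a dictionary means a new row only needs a new entry, whether it is built in or passed by the caller.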

I hope you find my review useful! Please let me know if you have any questions or comments about my feedback.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[ ] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

2h

Review Comments

Overall, well done. This package meets all the expectations listed in the project's objectives. All of the functions are useful for reducing the time spent on EDA!

explore_summary: It is convenient and intuitive that the function directly distinguishes and returns the types of variables. One thing that could be improved is the naming of the descriptive statistics in your returned table. I think percentile labels such as 25% and 75% would be more straightforward than '1st' and '3rd'. It would also be better to return value counts for each category of the categorical variables. Moreover, the function should raise an error like "input data type must be pandas.DataFrame" when the input is not a data frame.
explore_outliers: Nice function for detecting outliers, but it would be better if the acceptable value ranges were indicated and the exact values of the outliers were listed as well, not just the count. In addition, I think your docstrings and function description should clarify that the function only applies to numeric variables rather than all variables, and that NA values are dropped during the calculation.
explore_missing: Also a useful function. I think it would be better to return an empty data frame when there are no missing values instead of raising an error. Also, this function can't identify empty strings (i.e. ""), which are also missing values for categorical variables. For your num_row argument, no error is raised when the value is out of range or negative, or when the input value is not an integer. For the data argument, it would be better to rename it to df to stay consistent with the argument in the other functions.
explore_feature_map: Nice and clear plots! This function is useful for numeric variables but seems to do nothing with categorical variables. I think you should clarify this in your function description. My suggestion is that you could also include categorical variables and use ANOVA test statistics or p-values to show the correlation between numeric variables and categorical variables. You could also use the chi-square test to find the correlation between two categorical variables. In addition, a wrong input type for the dataframe argument should raise a TypeError instead of a ValueError.
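As an illustration of the explore_missing points above, here is a minimal sketch that treats empty strings as missing, validates the num_row argument, and returns an empty data frame when nothing is missing. It assumes pandas and NumPy, and it assumes that num_row means the number of neighbouring rows to show around each row with a missing value, which may not match the package. The name missing_rows_sketch is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_rows_sketch(df, num_row=0):
    """Hypothetical stand-in for explore_missing (not the package's code)."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("input data type must be pandas.DataFrame")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(num_row, bool) or not isinstance(num_row, int):
        raise TypeError("num_row must be an integer")
    if num_row < 0 or num_row > len(df):
        raise ValueError("num_row must be between 0 and the number of rows")

    # Count both NaN/None and empty strings as missing.
    mask = df.isna() | df.eq("")
    missing_idx = np.flatnonzero(mask.any(axis=1).to_numpy())

    # Assumed meaning of num_row: keep that many rows above and below each hit.
    keep = set()
    for i in missing_idx:
        keep.update(range(max(0, i - num_row), min(len(df), i + num_row + 1)))

    # With no missing values this selects nothing and returns an empty frame.
    return df.iloc[sorted(keep)]
```

Returning an empty frame with the original columns lets callers check `.empty` instead of wrapping the call in a try/except.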

I hope my suggestions will be helpful for your future modifications.

@clsu22 & @sirine-chahma, thanks for the time you spent reviewing our package!

@clsu22, most of your feedback has been addressed. The only item still being discussed is the handling of categorical values in the explore_feature_map function. We left it as an open-ended issue here: https://github.com/UBC-MDS/pyxplr/issues/67

@sirine-chahma, your feedback has been addressed except for the last two items, which are still in progress. You can find the issue with the current progress here: https://github.com/UBC-MDS/pyxplr/issues/57

Thank you!
