UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: PyDataPeek (Python) #8

Open alistair-clark opened 4 years ago

alistair-clark commented 4 years ago

Submitting Author: Monique Wong (@moniquewong), Alistair Clark (@alistair-clark), Miro Hu (@mirohu), Thomas Pin (@MrThomasPin)
Package Name: PyDataPeek
One-Line Description of Package: Simple EDA for .csv or .xlsx documents
Repository Link: Repo Link
Version submitted:
Editor: @kvarada
Reviewer 1: Elliott Ribner @elliott-ribner
Reviewer 2: Aman Kumar Garg @amank90
Archive: TBD
Version accepted: TBD


Description

PyDataPeek is a package that enables data scientists to efficiently generate a visual summary of a dataset. It includes functions that show the size of the dataset, a visual summary of missing data, a sample of the dataset with its data types, and exploratory visualizations for quantitative and qualitative data.

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences):

Who is the target audience and what are scientific applications of this package?

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

Several Python packages support exploratory data analysis, but none are specific to the use case targeted here: a simple, low-barrier way of summarizing data.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option allows reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser text-based review. It also allows you to demonstrate that an issue has been addressed via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

amank90 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in the comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements

The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

Feedback 1:

pip install pydatapeek --extra-index-url=https://test.pypi.org/simple/

Feedback 2:

Feedback 3:

Feedback 4:

from PyDataPeek import missing_data_overview
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
        {'x1': [1.,2,3,4],
         "x2": ["a","b","","d"]})

plt.show(missing_data_overview._make_plot(df))

I don't get any missing values in the plot, but when I do this:

from PyDataPeek import missing_data_overview
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
        {'x1': [1.,2,3,4],
         "x2": ["a","b",None,"d"]})

plt.show(missing_data_overview._make_plot(df))

I start seeing the missing values.
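The difference between the two snippets is consistent with how pandas itself defines missingness: an empty string is an ordinary string value, while `None` is treated as missing. The sketch below illustrates this with plain pandas only. The `replace` step is a hypothetical pre-processing option, not something PyDataPeek is known to do.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 2, 3, 4], "x2": ["a", "b", "", "d"]})

# pandas does not count an empty string as a missing value
missing_before = int(df["x2"].isna().sum())

# Hypothetical pre-processing step (an assumption, not the package's behavior):
# convert empty or whitespace-only strings to NaN before summarizing
cleaned = df.replace(r"^\s*$", np.nan, regex=True)
missing_after = int(cleaned["x2"].isna().sum())

print(missing_before, missing_after)
```

Whether empty strings should count as missing is a design choice for the package authors; documenting the current behavior would also address the confusion.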

Feedback 5:

In sample_data.py, the function takes a sample record that potentially has no missing values. If the function is expected to receive data with no missing values, then you can ignore this.

results = pd.DataFrame({'sample_record': df.iloc[1]})
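For context, `df.iloc[1]` always selects the second row, because `iloc` is zero-based, so whether the sample record contains missing values depends entirely on that one row. The following is a minimal sketch, using plain pandas and a made-up DataFrame, of checking the fixed row and of one hypothetical alternative that prefers a fully populated row. It is an illustration under those assumptions, not the package's implementation.

```python
import pandas as pd

# Illustrative data (an assumption): the second row has a missing value
df = pd.DataFrame({"x1": [1.0, None, 3.0], "x2": ["a", "b", "c"]})

# The quoted line always picks the second row
fixed_record = df.iloc[1]
fixed_has_missing = bool(fixed_record.isna().any())

# Hypothetical alternative: use the first row without missing values,
# falling back to the first row if every row has a gap
complete_rows = df.dropna()
record = complete_rows.iloc[0] if not complete_rows.empty else df.iloc[0]
results = pd.DataFrame({"sample_record": record})

print(fixed_has_missing)
print(results)
```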

Feedback 6:

In the word cloud generating function, the column name also shows up in the cloud. Is it possible to exclude it?


import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
from PyDataPeek import word_bubble

df = pd.DataFrame(
        {'x1': ["play cricket","game","amazing","joke"]})

formated_words, stopwords = word_bubble._make_formated_words(df)

plt.show(word_bubble._make_cloud(formated_words, stopwords, 10, 100, 100))
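
One possible cause, stated here as an assumption because the internals of `_make_formated_words` are not shown in this thread, is that the whole DataFrame is converted to text, which carries the header along with the values. The sketch below uses plain pandas to show the difference between stringifying the DataFrame and joining only the cell values.

```python
import pandas as pd

df = pd.DataFrame({"x1": ["play cricket", "game", "amazing", "joke"]})

# Stringifying the whole DataFrame includes the column header "x1"
as_string = df.to_string()

# Joining only the cell values leaves the column name out
words_only = " ".join(df["x1"].astype(str))

print("x1" in as_string, "x1" in words_only)
```

Passing text built like `words_only` to the word cloud would keep column names out of the output, provided the cause is as assumed here.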
ribner commented 4 years ago

Reviewer: Elliott

Package Review

Documentation

The package includes all the following forms of documentation:

Readme requirements

The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

Collecting pydatapeek
  Downloading https://test-files.pythonhosted.org/packages/1e/27/5a49ffb2261be9541e88d0ae9e076862e2a8029d779a78812a5f210f850f/pydatapeek-0.1.9-py3-none-any.whl
ERROR: Could not find a version that satisfies the requirement altair_saver<0.2.0,>=0.1.0 (from pydatapeek) (from versions: none)
ERROR: No matching distribution found for altair_saver<0.2.0,>=0.1.0 (from pydatapeek)

Final approval (post-review)

Estimated hours spent reviewing: 4

Review Comments

Altogether, great job on the project. I think there are many useful features contained in the package, and it is well implemented! I found the code and structure well written and well documented. I found very few points to improve, but if time allows, there are three things worth noting:

Thank you for your time.

Thanks,

Elliott

moniquewong commented 4 years ago

@amank90 Thanks for your feedback. Below are some comments and updates we have made to our package based on your feedback.

moniquewong commented 4 years ago

@elliott-ribner Thanks for your feedback. Below are some comments and updates we have made to our package based on your feedback.