Open sgauravm opened 4 years ago
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
setup.py
file or elsewhere.Readme requirements The package meets the readme requirements below:
The README should include, from top to bottom:
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a paper.md
matching JOSS's requirements with:
Estimated hours spent reviewing: 10 hours
Reviews based on: v1.2.0
Documentations(readthedocs
)
Pros:
- Nice job for the documentations, great
readthedocs
pages. I have no trouble reading them,
Suggestions:
- There are couple inconsistency issues for
calc_cor
andgenerate_report
.The>>>
is not correctly highlighted.- For
describe_na_values
, used Pandas.DataFrame and rest of functions used (pandas.DataFrame. Probably change it just for consistence.
Functionalities
calc_cor
Pros:
- Great ideas of having this correlation matrix. I saw similar in R, but in Python. I think this one is a really cool feature here.
- Nice graph, clearly labelled graph, and great color choices.
Suggestions:
- If the numbers are large, the numbers will be stacked on top of each other. It will be great if you can fix it.
- Please fix the example in the docstring, it cannot run.
2. `describe_cat_var`
Pros:
> - Great histogram. Great labelling, clear format.
> - I like the idea of making the histogram for a categorical variable since I always forgot to scale the axis.
Suggestions:
> - For the count, the y-axis, should be discrete instead of continuous.
> - Please fix the example in the docstring, it cannot run, the function name is different.
3. `describe_na_values`
Pros:
> - Great ideas of getting the NA values from a dataframe. I remembered from R and I think this function is really useful for figuring out the missing data pattern.
> - Nice function styles, well documented. Love it.
Suggestions:
> - The docstring example won't run, the function name changed. Also, the last two examples are passed in the same dataframe but got different results.
> - For the result dataframe, only showing 0 and 1 as col name probably not very clear. I didn't try it, but I think if it is a very large dataframe, looking at 0s and 1s in the result probably isn't very clear. It will be great to have some summary result numbers.
4. `describe_num_var`
Pros:
> - Great detailed summaries. I really like this type of table, clear and clean report is definitely a plus.
> - Nice plot, great visualizations, perfectly grid together.
Suggestions:
> - The docstring example won't run. Also, maybe add a bin adjustment argument, so users can adjust the bin themselves.
> - For the count, the y-axis, should be discrete instead of continuous.
5. `generate_report`
Pros:
> - Great idea of putting everything together. I love this. Easy to use and really clear steps by steps.
> - Nice separations between functions, so users can know what these steps did.
Suggestions:
> - The docstring example won't run. Also, the example is not for this function.
> - Return a boolean is not really clear for the user, maybe a message will be better.
General Comments
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
setup.py
file or elsewhere.Readme requirements The package meets the readme requirements below:
The README should include, from top to bottom:
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a paper.md
matching JOSS's requirements with:
Estimated hours spent reviewing:
I was pleased to have the opportunity to review this package. Overall, I would rank this package among the top 5, I will be using from now on for my data exploratory analysis, with a little effort it could be all I need!. All functions are well-documented, the installation went smoothly, and the examples were very helpful. The readme covers all the needed information about the package.
Here are a few suggestions that you might find useful for future improvements.
Order of EDA:
The EDA implemented by eda_analysis
is initiated by giving statistic of the NA in the data frame, which is a useful introductory step. However, I would prefer the EDA to start with looking at my data, for example, the head or the tail. Next, I would put the summary of the data and not the histograms.
Histograms: To check the functionality of the package I used credit-default-data that could be publically accessed. The produced histograms look nice and are correct. One minor thing is that need to be adjusted is the bins used. For example, in the credit-default-data, a couple of plots had bars that tend to go outside the plot or merge together as you could see below.
Binning is also problematic for the numerical variables, all plots were forced to have the same axis which is not always appropriate, for example, age and Limit Balance in the credit-default-data don’t have the same range.
Numerical variable summary: Looks nice, but could be better formatted.
Correlation plot for numerical variable This is a very useful plot but looked very jammed in credit-default-data maybe this could be improved by reducing the font size of the correlation value since a colour label has already been provided.
Edge cases: So I tried to replace values in credit-default-data with ‘NA’ and ‘nan’, but for some reason, these were not picked by the table which shows the number of ‘NA’.
Minor suggestions: 1- Returning True at the end of the analysis is not very informative for the user, I had to go and read the docs of the function to know what it means, consider removing it. 2- Would be nice if the fonts used in this report are consistent.
Nice to have: Things that could be added in the future : 1- Exploring relationships between categorical and numeric variables 2- Including Boxplots
In summary, this package great but could be awesome, by applying minor modifications it will be the package to go to make reproducible EDA and publishable reports. Nice work.
Thanks a lot for reviewing the package.There were a lot of useful comments provided by the reviewers. Some of them we selected and addressed are :
This is the link of the latest release in which all the above issues has been fixed: eda_analysis-v1.3.0
Submitting Author: Gaurav Sinha (@sgauravm ), Cheng (Marvin) Min (@marvinmin ), Yi (James) Liu (@v5y8 ), Sarah Weber (@sweber15 )
Package Name: eda_analysis One-Line Description of Package: The package automates exploratory data analysis process by providing function that generates a general exploratory data analysis report Repository Link: https://github.com/UBC-MDS/edapython Version submitted: v1.2.0 Editor: Varada Kolhatkar (@kvarada ) Reviewer 1: Mohammedalmojtaba Mohammed (@dataubc )
Reviewer 2: Tao Guo (@tguo9 )
Archive: TBD
Version accepted: TBD
Description
Scope
* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
This package provides a single function which generates a general exploratory data analysis report. The package simplifies EDA tasks that require a lot of coding effort like describing the data, knowing NA values and plotting the distributions of the variables which are needed to understand the data well.
The target audience for our package includes anyone that wants to understand a data set specifically including data scientists and data analysts. The scientific applications of this package are that users can perform a simplified EDA on a dataframe without the intense coding.
There are other similar packages which can be used for EDA analysis. A package which does a similar thing is pandas profiling. Pandas profiling creates an HTML report, but our package will give output in the ongoing code.
@tag
the editor you contacted:Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Code of conduct
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
Editor and review templates can be found here