Submission: aridanalysis (Python)

Submitting Authors:

Santiago Rugeles (@ansarusc)
Neel Phaterpekar (@nphaterp)
Daniel Ortiz (@danielon-5)
Craig McLaughlin (@cmmclaug)

Package Name: aridanalysis
One-Line Description of Package: DRY out your regression analysis!
Repository Link: https://github.com/UBC-MDS/aridanalysis_py
Version submitted: 0.4.2
Editor: Tiffany Timbers (@ttimbers)
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD

Description

As data scientists, being able to perform exploratory data analysis (EDA) and regression analysis is paramount to analyzing trends in data. Moreover, following the DRY (Don't Repeat Yourself) principle is regarded as a major priority for maximizing code quality. Yet data scientists facing these tasks often start the entire process from scratch, wasting time and effort while compromising code quality. The aridanalysis package strives to remedy this problem by giving users an easy-to-implement EDA function alongside 3 robust statistical tests that simplify these analytical processes and produce an easy-to-read interpretation of the input data. Users will no longer have to write many lines of code to explore their data effectively.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [x] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences):
- The aridanalysis package falls under data munging because it provides functions that perform exploratory data analysis and build regression models from user-supplied data.
- The aridanalysis package is considered a data visualization package because it provides an arid_eda function that produces a number of data visualizations from the data.
Who is the target audience and what are scientific applications of this package?
- The target audience is machine learning enthusiasts who want to go beyond scikit-learn models and explore inferential questions with R-style statistical models.
Are there other Python packages that accomplish the same thing? If so, how does yours differ?
- Yes, the statsmodels package provides a large library of R-style statistical models and functions. Our package differs in that it offers a focused, simplified interface while also providing an associated scikit-learn model, so that users can draw on both predictive and inferential models.
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[ ] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[ ] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2 hours

Review Comments

Hi aridanalysis team,

Congratulations on completing your package. I think the team did a great job, and I really like the idea behind the package and how it can make a data scientist's life easier. I enjoyed reviewing your work. Below are my suggestions, which I hope will be helpful:

The package functions are elaborate and easy to understand; in particular, I love seeing your team's motivation for creating this package.
It is always good to see related packages alongside yours, so I can see the differences and the pros and cons.

Some suggestions I found in the package:

I don't see any docs when I click the docs badge that leads to the Read the Docs website.
In the usage section, I cannot run all of the sample code after installing the package:
- maybe the functions need to be called like aa.aridanalysis.arid_eda()
- the house_prices data set is not imported
- it might be better to show the output of all sample function calls, e.g. display the EDA plots and show the regression results
The comments in the tests are not as detailed as those in the source code.
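
To illustrate the import point above, here is a minimal sketch of why a plain package import may not expose the functions. It assumes a layout in which the functions live in a same-named submodule and the package's `__init__.py` does not re-export them; `demo_arid` and its `arid_eda` stub are hypothetical stand-ins built in a temporary directory, not the real aridanalysis code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package: a directory with an empty __init__.py and a
# same-named submodule that defines the function (hypothetical stand-in).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_arid"
pkg.mkdir()
(pkg / "__init__.py").write_text("")  # nothing is re-exported here
(pkg / "demo_arid.py").write_text("def arid_eda():\n    return 'eda called'\n")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_arid as aa_pkg  # imports only the package, not the submodule

try:
    aa_pkg.arid_eda()
    top_level_error = None
except AttributeError as err:
    # The package object has no arid_eda attribute of its own.
    top_level_error = str(err)

from demo_arid import demo_arid as aa  # imports the submodule with the function

result = aa.arid_eda()
print(top_level_error)
print(result)
```

If the layout is as assumed, one possible alternative to changing the documented import would be to re-export the functions in `__init__.py`, so that the shorter `import ... as aa` form also works.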

Congrats again and well done.

Zhiyong

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[ ] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5h

Review Comments

Hi aridanalysis team,

Congratulations on completing your package! I really enjoyed reviewing it. Below are some suggestions, which I hope will be useful for further improvement:

I followed the instructions in README.md to install the package and ran the examples first, but the function reported the error "module 'aridanalysis' has no attribute 'arid_eda'". To use the package, we should write from aridanalysis import aridanalysis as aa rather than import aridanalysis as aa. It might be better to clarify this in the usage instructions.
It might be better to test the workflow in the usage section of README.md: a single quote is missing in line 5, there is a redundant blank in the last df, and some inputs should be strings rather than variables. It might also be better to include the expected output in the usage section.
When I used the example to test the arid_linreg function, an error was reported: name 'y' is not defined. The error occurs because the input here should be a list of str. It might be better to replace the old example in the usage section with a correct one and to add a proper error message that flags this type error.
Great job setting up so many tests and reaching a coverage rate higher than 90%! However, it might be better to add more tests of input checking in the main functions, so that a function can stop early when the input is incorrect.
It might be better if the output plot were displayed automatically; the current output is alt.HConcatChart(...).
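
As a rough illustration of the early input checks suggested above, the sketch below validates column-name arguments before any modelling starts. The helper name `check_column_args`, its parameters, and the column names are all hypothetical and are not taken from the aridanalysis source. Note that a check like this cannot catch the `name 'y' is not defined` error itself, since Python raises that NameError before the function is even called; it only guards against values of the wrong type or unknown column names.

```python
def check_column_args(columns, features, response):
    """Hypothetical helper: validate column-name arguments before model fitting."""
    if not isinstance(features, list) or not all(isinstance(f, str) for f in features):
        raise TypeError(
            f"features must be a list of column-name strings, got {features!r}"
        )
    if not isinstance(response, str):
        raise TypeError(f"response must be a column-name string, got {response!r}")
    # Report every unknown column at once so the user can fix them in one pass.
    missing = [name for name in features + [response] if name not in columns]
    if missing:
        raise KeyError(f"columns not found in the data: {missing}")
    return True


columns = ["sqft", "bedrooms", "price"]  # made-up column names for illustration

# Valid call: feature names as a list of strings, response as a string.
ok = check_column_args(columns, ["sqft", "bedrooms"], "price")

# Invalid call: a bare string instead of a list triggers the early, readable error.
try:
    check_column_args(columns, "sqft", "price")
    message = None
except TypeError as err:
    message = str(err)

print(ok)
print(message)
```

Tests for such checks could then assert that each kind of bad input raises the expected exception type, which would extend the existing suite to cover the input-handling paths.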

Hi @wiwang, thank you for taking the time to help us review our package! I will attempt to address your comments below:

The package functions are elaborate and easy to understand; in particular, I love seeing your team's motivation for creating this package.

I'm glad you found our descriptions and motivation clear! It was certainly important to clearly define our package's role within the Python regression ecosystem!

It is always good to see related packages alongside yours, so I can see the differences and the pros and cons.

Thanks! Again, with a crowded Python regression package ecosystem, we thought it was important to delineate our package's niche within the landscape.

I don't see any docs when I click the docs badge that leads to the Read the Docs website.

Great catch! This is an important one; I've created an issue to get to the bottom of it: https://github.com/UBC-MDS/aridanalysis_py/issues/89

In the usage section, I cannot run all of the sample code after installing the package:

maybe the functions need to be called like aa.aridanalysis.arid_eda()

the house_prices data set is not imported

it might be better to show the output of all sample function calls, e.g. display the EDA plots and show the regression results

Thanks for helping iron out these discrepancies. You're right; the other reviewer's comments also indicate that we need to update our vignette import statement! I've created an issue to resolve this here: https://github.com/UBC-MDS/aridanalysis_py/issues/90. We should also add a line to import the house_prices dataset; I've appended that to the same issue.

The comments in the tests are not as detailed as those in the source code.

This is a good point and something to consider as a future improvement to our test suite!

Thanks again for helping us improve aridanalysis_py! If you have any further questions about our package, don't be shy!

aridanalysis team

Hello @yhchen20, thank you for reviewing our package! I'll try to respond to your insights below:

I followed the instructions in README.md to install the package and ran the examples first, but the function reported the error "module 'aridanalysis' has no attribute 'arid_eda'". To use the package, we should write from aridanalysis import aridanalysis as aa rather than import aridanalysis as aa. It might be better to clarify this in the usage instructions.

Agreed! We have an issue here to fix this: https://github.com/UBC-MDS/aridanalysis_py/issues/90

It might be better to test the workflow in the usage section of README.md: a single quote is missing in line 5, there is a redundant blank in the last df, and some inputs should be strings rather than variables. It might also be better to include the expected output in the usage section.

Thanks for including such a detailed analysis! I've added the suggested vignette fixes to the previous issue!

When I used the example to test the arid_linreg function, an error was reported: name 'y' is not defined. The error occurs because the input here should be a list of str. It might be better to replace the old example in the usage section with a correct one and to add a proper error message that flags this type error.

The vignette seems to be out of date! We're going to combine all of the vignette resolution fixes into https://github.com/UBC-MDS/aridanalysis_py/issues/90.

Great job setting up so many tests and reaching a coverage rate higher than 90%! However, it might be better to add more tests of input checking in the main functions, so that a function can stop early when the input is incorrect.

Maybe some clarification on this comment is necessary. Are you suggesting adding more defensive checks to the functions?

It might be better if the output plot were displayed automatically; the current output is alt.HConcatChart(...).

We will have to discuss this to understand the reasoning behind the current behavior; it is certainly a valid critique!

Thank you again for your efforts in providing feedback. We will use your input to help polish our package!

aridanalysis team
