Submission: samplingsimulatorpy (Python)

Submitting Author: Name (@hwilliams10, @lisebraaten, @tguo9, @YueJiangMDSV)
Package Name: samplingsimulatorpy One-Line Description of Package: Create, sample from, and view histograms from virtual populations. Repository Link: https://github.com/UBC-MDS/samplingsimulatorpy Version submitted: 1.1.0 Editor: @kvarada Reviewer 1: @chuusan Reviewer 2: @slbsolomon Archive: TBD
Version accepted: TBD

Description

samplingsimulatorpy is a Python package that allows users to generate virtual populations which can be sampled from in order to compare and contrast sample vs sampling distributions for different sample sizes. The package also allows users to sample from the generated virtual population (or any other population), plot the distributions, and view summaries for the parameters of interest.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [ ] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [x] Education
- [ ] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Explain how the and why the package falls under these categories (briefly, 1-2 sentences):

This package is intended to assist in teaching and/or learning basic statistical inference by allowing users to generate virtual populations to compare and contrast sampling vs sample distributions and parameters.

Who is the target audience and what are scientific applications of this package?

The target audience is instructors and/or students teaching or learning basic statistical inference.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

To the best of our knowledge, there is currently no existing Python package with the specific functionality to create virtual populations and make the specific sample and sampling distributions described above. We do make use of many existing Python packages and expand on them to make very specific functions. These include:

scipy.stats to get distribution functions
np.random to generate random samples
Altair to create plots

Python pandas already includes some summary statistics functions such as .describe(), however our package will be more customizable. Our summary will only include the statistical parameters of interest and will provide a comparison between the sample, sampling, and true population parameters.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

N/A

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[X] does not violate the Terms of Service of any service it interacts with.
[X] has an OSI approved license
[X] contains a README with instructions for installing the development version.
[X] includes documentation with examples for all functions.
[X] contains a vignette with examples of its essential functions and uses.
[X] has a test suite.
[X] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[X] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[ ] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1

Review Comments

The README is very readable - I liked seeing the author information in a table!
Under Usage, under the stat_summary function and Arguments subheading, the description of samples is listed as "the drawed samples". For grammatical correctness, may I recommend changing it to "the drawn samples"
The link to readthedocs in the README under Documentation wasn't working, but I was able to get to the page via the badge
I may have missed it, but I only saw credits for use of the cookiecutter template, and didn't see anything for Python and your dependencies - please let me know if I missed it, otherwise I would recommend this site for generating python citations: http://citebay.com/
In examples of test files, under examples, I would recommend not putting the example as a variable for ease of use, ie. change from samples = draw_samples(pop, 3, [5, 10, 15, 20]) to draw_samples(pop, 3, [5, 10, 15, 20]) - not sure if there is a reason for you putting these into a variable, so if there is, please disregard! This is applicable to draw_samples and generate_virtual_pop
Under Usage, I would recommend adding import numpy as np as it is required for an input in generate_virtual_pop
Under Usage, there is inconsistency with function names - some places it is plot_sample_dist and plot_sampling_dist, and elsewhere it is plot_sample_hist and plot_sampling_hist, my original imports failed because of this when I was following the Example Usage Scenario
I quite liked the feature of the plotting functions that the graphs were automatically side by side - made it easy to read and compare
For the purposes of consistency, I would consider either making the inputs for generate_virtual_pop and stat_summary that are numpy objects both either numpy objects or strings (if you choose to make both strings, my comment about adding import numpy as np to Usage is irrelevant)
Documentation was all very readable! Kudos to you guys on great work!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all user-facing functions
[X] Examples for all user-facing functions
[X] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[X] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[X] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[X] The package name
[X] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[X] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[X] Installation instructions
[X] Any additional setup required (authentication tokens, etc)
[X] Brief demonstration usage
[X] Direction to more detailed documentation (e.g. your documentation files or website).
[X] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[ ] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software been confirmed.
[X] Performance: Any performance claims of the software been confirmed.
[X] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[X] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[X] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2 hours

Review Comments

Summary

It is super efficient to deploy such a great package do statistical inferencing compared to the traditional cumbersome way. Great job, guys! The development follows a standard python package creation process with detailed documentation, fully covered tests, and is successfully built and released. The package has a range of useful functions. Meanwhile, the naming is very intuitive. Overall the user experience is good. Below are detailed comments and some suggestions from different perspectives:

Installation

The installation process is very smooth and very fast. The installation instruction is very clear!

Functionality

generate_virtual_pop(size, population_name, distribution_func, *para)
- I highly recommend the developer integrate the import of numpy within the function so that users don’t need to do it again.
- Also, the way of importing function from the package should be improved so that the example from README file should work.
- The distribution_func argument is not very user-friendly. Users have to type np.random before the real distribution, and when we want to find the right name of the distribution we have to refer documentations of np.random. Therefore, a better way to integrated numpy function with the function is expected.
draw_samples(pop, reps, n_s)
- The type of n_s argument is list according to the documentation. A more general intake type is expected, e.g. int.
plot_sampling_hist(samples)
- Please noted that in the README, I believe that there is a typo of the name of the function.
stat_summary(pop, samples, parameter)
- I think it will be much more convenient if the function can return statistics metrics such as mean and std by default. A good reference is the design of well-known describe() function. For users, it is cumbersome to specify the parameter using a list and it required a special format e.g. ['np.mean', 'np.std'].

Documentation

The function documentations are well-designed and succinct. However, one thing I want to bring to authors' attention is that documentation for function [stat_summary](https://samplingsimulatorpy.readthedocs.io/en/latest/source/samplingsimulatorpy.html#module-samplingsimulatorpy.stat_summary) is wrongly formatted. Screen Shot 2020-03-21 at 1 59 24 AM

Tests

It is impressive to have 100% coverage. Well done! I only have some minor suggestion in terms of the brief explanation of each test.

test_draw_samples(): The short introduction is missing.
test_stat_summary: It is a good practice to have a short introduction to the test function like the other test functions in your package. However, users might be a little bit confused about what the TypeError error is, which argument the test is referring to. Therefore, a more explicit expression is expected.

Others

In CONTRIBUTORS file, it is nice to list contributors’ Github handles. It would be perfect, if the handle is linked.

Conclusion

I am excited to see the further development of the package! Once again, good job, guys!

Thank you @slbsolomon and @chuusan for your reviews and feedback! Please see our release v.2.0.0 for the updates we have made based on your recommendations.

The following changes have been made based on your feedback:

Updated Usage section of README to fix grammatical errors, typos and inconsistency of function names
Added import numpy as np to Example Usage Scenario
Added References section to README that credits Python and other dependencies
Updated inputs for generate_virtual_pop and stat_summary to all be numpy objects (by updating any strings to numpy objects)
Fixed formatting of stat_summary documentation
Added missing short introductions and comments in test functions and updated TypeError expressions to be more explicit
Added authors GitHub handles to CONTRIBUTORS file

@slbsolomon - all four of us have tested the Read the Docs link in the README and were able to access the documentation. I'm glad you were able to access through the badge - not sure why the link wasn't working for you but hopefully it is now. As for assigning a variable name to our example use of generate_virtual_pop and draw_samples, we had done this since our plotting functions rely on these outputs as inputs.

@chuusan - the choice of making the expected type of the n_s argument (which we have since updated to be named sample_size for clarity) a list was intentional as the function expects multiple integer values as input.

Thank you again for all your work on reviewing our package. We really appreciate your time and effort!

UBC-MDS / software-review