camadi opened this issue 4 years ago
[Draft review - work in progress]
The package includes all the following forms of documentation:
- `pp.data_type(my_data)['num']` returns a data frame with numerical columns only from the original data frame. A usage example such as `text_columns = pp.data_type(my_data)['num']` would help with interpreting what these functions do.
- The description of `pp.scaler` seems off. There's a bunch of `:param` that shows up.
- `setup.py` file or elsewhere.
CONTRIBUTORS.md
Readme requirements: The package meets the readme requirements below:
The README should include, from top to bottom:
`sklearn` and if this code can be used within pipelines
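A usage example along the lines suggested above might look like the sketch below. The `data_type` function here is a hypothetical stand-in based only on the behaviour described in this review (the real preppy524 API may differ); it illustrates how a short documented example clarifies what the `'num'` key returns.

```python
import pandas as pd

# Hypothetical stand-in for preppy524's data_type, based on the behaviour
# described in this review: split a dataframe into one dataframe per type.
def data_type(df):
    return {
        "num": df.select_dtypes(include="number"),
        "text": df.select_dtypes(include="object"),
    }

my_data = pd.DataFrame({"age": [21, 35], "name": ["Ada", "Bob"]})
numeric_columns = data_type(my_data)["num"]  # numerical columns only
text_columns = data_type(my_data)["text"]    # text columns only
```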
```
ERROR: Could not find a version that satisfies the requirement pytest-cov<3.0.0,>=2.8.1 (from preppy524) (from versions: none)
ERROR: No matching distribution found for pytest-cov<3.0.0,>=2.8.1 (from preppy524)
```
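Dependency errors like the ones above are typical when installing from TestPyPI, which does not mirror dependencies such as `pytest-cov`. A common workaround (a sketch; the index URLs are the standard PyPI/TestPyPI ones, and the package name is taken from this thread) is to let pip fall back to regular PyPI for dependencies:

```shell
# Install preppy524 from TestPyPI, but resolve its dependencies from PyPI
pip install --index-url https://test.pypi.org/simple/ \
            --extra-index-url https://pypi.org/simple/ \
            preppy524
```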
- [x] **Functionality:** Any functional claims of the software have been confirmed.
- Checked by cloning repository and trying functions
- No obvious errors
- [x] **Performance:** Any performance claims of the software have been confirmed.
- [x] **Automated tests:** Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- Might be worth it for future collaborators to document your tests better - a one-line comment should do here
- Test functions could also be named better (e.g., in [test_datatype.py](https://github.com/UBC-MDS/PrepPy/blob/master/tests/test_datatype.py), `test_datatype1` could be renamed to `test_categorical_data`; other files also have test function names like "test1", "test2", etc.)
- Tests in `test_scaler.py` are just one big test. If something fails, it would be unclear which check failed since the tests aren't broken out into unit tests.
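As an illustration of breaking one big test into unit tests, the sketch below uses a hypothetical standard-scaling function (not preppy524's actual `scaler`); each test gets a descriptive name and a one-line comment so a failure points straight at the broken behaviour:

```python
# Hypothetical stand-in for a standard scaler: centre values and
# divide by the (population) standard deviation.
def scaler(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / var ** 0.5 for v in values]

def test_scaler_output_is_zero_mean():
    # Scaled values should be centred around zero
    scaled = scaler([1.0, 2.0, 3.0])
    assert abs(sum(scaled)) < 1e-9

def test_scaler_output_is_unit_variance():
    # Scaled values should have unit variance
    scaled = scaler([1.0, 2.0, 3.0])
    var = sum(v ** 2 for v in scaled) / len(scaled)
    assert abs(var - 1.0) < 1e-9
```

With this layout, pytest reports the failing function by name, so the broken behaviour is obvious without digging through one long test body.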
- [x] **Continuous Integration:** Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
- Seems like the release workflow is failing at the style check.
- Try using `autopep8 --in-place <filename>` on your code to fix a lot of these issues
- [x] **Packaging guidelines**: The package conforms to the pyOpenSci [packaging guidelines](https://www.pyopensci.org/dev_guide/packaging/packaging_guide.html).
Estimated hours spent reviewing: 2.5
---
#### Review Comments
Overall, quite a useful package that resolves some of the pain points I had with `sklearn`. I would definitely download and use your package! I think some minor fixes in explaining the functionality and documenting the tests would make it more appealing for potential users to get introduced to and start using your package. Good work!
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
`setup.py` file or elsewhere.
Readme requirements: The package meets the readme requirements below:
The README should include, from top to bottom:
I think overall this package is well implemented and has a strong use case. It helps streamline a lot of annoying and repetitive processes when cleaning data. There were some small issues with the documentation, as well as with error and edge-case test coverage, that could use improvement, but the functions themselves work very well. With some polish, this is definitely a package I would consider using in the future.
Hi James,
Thank you for your valuable feedback; we have addressed the following items:
We added more examples to the usage section in the README instead of one line of code, which did not seem adequate for users who have just started using our package.
We changed the name of the package from `PrepPy` to `preppy524` last week, and the instructions in the README were outdated. They have been updated to the latest version now!
Comments on functionality were added to our test cases. I agree with you that it's helpful to have clearly documented tests so that users know what went wrong.
We will work on polishing our functions to make them more useful, as well as implementing more edge-case test coverage, in the future.
The most recent release can be found through this link.
Thanks,
Jasmine
Hi @moniquewong,
Thank you for your valuable feedback! We have addressed the following items:
Installation by following instructions in the README: We have updated the README with the proper instructions. We had an issue with testPyPI and had to change the name of the function last minute.
Updated test documentation: each test now has a one-line comment about what it does, and the tests have more sensible names to make what is happening clearer if something fails.
Updated the README with more information and documentation on how to use the package.
The most recent release can be found here.
Thanks,
Matt
Submitting Authors: George Thio (@gptzjs), Matthew Connell (@matthewconnell), Jasmine Qin (@jasmineqyj), Chimaobi Amadi (@camadi)
Package Name: preppy524
One-Line Description of Package: A Python package for data preprocessing for machine learning
Repository Link: https://github.com/UBC-MDS/PrepPy
Version submitted: v1.2.0
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Monique Wong (@moniquewong)
Reviewer 2: Mengzhe Huang (@Jamesh4)
Archive: TBD
Version accepted: TBD
Description
preppy524 is a Python package that helps with preprocessing in machine learning tasks. There are certain repetitive tasks that come up often in a machine learning project, and this package aims to alleviate those chores. Some of the issues that come up regularly are: finding the types of each column in a dataframe, splitting the data (whether into train/test sets or train/test/validation sets), one-hot encoding, and scaling features. This package will help with all of those tasks.
Scope
* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
- `train_valid_test_split`: This function splits the data set into train, validation, and test sets.
- `data_type`: This function identifies data types for each column/feature. It returns one dataframe for each type of data.
- `one-hot`: This function performs one-hot encoding on the categorical features and returns a dataframe for the train, test, and validation sets with sensible column names.
- `scaler`: This function performs standard scaling on the numerical features.

Target audience: Machine Learning Engineers, Data Scientists, students, and anyone else interested in preprocessing data before running machine learning models.

No single package performs the four different functions of `preppy524`, but there are some functions that do parts of what the `preppy524` package does.

@tag the editor you contacted: None
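For context, the three-way split described above can be sketched in a few lines of plain Python (an illustration of the idea only, not preppy524's actual implementation; the 60/20/20 proportions and the `seed` default are hypothetical):

```python
import random

def train_valid_test_split(rows, valid_size=0.2, test_size=0.2, seed=524):
    # Shuffle a copy so the original data is untouched, then cut three slices.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_size)
    n_valid = int(n * valid_size)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test

# 60/20/20 split of 100 rows: 60 train, 20 validation, 20 test
train, valid, test = train_valid_test_split(list(range(100)))
```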
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
No
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Code of conduct
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
Editor and review templates can be found here