UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: PrepPy (Python) #5

Open camadi opened 4 years ago

camadi commented 4 years ago

Submitting Authors: George Thio (@gptzjs), Matthew Connell (@matthewconnell), Jasmine Qin (@jasmineqyj), Chimaobi Amadi (@camadi)
Package Name: preppy524
One-Line Description of Package: A python package for data preprocessing for machine learning
Repository Link: https://github.com/UBC-MDS/PrepPy
Version submitted: v1.2.0
Editor: Varada Kolhatkar (@kvarada)
Reviewer 1: Monique Wong (@moniquewong)
Reviewer 2: Mengzhe Huang (@Jamesh4)
Archive: TBD
Version accepted: TBD


Description

preppy524 is a Python package that helps with preprocessing in machine learning tasks. Certain repetitive tasks come up often in a machine learning project, and this package aims to alleviate those chores. Some that come up regularly are: finding the type of each column in a dataframe, splitting the data (whether into train/test sets or train/test/validation sets), one-hot encoding, and scaling features. This package helps with all of those tasks.

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

train_valid_test_split: This function splits the data set into train, validation, and test sets.
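
As a rough illustration of the three-way split described above, here is a minimal standard-library sketch. The function name `three_way_split`, its parameters, and the default proportions are assumptions made for this example; they are not taken from preppy524, which works on dataframes.

```python
import random


def three_way_split(rows, valid_size=0.2, test_size=0.2, seed=0):
    """Shuffle rows and cut them into train, validation, and test lists.

    Hypothetical helper for illustration only; not preppy524's API.
    """
    if valid_size < 0 or test_size < 0 or valid_size + test_size >= 1:
        raise ValueError("valid_size and test_size must be >= 0 and sum to < 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    n_valid = int(round(len(shuffled) * valid_size))
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))  # 60 20 20
```

Every input row ends up in exactly one of the three lists, which is the property a split function of this kind needs to guarantee.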

data_type: This function identifies data types for each column/feature. It returns one dataframe for each type of data.
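
The idea of separating columns by type can be sketched without any dependencies. The example below assumes a table stored as a plain dict of column name to list of values and uses a hypothetical helper `split_columns_by_type`; the real package returns dataframes, so this only shows the concept.

```python
from numbers import Number


def split_columns_by_type(table):
    """Group columns into numeric and categorical dicts.

    Hypothetical helper for illustration only; not preppy524's API.
    """
    numeric, categorical = {}, {}
    for name, values in table.items():
        # bool is a subclass of int in Python, so it is excluded explicitly.
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        (numeric if is_numeric else categorical)[name] = values
    return numeric, categorical


table = {
    "age": [23, 35, 41],
    "height": [1.7, 1.8, 1.65],
    "city": ["Vancouver", "Lagos", "Shanghai"],
}
numeric, categorical = split_columns_by_type(table)
print(sorted(numeric))      # ['age', 'height']
print(sorted(categorical))  # ['city']
```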

one-hot: This function performs one-hot encoding on the categorical features and returns dataframes for the train, test, and validation sets with sensible column names.
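
A minimal sketch of one-hot encoding with readable column names follows. It assumes rows stored as dicts, and the helpers `fit_categories` and `one_hot` are hypothetical names for this example, not preppy524 functions. The categories are learned from the training rows only, so a value that appears only in the test set is encoded as all zeros; that is a design choice of this sketch, not a statement about the package.

```python
def fit_categories(train_rows, column):
    """Collect the sorted distinct values of `column` seen in the training rows."""
    return sorted({row[column] for row in train_rows})


def one_hot(rows, column, categories):
    """Replace `column` with one 0/1 column per category, named '<column>_<value>'."""
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


train = [{"color": "red", "size": 1}, {"color": "blue", "size": 2}]
test = [{"color": "green", "size": 3}]

categories = fit_categories(train, "color")
train_encoded = one_hot(train, "color", categories)
test_encoded = one_hot(test, "color", categories)
print(train_encoded[0])  # {'size': 1, 'color_blue': 0, 'color_red': 1}
print(test_encoded[0])   # {'size': 3, 'color_blue': 0, 'color_red': 0}
```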

scaler: This function performs standard scaling on the numerical features.
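
Standard scaling subtracts the mean and divides by the standard deviation. The sketch below is a standard-library illustration with hypothetical helpers `fit_scaler` and `transform`; it is not the package's implementation. It assumes the statistics are computed on the training data only and then reused for the other sets, and it uses the population standard deviation.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Return the mean and population standard deviation of the training values."""
    center = mean(train_values)
    spread = pstdev(train_values)
    # Guard against division by zero for constant columns.
    return center, (spread if spread > 0 else 1.0)


def transform(values, center, spread):
    """Apply (value - mean) / std using previously fitted statistics."""
    return [(value - center) / spread for value in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

center, spread = fit_scaler(train)
train_scaled = transform(train, center, spread)
test_scaled = transform(test, center, spread)
print(train_scaled[2], test_scaled[0])  # 0.0 0.0
```

Reusing the training statistics on the test set avoids leaking information from held-out data into the preprocessing step.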

Machine Learning Engineers, Data Scientists, students and any other person who is interested in preprocessing data before running machine learning models.

No single package performs all four functions of preppy524, but some existing functions cover parts of what the package does.

None

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

No

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

moniquewong commented 4 years ago

[Draft review - work in progress]

Package Review

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality


- [x] **Functionality:** Any functional claims of the software have been confirmed.
    - Checked by cloning repository and trying functions
    - No obvious errors
- [x] **Performance:** Any performance claims of the software have been confirmed.
- [x] **Automated tests:** Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
   - It might be worth documenting your tests better for future collaborators; a one-line comment per test should do.
   - Test functions could also be named more descriptively (e.g., in [test_datatype.py](https://github.com/UBC-MDS/PrepPy/blob/master/tests/test_datatype.py), `test_datatype1` could be renamed to `test_categorical_data`; other test functions also have names like "test1", "test2", etc.)
   - The tests in `test_scaler.py` are just one big test. If something fails, it would be unclear which check failed, since the tests aren't broken out into unit tests.
- [x] **Continuous Integration:** Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
   - It seems the release workflow is failing at the style check.
   - Try using `autopep8 --in-place <filename>` on your code to fix a lot of these issues
- [x] **Packaging guidelines**: The package conforms to the pyOpenSci [packaging guidelines](https://www.pyopensci.org/dev_guide/packaging/packaging_guide.html).
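
To illustrate the suggestion above about breaking one large test into small, descriptively named unit tests, here is a hedged pytest-style sketch. The `scale` function is a toy stand-in written for this example, not PrepPy's scaler, and the test names are only suggestions.

```python
from statistics import mean, pstdev


def scale(values):
    """Toy stand-in for the function under test (hypothetical, not PrepPy's scaler)."""
    center, spread = mean(values), pstdev(values)
    if spread == 0:
        raise ValueError("cannot scale a constant column")
    return [(value - center) / spread for value in values]


def test_scaled_mean_is_zero():
    """Scaled values should be centred on zero."""
    assert abs(mean(scale([1.0, 2.0, 3.0]))) < 1e-9


def test_scaled_std_is_one():
    """Scaled values should have unit standard deviation."""
    assert abs(pstdev(scale([1.0, 2.0, 3.0])) - 1.0) < 1e-9


def test_constant_column_raises():
    """A constant column cannot be scaled and should raise ValueError."""
    try:
        scale([2.0, 2.0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for a constant column")
```

With one behaviour per test function, a failing run names the exact property that broke, and the one-line docstrings double as the documentation requested above.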

Estimated hours spent reviewing: 2.5

---

#### Review Comments
Overall, this is quite a useful package that resolves some of the pain points I had with `sklearn`. I would definitely download and use your package! I think some minor fixes in explaining the functionality and documenting the tests would make it easier for potential users to get introduced to and start using your package. Good work!

jamesh4 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

Estimated hours spent reviewing: 3

Review Comments

I think overall this package is well implemented and has a strong use case. It helps streamline a lot of annoying and repetitive processes when cleaning data. There were some small issues with the documentation, and the test coverage for errors and edge cases could use improvement, but the functions themselves work very well. With some polish this is definitely a package I would consider using in the future.

jasmineqyj commented 4 years ago

Hi James,

Thank you for your valuable feedback. We have addressed the following items:

The most recent release can be found through this link.

Thanks,

Jasmine

matthewconnell commented 4 years ago

Hi @moniquewong

Thank you for your valuable feedback! We have addressed the following items:

The most recent release can be found here.

Thanks,

Matt