Submission: nursepy(Python)

name: Submit Software for Review about: Use to submit your Python package for peer review title: '' labels: 1/editor-checks, New Submission! assignees: ''

Submitting Author: Group 24 (@merveshin, @evhend, @elliott-ribner ) Package Name: nursepy One-Line Description of Package: A python package for streamlining the front end of the machine learning workflow. Repository Link: https://github.com/UBC-MDS/nursepy Version submitted: v2.1.0 Editor: @kvarada
Reviewer 1: @annychih
Reviewer 2: @jasmineqyj Archive: TBD
Version accepted: TBD

Description

nursepy aims to streamline the front end of the machine learning pipeline by generating descriptive summary tables and figures, automating feature imputation, and automating preprocessing. Automated feature imputations and preprocessing detection have been implemented to minimize time and optimize the processing methods used. The functions in nursepy were developed to provide useful and informative metrics that are applicable to a wide array of datasets.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [x] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Explain how the and why the package falls under these categories (briefly, 1-2 sentences):

nursepy automates the plotting process and the summary statistics while conducting Exploratory Data Analysis tasks. It will handle the NaN values and preprocess the data including one-hot encoding, scaling, and label encoding.

Who is the target audience and what are scientific applications of this package?

Any person who is interested in analyzing and preprocessing data before running machine learning models.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

There are other individual Python packages that have some similar functions(describe, Altair) but the functions contained in nursepy combines those functions in an elegant way to proceed much analysis easily.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[ ] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x ] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s) demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all user-facing functions
[X] Examples for all user-facing functions
[X] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[X] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

annychih: I like the table in the README which outlines the 3 functions' input, output, and descriptions. It's nicely formatted and provides a good overview of the functions. I also liked the step-by-step instructions under the 'Usage' section, but I noticed that the impute() and preproc() functions have descriptions, whereas the eda() function doesn't. If you have time, it might be nice to add a line for the eda() function in case a user scrolls past the 'Features' section and goes directly to 'Usage' for guidance on how to use the package functions.

Readme requirements The package meets the readme requirements below:

[X] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[X] The package name
[X] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[X] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[X] Installation instructions
[X] Any additional setup required (authentication tokens, etc)
[X] Brief demonstration usage
[X] Direction to more detailed documentation (e.g. your documentation files or website).
[X] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[X] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software been confirmed.
[X] Performance: Any performance claims of the software been confirmed.
[X] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[X] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[X] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

annychih: I wasn't able to install the package using the installation instructions: pip install -i https://test.pypi.org/simple/ nursepy (screenshot of the error below), so I cloned the repo and tested the functions in Terminal and in a Jupyter notebook.

eda() Function Feedback: Using Terminal, I was unable to see the chart that's outputted using the eda() function (as expected), so I tested the function using Jupyter notebook. Using Jupyter notebook, I was able to see the chart after adding another line to the code: alt.renderers.enable('notebook'). Users may wonder why the chart does not appear if they are calling the function using something like Jupyter notebook / labs. In these cases, it may be beneficial to provide a Warning to let them know that if they are using Jupyter notebook / labs, they need to run an additional line of code to see the output chart.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2.5

Review Comments

Note: Please also see comments above that are specific to package Documentation and Functionality.

The nursepy package includes some nice functions to assist in the data munging and visualization stages of an analysis. The functions are straightforward, easy to understand, and easy to use. Well done!

I particularly like the impute() function and think future developments of this function to include other methods (in addition to currently offered RandomForestClassifier or RandomForestRegressor) will make it even more beneficial to the Python community!

Once the installation and eda() function chart display issues (noted above) are addressed, I feel this package would be a good helper package for data scientists to use.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:
2h

Review Comments

Documentations

The README file is well-structured and descriptive. I found it very easy to follow the instructions and to get an idea of the functionalities. The Features section in README is creative, giving users a quick and clear summary of the contents.

Some minor suggesions:

Your usage part is more like a documentation or a detailed example that walks through the functions. My suggestion is to make it more concise with fewer lines or a high-level conclusion. For example, the impute function is not called until the third code chunk. The examples helped me a lot in testing, but there might also be users who want to skip examples and jump right into the line that calls the function.
It would be better to keep the examples in the docstrings of your individual .py files consistent. For example, from nursepy import eda does not seem like a currently working version.

Functionality

Required version of altair is 3.2.0 so I downgraded mine from 4.0.1 and successfully installed your package. Well done!

eda(): I found this function very useful to draw quick histograms compared to writing altair code myself! Since I tested this in Jupyter Lab, just wanted to remind you that users might run into errors when rendering altair plots. In particular for me, I had to add alt.data_transformers.disable_max_rows(); to disable row limit.
impute(): I think this function is very powerful in terms of including various methods for imputation as well as a comprehensive dictionary of summaries. One suggestion is that users may want to impute data before splitting the dataset (e.g., to pass the entire train portion into cross-validation) so it might be a good idea to give them the option of not specifying a validation set.
I really like the idea of providing a baseline score using Random Forest, so I was wondering if there is a particular reason behind this because different ML model could provide very different results.
The missing_value_counts_ currently is only for train set, might be a good idea to add validation set information as well.
preproc(): When I looked up your documentation before I tested the arguments in the function, I found out that some information is missing. This might be caused by the way you wrote the docstring (and similar issue for eda() as well).

It seems like both impute() and preproc() deal with missing data, perhaps it will be easier for users to find the right function if they are completely different. But I also think it's great to keep imputation in preproc() because this function does a lot of things on its own.

Overall I think this package is very helpful and will become a valuable addition to the Python family for people who use Machine Learning pipeline a lot!

Thank you for your feedback @annychih and in response to your comments:

Addressed

Note	Response
`eda` description to Usage	A short description has been added
Package Installation	Seems to have been a common issue throughout python packages. Installation instructions have been addressed and the README updated.
`eda` output	Added to Usage that function should be in Jupyter Lab (or other IDE) to view output

PR incorporating changes:

PR 62
PR 63

New Release with changes: v3.0.0

Not Addressed

Note	Response
`impute` option of selecting other models	Taken into consideration

Thank you for your feedback @jasmineqyj and in response to your comments:

Addressed

Note	Response
Documentation: More concise usage	Taken into consideration; I would suggest that Read the Docs is a good place for a concise description
Functionality: `alt.data_transformers.disable_max_rows()`	Comment added to Usage (this depends on data set being passed in and should be specified by the user)
Functionality: `preproc` and `eda` missing data in read the docs	Docstrings correctly specify parameters, appears to be a bug when uploaded to read the docs

PR incorporating changes:

PR 62
PR 63

New Release with changes: v3.0.0

Not Addressed

Note	Response
Documentation: Consistency in docstrings (`eda`)	To be addressed in future iterations
Functionality: `impute` option of not specifying validation set `impute` different model selection `impute` `missing_value_counts_` for validation set	These are interesting points and will be considered in future iterations. The structure of the function may need to be adjusted so that if a validation set is not specified, then the function splits the data for the purposes of scoring and then performs imputation on the entire set once the optimal imputation method is determined.

UBC-MDS / software-review