UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: nursepy (Python) #22

Open merveshin opened 4 years ago

merveshin commented 4 years ago

name: Submit Software for Review
about: Use to submit your Python package for peer review
title: ''
labels: 1/editor-checks, New Submission!
assignees: ''


Submitting Author: Group 24 (@merveshin, @evhend, @elliott-ribner)
Package Name: nursepy
One-Line Description of Package: A Python package for streamlining the front end of the machine learning workflow.
Repository Link: https://github.com/UBC-MDS/nursepy
Version submitted: v2.1.0
Editor: @kvarada
Reviewer 1: @annychih
Reviewer 2: @jasmineqyj
Archive: TBD
Version accepted: TBD


Description

nursepy aims to streamline the front end of the machine learning pipeline by generating descriptive summary tables and figures, automating feature imputation, and automating preprocessing. Automated feature imputation and preprocessing detection have been implemented to minimize time and to optimize the processing methods used. The functions in nursepy were developed to provide useful and informative metrics that are applicable to a wide array of datasets.
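For orientation, here is a minimal usage sketch. The three function names come from the submission, but the import path and signatures are placeholders and may differ from the package's actual API (for example, the reviews below indicate that impute() takes separate training and validation sets).

```python
import pandas as pd
from nursepy import eda, impute, preproc  # hypothetical import path

df = pd.read_csv("train.csv")  # any tabular data set, possibly with missing values

results = eda(df)        # assumed: summary tables plus Altair figures
imputed = impute(df)     # assumed: automated detection and imputation of NaNs
processed = preproc(df)  # assumed: one-hot encoding, scaling, and label encoding
```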

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

nursepy automates plotting and summary statistics for exploratory data analysis. It also handles NaN values and preprocesses the data, including one-hot encoding, scaling, and label encoding.

The target audience is anyone interested in analyzing and preprocessing data before running machine learning models.

Other Python packages offer some of this functionality individually (for example, pandas' describe() and Altair), but nursepy combines these functions so that much of the analysis can proceed more easily.
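To make that comparison concrete, here is a rough sketch of the manual steps nursepy aims to bundle, written directly against pandas and scikit-learn; the data set is hypothetical and the mapping to nursepy's functions is approximate.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

df = pd.read_csv("train.csv")           # hypothetical data set with a "target" column
df = df.fillna(df.mode().iloc[0])       # crude NaN handling, done by hand

summary = df.describe()                 # numeric summary statistics only, no figures
onehot = OneHotEncoder().fit_transform(df.select_dtypes("object"))
scaled = StandardScaler().fit_transform(df.select_dtypes("number"))
labels = LabelEncoder().fit_transform(df["target"])
```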

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option allows reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser, text-based review. It also allows you to demonstrate addressing the issues via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

annychih commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

annychih: I like the table in the README which outlines the 3 functions' input, output, and descriptions. It's nicely formatted and provides a good overview of the functions. I also liked the step-by-step instructions under the 'Usage' section, but I noticed that the impute() and preproc() functions have descriptions, whereas the eda() function doesn't. If you have time, it might be nice to add a line for the eda() function in case a user scrolls past the 'Features' section and goes directly to 'Usage' for guidance on how to use the package functions.

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

annychih: I wasn't able to install the package using the installation instructions: pip install -i https://test.pypi.org/simple/ nursepy (screenshot of the error below), so I cloned the repo and tested the functions in Terminal and in a Jupyter notebook.

[Screenshot (2020-03-16): pip installation error]
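A common cause of this kind of TestPyPI failure is that the package's dependencies are not hosted on test.pypi.org. If that is what the error shows, adding the main PyPI index as a fallback usually works; this is a general pip workaround, not something verified against nursepy's setup:

```
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ nursepy
```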

eda() Function Feedback: Using Terminal, I was unable to see the chart output by the eda() function (as expected), so I tested the function in a Jupyter notebook. There, I could only see the chart after adding another line of code: alt.renderers.enable('notebook'). Users may wonder why the chart does not appear when they call the function from Jupyter Notebook / Lab, so it may be beneficial to add a warning telling them that they need to run this additional line of code to see the output chart.
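For reference, the extra line mentioned above is a standard Altair renderer setting; a minimal sketch of what a notebook user would run:

```python
import altair as alt

# Per the review above, charts from eda() may not display in the classic
# Jupyter Notebook until Altair's notebook renderer is enabled.
alt.renderers.enable('notebook')
```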

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 2.5


Review Comments

Note: Please also see comments above that are specific to package Documentation and Functionality.

The nursepy package includes some nice functions to assist in the data munging and visualization stages of an analysis. The functions are straightforward, easy to understand, and easy to use. Well done!

I particularly like the impute() function and think future developments of this function to include other methods (in addition to the currently offered RandomForestClassifier and RandomForestRegressor) will make it even more beneficial to the Python community!

Once the installation and eda() chart display issues (noted above) are addressed, I feel this package would be a good helper for data scientists.

jasmineqyj commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 2


Review Comments

Documentation

The README file is well-structured and descriptive. I found it very easy to follow the instructions and to get an idea of the functionality. The Features section in the README is creative, giving users a quick and clear summary of the contents.

Some minor suggestions:

Functionality

The required version of altair is 3.2.0, so I downgraded mine from 4.0.1 and successfully installed your package. Well done!
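For anyone else hitting the same version conflict, the downgrade is just a pinned install (version number taken from this review):

```
pip install "altair==3.2.0"
```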

[Screenshot: documentation issue]

Overall, I think this package is very helpful and will become a valuable addition to the Python ecosystem for people who work with machine learning pipelines a lot!

evhend commented 4 years ago

Thank you for your feedback, @annychih. In response to your comments:

Addressed

| Note | Response |
| --- | --- |
| `eda` description in Usage | A short description has been added |
| Package installation | Seems to have been a common issue across Python packages; installation instructions have been addressed and the README updated |
| `eda` output | Added to Usage that the function should be run in Jupyter Lab (or another IDE) to view the output |

PR incorporating changes:

New Release with changes: v3.0.0

Not Addressed

| Note | Response |
| --- | --- |
| `impute` option of selecting other models | Taken into consideration |
evhend commented 4 years ago

Thank you for your feedback, @jasmineqyj. In response to your comments:

Addressed

| Note | Response |
| --- | --- |
| Documentation: more concise usage | Taken into consideration; I would suggest that Read the Docs is a good place for a concise description |
| Functionality: `alt.data_transformers.disable_max_rows()` | Comment added to Usage (this depends on the data set being passed in and should be specified by the user; see the sketch below) |
| Functionality: `preproc` and `eda` missing data in Read the Docs | Docstrings correctly specify parameters; this appears to be a bug when uploading to Read the Docs |
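For context, the call referenced in the table above is a real Altair setting that lifts the default 5,000-row limit; leaving it to the user, as suggested, would look like this on their side:

```python
import altair as alt

# Altair raises MaxRowsError for data frames larger than 5,000 rows;
# a user working with a large data set opts out explicitly before plotting.
alt.data_transformers.disable_max_rows()
```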

PR incorporating changes:

New Release with changes: v3.0.0

Not Addressed

| Note | Response |
| --- | --- |
| Documentation: consistency in docstrings (`eda`) | To be addressed in future iterations |
| Functionality: `impute` option of not specifying a validation set; `impute` different model selection; `impute` `missing_value_counts_` for the validation set | These are interesting points and will be considered in future iterations. The structure of the function may need to be adjusted so that if a validation set is not specified, the function splits the data for the purposes of scoring and then performs imputation on the entire set once the optimal imputation method is determined (see the sketch below). |
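Purely to illustrate that proposed restructuring, here is a minimal sketch using scikit-learn; the function name, candidate strategies, and scoring model are placeholders, not nursepy's actual implementation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split


def impute_without_validation_set(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Score candidate imputation strategies on an internal split, then impute all of X."""
    # Assumes numeric features and a classification target.
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=123)

    scores = {}
    for strategy in ("mean", "median", "most_frequent"):
        imputer = SimpleImputer(strategy=strategy)
        model = RandomForestClassifier(random_state=123)
        model.fit(imputer.fit_transform(X_train), y_train)
        scores[strategy] = model.score(imputer.transform(X_valid), y_valid)

    # Refit the winning strategy on the full data set and impute everything.
    best = max(scores, key=scores.get)
    imputed = SimpleImputer(strategy=best).fit_transform(X)
    return pd.DataFrame(imputed, columns=X.columns, index=X.index)
```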