Open merveshin opened 4 years ago
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
setup.py
file or elsewhere.annychih: I like the table in the README which outlines the 3 functions' input, output, and descriptions. It's nicely formatted and provides a good overview of the functions. I also liked the step-by-step instructions under the 'Usage' section, but I noticed that the
impute()
andpreproc()
functions have descriptions, whereas theeda()
function doesn't. If you have time, it might be nice to add a line for theeda()
function in case a user scrolls past the 'Features' section and goes directly to 'Usage' for guidance on how to use the package functions.
Readme requirements The package meets the readme requirements below:
The README should include, from top to bottom:
annychih: I wasn't able to install the package using the installation instructions:
pip install -i https://test.pypi.org/simple/ nursepy
(screenshot of the error below), so I cloned the repo and tested the functions in Terminal and in a Jupyter notebook.
eda()
Function Feedback: Using Terminal, I was unable to see the chart that's outputted using theeda()
function (as expected), so I tested the function using Jupyter notebook. Using Jupyter notebook, I was able to see the chart after adding another line to the code:alt.renderers.enable('notebook')
. Users may wonder why the chart does not appear if they are calling the function using something like Jupyter notebook / labs. In these cases, it may be beneficial to provide aWarning
to let them know that if they are using Jupyter notebook / labs, they need to run an additional line of code to see the output chart.
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a paper.md
matching JOSS's requirements with:
Estimated hours spent reviewing: 2.5
Note: Please also see comments above that are specific to package Documentation and Functionality.
The
nursepy
package includes some nice functions to assist in the data munging and visualization stages of an analysis. The functions are straightforward, easy to understand, and easy to use. Well done!I particularly like the
impute()
function and think future developments of this function to include other methods (in addition to currently offeredRandomForestClassifier
orRandomForestRegressor
) will make it even more beneficial to the Python community!Once the installation and
eda()
function chart display issues (noted above) are addressed, I feel this package would be a good helper package for data scientists to use.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
setup.py
file or elsewhere.Readme requirements The package meets the readme requirements below:
The README should include, from top to bottom:
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a paper.md
matching JOSS's requirements with:
Estimated hours spent reviewing:
2h
Documentations
The README file is well-structured and descriptive. I found it very easy to follow the instructions and to get an idea of the functionalities. The Features
section in README is creative, giving users a quick and clear summary of the contents.
Some minor suggesions:
Your usage part is more like a documentation or a detailed example that walks through the functions. My suggestion is to make it more concise with fewer lines or a high-level conclusion. For example, the impute
function is not called until the third code chunk. The examples helped me a lot in testing, but there might also be users who want to skip examples and jump right into the line that calls the function.
It would be better to keep the examples in the docstrings of your individual .py
files consistent. For example, from nursepy import eda
does not seem like a currently working version.
Functionality
Required version of altair
is 3.2.0 so I downgraded mine from 4.0.1 and successfully installed your package. Well done!
eda()
: I found this function very useful to draw quick histograms compared to writing altair
code myself! Since I tested this in Jupyter Lab, just wanted to remind you that users might run into errors when rendering altair
plots. In particular for me, I had to add alt.data_transformers.disable_max_rows();
to disable row limit.
impute()
: I think this function is very powerful in terms of including various methods for imputation as well as a comprehensive dictionary of summaries. One suggestion is that users may want to impute data before splitting the dataset (e.g., to pass the entire train portion into cross-validation) so it might be a good idea to give them the option of not specifying a validation set.
I really like the idea of providing a baseline score using Random Forest, so I was wondering if there is a particular reason behind this because different ML model could provide very different results.
The missing_value_counts_
currently is only for train set, might be a good idea to add validation set information as well.
preproc()
: When I looked up your documentation before I tested the arguments in the function, I found out that some information is missing. This might be caused by the way you wrote the docstring (and similar issue for eda()
as well).
impute()
and preproc()
deal with missing data, perhaps it will be easier for users to find the right function if they are completely different. But I also think it's great to keep imputation in preproc()
because this function does a lot of things on its own.Overall I think this package is very helpful and will become a valuable addition to the Python family for people who use Machine Learning pipeline a lot!
Thank you for your feedback @annychih and in response to your comments:
Addressed
Note | Response |
---|---|
eda description to Usage |
A short description has been added |
Package Installation | Seems to have been a common issue throughout python packages.Installation instructions have been addressed and the README updated. |
eda output |
Added to Usage that function should be in Jupyter Lab (or other IDE) to view output |
PR incorporating changes:
New Release with changes: v3.0.0
Not Addressed
Note | Response |
---|---|
impute option of selecting other models |
Taken into consideration |
Thank you for your feedback @jasmineqyj and in response to your comments:
Addressed
Note | Response |
---|---|
Documentation:More concise usage | Taken into consideration; I would suggest that Read the Docs is a good place for a concise description |
Functionality:alt.data_transformers.disable_max_rows() |
Comment added to Usage (this depends on data set being passed in and should be specified by the user) |
Functionality: preproc and eda missing data in read the docs |
Docstrings correctly specify parameters, appears to be a bug when uploaded to read the docs |
PR incorporating changes:
New Release with changes: v3.0.0
Not Addressed
Note | Response |
---|---|
Documentation:Consistency in docstrings (eda ) |
To be addressed in future iterations |
Functionality:impute option of not specifying validation setimpute different model selectionimpute missing_value_counts_ for validation set |
These are interesting points and will be considered in future iterations. The structure of the function may need to be adjusted so that if a validation set is not specified, then the function splits the data for the purposes of scoring and then performs imputation on the entire set once the optimal imputation method is determined. |
name: Submit Software for Review about: Use to submit your Python package for peer review title: '' labels: 1/editor-checks, New Submission! assignees: ''
Submitting Author: Group 24 (@merveshin, @evhend, @elliott-ribner ) Package Name: nursepy One-Line Description of Package: A python package for streamlining the front end of the machine learning workflow. Repository Link: https://github.com/UBC-MDS/nursepy Version submitted: v2.1.0 Editor: @kvarada
Reviewer 1: @annychih
Reviewer 2: @jasmineqyj Archive: TBD
Version accepted: TBD
Description
nursepy
aims to streamline the front end of the machine learning pipeline by generating descriptive summary tables and figures, automating feature imputation, and automating preprocessing. Automated feature imputations and preprocessing detection have been implemented to minimize time and optimize the processing methods used. The functions innursepy
were developed to provide useful and informative metrics that are applicable to a wide array of datasets.Scope
* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
nursepy
automates the plotting process and the summary statistics while conducting Exploratory Data Analysis tasks. It will handle the NaN values and preprocess the data including one-hot encoding, scaling, and label encoding.Any person who is interested in analyzing and preprocessing data before running machine learning models.
There are other individual Python packages that have some similar functions(describe, Altair) but the functions contained in
nursepy
combines those functions in an elegant way to proceed much analysis easily.@tag
the editor you contacted:Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Code of conduct
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
Editor and review templates can be found here