Submission: pymleda (Python)

Submitting Author:

Sang Yoon Lee (@rissangs)
Yazan Saleh (@yaz-saleh)
Saule Atymtayeva (@Saule-Atymtayeva)
Tanmay Sharma (@tanmaysharma19)

Package Name: pymleda
One-Line Description of Package: Python package that helps with preliminary EDA for supervised machine learning tasks
Repository Link: https://github.com/UBC-MDS/pymleda
Version submitted: 0.2.5
Editor: Tiffany Timbers (@ttimbers)
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD

Description

Include a brief paragraph describing what your package does:
The package's main goal is to streamline preliminary EDA for a dataset given as a pandas dataframe. It contains functions and classes that help perform various data preparation and wrangling tasks such as data splitting, exploration, imputation, and scaling. These were identified as commonly performed tasks in supervised machine learning settings, but they may provide value in other project types as well. Specifically, the following four functions and classes are included:

- SupervisedData is a wrapper class that splits a pandas dataframe into train and test sets, and further into X and y subsets, based on a list of user-provided columns.
- dftype() returns the types of the columns and variables in the input dataframe. If there are non-numeric columns, it also returns the unique values of those columns and their length.
- autoimpute_na() identifies and imputes missing values for different attributes in a given pandas dataframe.
- dfscaling() applies standard scaling to the numerical features in a pandas dataframe.
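To make the wrapper-class idea concrete, here is a minimal sketch of the kind of splitting described above, written with plain pandas. It is not pymleda's implementation: `SplitData`, `split_supervised`, the `test_size` and `random_state` parameters, and the attribute names are all hypothetical, and the sketch assumes the dataframe has a unique index.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class SplitData:
    """Hypothetical stand-in for a SupervisedData-like wrapper (not pymleda's code)."""

    x_train: pd.DataFrame
    x_test: pd.DataFrame
    y_train: pd.DataFrame
    y_test: pd.DataFrame


def split_supervised(df, x_cols, y_cols, test_size=0.2, random_state=0):
    """Randomly split rows into train/test, then separate feature and target columns."""
    # Sample the test rows; the remaining rows form the training set.
    # Dropping by index label assumes the index is unique.
    test = df.sample(frac=test_size, random_state=random_state)
    train = df.drop(index=test.index)
    return SplitData(train[x_cols], test[x_cols], train[y_cols], test[y_cols])


df = pd.DataFrame(
    {
        "feature1": range(10),
        "feature2": list("abcdefghij"),
        "target": [0, 1] * 5,
    }
)
data = split_supervised(df, x_cols=["feature1", "feature2"], y_cols=["target"])
print(data.x_train.shape, data.x_test.shape)
```

Keeping the four subsets as attributes of one object means the user passes around a single variable instead of tracking four separate dataframes.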

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [ ] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences):
The pymleda package is intended to help with EDA for supervised machine learning tasks. It helps with tasks such as exploring variable types and summary statistics, imputing NAs, scaling and centering numerical columns, and splitting the data into training, test, X, and y subsets.
Who is the target audience and what are scientific applications of this package?
The target audience is ML and data science practitioners who would like to do preliminary EDA and wrangling of their dataset before moving on to other tasks in their pipeline.
Are there other Python packages that accomplish the same thing? If so, how does yours differ?
Existing packages such as scikit-learn and pandas contain some similar functionality. For example, pandas provides separate functions such as isnull(), isna(), and notna() to detect missing values, and fillna() and interpolate() to fill them. Our pymleda package intends to augment the existing functionality of these packages with the goal of increasing ease of use. For example, imputing and scaling (via autoimpute_na() and dfscaling()) automatically identify the columns to modify. Similarly, dftype() returns one summary dataframe for numeric columns (containing the output of pandas's describe()) and another for non-numeric columns (containing unique values). The SupervisedData class provides convenience attributes for accessing the train, test, X, and y portions of the dataset, relieving the user from having to keep track of the different variables.
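For comparison, the manual pandas workflow that the answer above refers to might look like the sketch below. The fill strategies (column mean for numeric columns, most frequent value otherwise) and the example data are assumptions made for illustration; the submission does not state which strategies autoimpute_na() uses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0, 31.0],
        "city": ["Vancouver", None, "Toronto", "Vancouver"],
    }
)

# Step 1: detect which columns contain missing values.
cols_with_na = df.columns[df.isna().any()].tolist()

# Step 2: fill each affected column with a value chosen by dtype
# (assumed strategies: mean for numeric, mode for everything else).
filled = df.copy()
for col in cols_with_na:
    if pd.api.types.is_numeric_dtype(filled[col]):
        filled[col] = filled[col].fillna(filled[col].mean())
    else:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

Each step here requires the user to inspect the columns and pick a fill call by hand, which is the bookkeeping the package aims to automate.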
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[ ] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[ ] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

[ ] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5 hours

Review Comments

Hi Team,

Overall, well done! Thanks for your fantastic work. Here are some suggestions that may help improve your package:

The installation failed when I tried to install the package. After looking into the error, a possible fix is: pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pymleda
When I tried to use SupervisedData by following the usage instructions, I got this error: NameError: name 'SupervisedData' is not defined. I solved it by adding from pymleda.pymleda import SupervisedData, which I suggest including in your usage instructions and documentation.
The hyperlink on the documentation about sklearn’s function documentation does not work.
It would be good to show more specific examples under the usage part on the README.
It would be nice to add all authors' names in the pyproject.toml file.

These are all minor pieces of advice that I would like to suggest. Good job! Good luck with your next block!

Best wishes, Tingyu

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of the package and any non-standard dependencies in README
[X] Vignette(s) demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all user-facing functions
[X] Examples for all user-facing functions
[X] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[ ] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[X] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[X] The package name
[X] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[X] Short description of goals of the package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[X] Installation instructions
[X] Any additional setup required (authentication tokens, etc)
[X] Brief demonstration usage
[X] Direction to more detailed documentation (e.g. your documentation files or website).
[X] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[X] Citation information

Usability

[X] The documentation is easy to find and understand
[ ] The need for the package is clear
[X] All functions have documentation and associated examples for use

Functionality

[ ] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software have been confirmed.
[X] Performance: Any performance claims of the software have been confirmed.
[X] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[X] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[X] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2

Review Comments

Hello everyone, first of all, congratulations on creating such a structured and detailed package.

The installation command is not working due to the package version of pandas. A possible solution is to include the --extra-index-url https://pypi.org/simple option so that the dependency versions are installed correctly, since this package is hosted on TestPyPI.
All tests passed during the review.
The README is clear in establishing the package within the Python ecosystem and in explaining the added value of SupervisedData and autoimpute_na(). Nonetheless, I think the other two functions need some further explanation for them to stand out.

In the usage instructions, I suggest replacing the first line below with the second so that the instructions are reproducible.

supervised_data = SupervisedData(df, x_cols = ['feature1', 'feature2'], y_cols = ['target'])

supervised_data = pymleda.SupervisedData(df, x_cols = ['feature1', 'feature2'], y_cols = ['target'])

I would suggest a clearer and deeper explanation of how to use the functions and classes in the README.
Perhaps include some metadata for future contacts like emails or professional profiles.
It would be great to include in the package justification what the current splitting/preprocessing options are lacking, and how this drove you to create the package.
The dfscaling function returns only the numeric features, not the whole original dataframe with the numeric features scaled. From reading the documentation, my understanding is that the number of columns of the original dataframe should be maintained. If I have misunderstood, it would be great to state this behavior in the docstring.
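To illustrate the behavior described in the last point, here is a minimal sketch of standard scaling that keeps every column of the original dataframe. It is not dfscaling's code: the helper name `scale_numeric_keep_all` is hypothetical, and the sketch uses plain pandas with the population standard deviation (ddof=0), which is the convention scikit-learn's StandardScaler follows.

```python
import pandas as pd


def scale_numeric_keep_all(df):
    """Standard-scale numeric columns and leave all other columns untouched.

    Hypothetical helper for illustration only; not part of pymleda.
    """
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    # Center to mean 0 and scale to unit variance (population std, ddof=0).
    # A constant column has std 0 and would become NaN here.
    out[num_cols] = (out[num_cols] - out[num_cols].mean()) / out[num_cols].std(ddof=0)
    return out


df = pd.DataFrame({"height": [150.0, 160.0, 170.0], "name": ["a", "b", "c"]})
scaled = scale_numeric_keep_all(df)
print(scaled)
```

With this approach the output has the same columns, in the same order, as the input, and only the numeric ones change.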

Overall, amazing work, I hope this review finds you well and that you have an amazing rest of your week, you deserve it. Sincerely, Santiago Rugeles Schoonewolff

Thank you @Tammy1128 and @ansarusc for your detailed reviews. We very much appreciate your input! We've fixed the installation instructions in the README as per your feedback. We are unable to address all of your concerns at this time, since active development of the package is being halted with the end of DSCI-524, as per our team's discussion. We will bear some of your suggestions in mind for our future development work and try to incorporate the best practices :-)
