Group-10-Sanityze - Githubissues

Submitting Author: Name @tzoght
All current maintainers: (@tzoght, @xXJohamXx, @caesarw0)
Package Name: Sanityze
One-Line Description of Package: This package provides utilities to spot and redact PII from Pandas data frames.
Repository Link: https://github.com/UBC-MDS/sanityze
Version submitted: 0.1.3
Editor: @fdandrea
Reviewer 1: Chenyang Wang
Reviewer 2: Markus Nam
Reviewer 3: Marian Agyby
Reviewer 4: Chen Li

Description

Data scientists often need to remove or redact Personally Identifiable Information (PII) from their data. This package provides utilities to spot and redact PII from Pandas data frames.
PII can be used to uniquely identify a person. It includes names, addresses, credit card numbers, phone numbers, email addresses, and social security numbers. Regulations such as the European Union's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) therefore require that PII be removed or redacted from data sets before they are shared or further processed.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [ ] Data visualization*

Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
Data munging: This package provides utilities to spot and redact PII from Pandas data frames.
Who is the target audience and what are scientific applications of this package?

Data scientists working with files that contain PII that will be used for analysis.
Are there other Python packages that accomplish the same thing? If so, how does yours differ?

Yes, the closest Python package in functionality to sanityze is scrubadub, a package for finding and removing PII from text. However, scrubadub is not designed to work with Pandas data frames or other data structures.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.

[ ] Vignette(s) demonstrating major functionality that runs successfully locally.

The simple quick start example does not work for me. e.g.

>>> from sanityze import Cleanser, EmailSpotter
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'Cleanser' from 'sanityze' (/opt/miniconda3/envs/fxtracker/lib/python3.9/site-packages/sanityze/__init__.py)

[x] Function Documentation: for all user-facing functions.
[x] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for:
- [x] Continuous integration and test coverage,
- [x] Docs building (if you have a documentation website),
- [x] A repostatus.org badge,
- [x] Python versions supported,
- [x] Current package version (on PyPI / Conda).
- [x] License

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)

[x] Short description of package goals.
[x] Package installation instructions
[x] Any additional setup required to use the package (authentication tokens, etc.)
[x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
[x] Link to your documentation website.
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
[ ] All functions have documentation and associated examples for use

Function documentation could be longer.
[x] The package is easy to install

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.

It is not straightforward to use add_spotter(), and it is hard to guess the unique ID needed when calling remove_spotter().
[x] Performance: Any performance claims of the software been confirmed.
[ ] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
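The assert statement is written with parentheses around both the condition and the message, e.g.
```
assert(len(c.chain) > 0,"Cleanser should have at least one spotter in the chain")
```
Python treats the parenthesized pair as a non-empty tuple, which is always truthy, so this assertion can never fail. It should be
```
assert len(c.chain) > 0, "Cleanser should have at least one spotter in the chain"
```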
TO BE TESTED
[x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
  
  Some lines exceed the 79-character limit suggested by PEP 8.
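
The assert issue noted in the automated tests item can be reproduced in isolation. The sketch below is illustrative only: `buggy_check` and `fixed_check` are hypothetical helpers, not code from sanityze, and the tuple is built on a separate line so the example runs without Python's "assertion is always true" syntax warning.

```python
MESSAGE = "Cleanser should have at least one spotter in the chain"


def buggy_check(chain):
    # Same effect as `assert(len(chain) > 0, MESSAGE)`: the parentheses
    # build a two-element tuple, and a non-empty tuple is always truthy.
    condition_and_message = (len(chain) > 0, MESSAGE)
    assert condition_and_message
    return True


def fixed_check(chain):
    # Condition and message are separate operands of the assert statement.
    assert len(chain) > 0, MESSAGE
    return True


# The buggy form passes even for an empty chain.
print(buggy_check([]))

# The fixed form raises AssertionError for an empty chain.
try:
    fixed_check([])
except AssertionError as error:
    print("AssertionError:", error)
```

With the parenthesized form, a test that should fail on an empty chain passes silently, so affected tests do not actually check anything.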

For packages also submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2.5 hours

Review Comments

The design of the package is quite robust, as it can support different types of spotter, so future enhancements can be added easily. Nonetheless, I have a few observations:

There is a simple quick start example in the README, which is good. It would be better to provide a concrete example showing a data frame's content before and after cleaning. I could only find similar information in the docstrings of the functions.
For the user-facing functions, it would be clearer if the documentation stated the input parameter(s), the returned value(s), and their types.
It would be better to include flake8 in CI so that style is checked automatically.
The quick start example contains a typo: cleaner.add_spotter(...) should use cleanser (not cleaner).
I think the function getUID() has been renamed to getSpotterUID(), so the README is not up to date.
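
As one way to act on the documentation comment above, a docstring can list parameters, return value, and types in the NumPy style. This is only a sketch: `redact_digits` is a hypothetical function written for illustration and is not part of sanityze.

```python
def redact_digits(text, token="#"):
    """Replace every digit in a string with a placeholder.

    Parameters
    ----------
    text : str
        The input string that may contain digits.
    token : str, default "#"
        The placeholder written in place of each digit.

    Returns
    -------
    str
        A copy of ``text`` with each digit replaced by ``token``.

    Examples
    --------
    >>> redact_digits("call 555-0100")
    'call ###-####'
    """
    # Hypothetical helper for illustration only.
    return "".join(token if ch.isdigit() else ch for ch in text)


print(redact_digits("call 555-0100"))
```

Keeping an example in the docstring also lets tools such as doctest check that the documented output stays correct.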

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.

[ ] Vignette(s) demonstrating major functionality that runs successfully locally.

The simple quick start example does not work for me. e.g.

>>> from sanityze import Cleanser, EmailSpotter
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'Cleanser' from 'sanityze' (/opt/miniconda3/envs/fxtracker/lib/python3.9/site-packages/sanityze/__init__.py)

[x] Function Documentation: for all user-facing functions.
[x] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for:
- [x] Continuous integration and test coverage,
- [x] Docs building (if you have a documentation website),
- [ ] A repostatus.org badge,
- [ ] Python versions supported,
- [x] Current package version (on PyPI / Conda).
- [x] License

[x] Short description of package goals.
[x] Package installation instructions
[x] Any additional setup required to use the package (authentication tokens, etc.)
[x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
[x] Link to your documentation website.
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[x] Citation information

Usability

[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
[ ] All functions have documentation and associated examples for use

Function documentation could be longer.
[x] The package is easy to install

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.

It is not straightforward to use add_spotter(), and it is hard to guess the unique ID needed when calling remove_spotter().
[x] Performance: Any performance claims of the software been confirmed.
[ ] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
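The assert statement is written with parentheses around both the condition and the message, e.g.
```
assert(len(c.chain) > 0,"Cleanser should have at least one spotter in the chain")
```
Python treats the parenthesized pair as a non-empty tuple, which is always truthy, so this assertion can never fail. It should be
```
assert len(c.chain) > 0, "Cleanser should have at least one spotter in the chain"
```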
TO BE TESTED
[x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)
  
  Some lines exceed the 79-character limit suggested by PEP 8.

For packages also submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2.5 hours

Review Comments

First of all, the package is complete and detailed, and I can easily understand its purpose.
The code for this package is much more complicated than that of the respective R package. If both achieve the same purpose, is there any way to make the Python code simpler?
There is no discussion of why the MIT license was chosen.
The README shows no output. Showing some output there would help people looking for this kind of package confirm quickly that it is what they want, since there are many similar packages.
There are some typos in the text. For example, in the docstring of class Cleanser, "It's purpose is to clean the data frame" should be "The purpose is to clean the data frame". It would do no harm to run another spelling check.

There is a simple quick start example in the README, which is good. It would be better to provide a concrete example showing a data frame's content before and after cleaning. I could only find similar information in the docstrings of the functions.

Good suggestion, we will look into it.

Regarding the following two badges, I believe we have them at the top of the README.md:

A repostatus.org badge, Python versions supported

Agreed. My apologies, I overlooked them. I have updated the review.


Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README.
[x] Installation instructions: for the development version of the package and any non-standard dependencies in README.
[ ] Vignette(s) demonstrating major functionality that runs successfully locally.
[x] Function Documentation: for all user-facing functions.
[x] Examples for all user-facing functions.
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a pyproject.toml file or elsewhere.

Readme file requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for:
- [x] Continuous integration and test coverage,
- [x] Docs building (if you have a documentation website),
- [x] A repostatus.org badge,
- [x] Python versions supported,
- [x] Current package version (on PyPI / Conda).

[x] Short description of package goals.
[x] Package installation instructions
[x] Any additional setup required to use the package (authentication tokens, etc.)
[x] Descriptive links to all vignettes. If the package is small, there may only be a need for one vignette which could be placed in the README.md file.
- [x] Brief demonstration of package usage (as it makes sense - links to vignettes could also suffice here if package description is clear)
[x] Link to your documentation website.
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages in the scientific ecosystem.
[x] Citation information

Usability

[x] Package documentation is clear and easy to find and use.
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use
[x] The package is easy to install

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
- [x] Package supports modern versions of Python and not End of life versions.
- [ ] Code format is standard throughout package and follows PEP 8 guidelines (CI tests for linting pass)

For packages also submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: With DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2

Review Comments

Overall this is a very useful and interesting package, great job! Here are a few pointers:

The structure of the sanityze repository is very well organized and well documented. The ReadMe is especially informative and does a great job of introducing what the package does, why it is useful, and how to use it.
In your quickstart example in the ReadMe.md, you have the following line:

from sanityze import Cleanser, EmailSpotter

However, this leads to an import error, as other reviewers have mentioned. Based on your script file names in src/sanityze, I believe the import statements should be as follows:

from sanityze.cleanser import Cleanser
from sanityze.spotters import EmailSpotter

This way I have been able to successfully import the classes and functions.

It could also be useful for your quickstart example on the readme to have an example dataframe populated with data for the user to easily test out how the function works. However, the complete examples in the linked documentation are very clear and helpful.
The code is organized into scripts in a logical structure, well-documented, and easy to read.
You could add a codecov badge to show the percentage of code that is covered by your automated tests.
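
To illustrate the kind of before-and-after data frame example suggested above, here is a minimal sketch. It deliberately does not use the sanityze API: `redact_emails`, the simplified email regular expression, and the `[EMAIL]` placeholder are all assumptions made for illustration, and the package's own spotters may behave differently.

```python
import re

import pandas as pd

# Simplified email pattern, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(df, token="[EMAIL]"):
    """Return a copy of df with email addresses in string cells replaced."""
    redacted = df.copy()
    for column in redacted.columns:
        # Only string cells are rewritten; other values pass through as-is.
        redacted[column] = redacted[column].map(
            lambda value: EMAIL_PATTERN.sub(token, value)
            if isinstance(value, str)
            else value
        )
    return redacted


before = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "notes": ["Contact at jane.doe@example.com", "No email on file"],
    }
)
after = redact_emails(before)

print("Before cleaning:")
print(before)
print("After cleaning:")
print(after)
```

Printing both frames side by side in the README would let a reader see at a glance what is redacted and what is left untouched.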

UBC-MDS / software-review-2023

Group-10-Sanityze #6
