Open xXJohamXx opened 1 year ago
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
The simple quick start example does not work for me, e.g.:

    >>> from sanityze import Cleanser, EmailSpotter
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: cannot import name 'Cleanser' from 'sanityze' (/opt/miniconda3/envs/fxtracker/lib/python3.9/site-packages/sanityze/__init__.py)
`pyproject.toml` file or elsewhere.
Readme file requirements
The package meets the readme requirements below:
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:
- Function documentation could be longer.
- It is not straightforward to use `add_spotter()`, and it is not easy to guess the unique ID when calling `remove_spotter()`.
- Incorrect syntax for `assert` was used, e.g. `assert(len(c.chain) > 0,"Cleanser should have at least one spotter in the chain")`. It should be `assert len(c.chain) > 0, ("Cleanser should have at least one spotter in the chain")`.
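To show why the parenthesised form is a problem, here is a minimal sketch. The empty list `chain` is a hypothetical stand-in for `c.chain`, not the package's actual object. Wrapping the condition and the message in one pair of parentheses builds a tuple, and a non-empty tuple is always truthy, so the check can never fail.

```python
chain = []  # hypothetical stand-in for an empty spotter chain
MESSAGE = "Cleanser should have at least one spotter in the chain"

# assert(condition, message) asserts a two-element tuple.
# A non-empty tuple is always truthy, so this check never fails.
tuple_form = (len(chain) > 0, MESSAGE)
assert tuple_form  # passes silently although the chain is empty
tuple_is_truthy = bool(tuple_form)

# The statement form evaluates the condition and reports the message on failure.
try:
    assert len(chain) > 0, MESSAGE
    raised = False
    failure_message = None
except AssertionError as exc:
    raised = True
    failure_message = str(exc)
```

CPython also flags the literal parenthesised form with a `SyntaxWarning` ("assertion is always true, perhaps remove parentheses?"), which is a quick way to spot remaining occurrences in the test suite.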
TO BE TESTED
The number of characters on some lines exceeds the 79 characters limit suggested by PEP 8.
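As a rough illustration of this check, the sketch below lists the lines of a source text that are longer than 79 characters. The helper name `find_long_lines` and the sample text are hypothetical; in practice flake8 reports the same condition as E501.

```python
MAX_LINE_LENGTH = 79  # limit suggested by PEP 8


def find_long_lines(source, limit=MAX_LINE_LENGTH):
    """Return (line_number, length) pairs for lines longer than the limit."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]


# Hypothetical sample: the second line is deliberately too long.
sample = "short = 1\nlong_value = '" + "a" * 80 + "'\n"
long_lines = find_long_lines(sample)
```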
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a `paper.md` matching JOSS's requirements with:
Estimated hours spent reviewing: 2.5 hours
The design of the package is quite robust as it can support different types of spotter, i.e. future enhancements can be easily achieved. Nonetheless, there are a few observations from me:
- There is a simple quick start example in the README, which is good. It would be better to provide a concrete example showing a dataframe's content before `clean` and after `clean`. I could only find similar information in the docstring of the functions.
- `flake8` can be included in `ci` so as to have an automatic style check.
- `cleaner.add_spotter(...)` should be `cleanser` (not `cleaner`).
- `getUID()` has been renamed to `getSpotterUID()`, so the one in the README is not up to date.
- In `class Cleanser`, "It's purpose is to clean the data frame" should be "The purpose is to clean the data frame". There is no harm in doing another check on the spelling.
- There is a simple quick start example in the README, which is good. It would be better to provide a concrete example showing a dataframe's content before `clean` and after `clean`. I could only find similar information in the docstring of the functions.
Good suggestion, we will look into it.
Regarding the following two badges, I believe we have them at the top of the README.md: a repostatus.org badge and Python versions supported.
Agreed, my apologies. I overlooked it. Updated the review.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
`pyproject.toml` file or elsewhere.
Readme file requirements
The package meets the readme requirements below:
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a `paper.md` matching JOSS's requirements with:
Estimated hours spent reviewing: 2
Overall this is a very useful and interesting package, great job! Here are a few pointers:
The structure of the `sanityze` repository is very well organized and well documented. The ReadMe is especially informative and does a great job of introducing what the package does, why it is useful, and how to use it.
In your quickstart example in the ReadMe.md, you have the following line:
from sanityze import Cleanser, EmailSpotter
However, this leads to an import error, as other reviewers have mentioned. Based on your script file names in `src/sanityze`, I believe the import instructions should be as follows:
from sanityze.cleanser import Cleanser
from sanityze.spotters import EmailSpotter
This way I have been able to successfully import the classes and functions.
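If the goal is to keep the shorter README import working, one common approach is to re-export the classes from the package's `__init__.py`. The sketch below demonstrates the mechanism on a throwaway package called `demo_pkg`, which is hypothetical and only mimics the module names mentioned above. It assumes the top-level import fails simply because nothing is re-exported, which may not be the actual cause in sanityze.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package (hypothetical name demo_pkg) with two submodules.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "cleanser.py").write_text("class Cleanser:\n    pass\n")
(pkg / "spotters.py").write_text("class EmailSpotter:\n    pass\n")
(pkg / "__init__.py").write_text("")  # empty: nothing is re-exported
sys.path.insert(0, str(root))

# With an empty __init__.py, the top-level import raises ImportError.
try:
    from demo_pkg import Cleanser
    top_level_import_worked = True
except ImportError:
    top_level_import_worked = False

# Re-export the classes from __init__.py, then reload the package.
(pkg / "__init__.py").write_text(
    "from .cleanser import Cleanser\n"
    "from .spotters import EmailSpotter\n"
)
importlib.invalidate_caches()
import demo_pkg

importlib.reload(demo_pkg)
from demo_pkg import Cleanser, EmailSpotter  # now succeeds
```

With such re-exports in place, both the short form and the explicit submodule imports shown above would work, so the README would not need to change.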
It could also be useful for your quickstart example on the readme to have an example dataframe populated with data for the user to easily test out how the function works. However, the complete examples in the linked documentation are very clear and helpful.
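A before/after illustration of this kind could look like the sketch below. It is not the package's implementation or API: to stay dependency-free it uses a plain dict of columns as a stand-in for a data frame, and the regular expression, the `redact_emails` helper, and the `<EMAIL>` placeholder are all assumptions made for the example.

```python
import re

# Simplified e-mail pattern, for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PLACEHOLDER = "<EMAIL>"  # hypothetical redaction token


def redact_emails(table):
    """Return a copy of the table with e-mail addresses in string cells replaced."""
    return {
        column: [
            EMAIL_PATTERN.sub(PLACEHOLDER, cell) if isinstance(cell, str) else cell
            for cell in cells
        ]
        for column, cells in table.items()
    }


# "Before": a small table that contains one e-mail address.
before = {
    "id": [1, 2],
    "note": ["contact jane.doe@example.com for details", "no personal data here"],
}

# "After": the same table with the address redacted.
after = redact_emails(before)
```

Printing `before` and `after` side by side in the README would give users the concrete picture the reviewers asked for.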
The code is organized into scripts in a logical structure, well-documented, and easy to read.
You could add a codecov badge to show the percentage of code that is covered by your automated tests.
Submitting Author: Name @tzoght
All current maintainers: (@tzoght, @xXJohamXx, @caesarw0)
Package Name: Sanityze
One-Line Description of Package: This package provides utilities to spot and redact PII from Pandas data frames.
Repository Link: https://github.com/UBC-MDS/sanityze
Version submitted: 0.1.3
Editor: @fdandrea
Reviewer 1: Chenyang Wang
Reviewer 2: Markus Nam
Reviewer 3: Marian Agyby
Reviewer 4: Chen Li
Description
Data scientists often need to remove or redact Personally Identifiable Information (PII) from their data. This package provides utilities to spot and redact PII from Pandas data frames.
PII can be used to uniquely identify a person. It includes names, addresses, credit card numbers, phone numbers, email addresses, and social security numbers. Regulations such as the European Union's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) therefore require that PII be removed or redacted from data sets before they are shared and further processed.
Scope
For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
Data munging: This package provides utilities to spot and redact PII from Pandas data frames.
Who is the target audience and what are scientific applications of this package?
Data scientists working with files that contain PII that will be used for analysis.
Are there other Python packages that accomplish the same thing? If so, how does yours differ?
Yes, the closest Python package in functionality to sanityze is scrubadub, which is a package for finding and removing PII from text. However, it is not designed to work with Pandas data frames or other data structures.
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a more dense text-based review. It will also allow you to demonstrate addressing the issue via PR links.
Code of conduct
Editor and Review Templates
The editor template can be found here.
The review template can be found here.