Submission Group 21: pytextprep (Python)

Submitting Authors: Philson Chan (@PhilsChan), Melisa Maidana (@mmaidana24318), Arijeet Chatterjee (@arijc76), Joshua Sia (@joshsia)

Package Name: pytextprep
One-Line Description of Package: Python package that offers additional text preprocessing functionality specifically designed for tweets
Repository Link: https://github.com/UBC-MDS/pytextprep
Version submitted: v1.0.5
Editor: TBD

Reviewer 1: Luke Collins (@LukeAC)
Reviewer 2: Kyle Ahn (@AraiYuno)
Reviewer 3: Tianwei WANG ()
Reviewer 4: Linhan Cai (@lipcai)

Description

Include a brief paragraph describing what your package does:

This is a Python package that offers additional text preprocessing functionality specifically designed for tweets. The package bundles functions to help with cleaning and gaining insight into tweet data, providing additional resources for EDA or enabling feature engineering.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [x] Data extraction
- [ ] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [x] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Explain how and why the package falls under these categories (briefly, 1-2 sentences):

This package provides functions to clean text data in tweets, extract hashtags, and visualize word clouds from the tweets.

Who is the target audience and what are scientific applications of this package?

Any data analyst or developer who needs to analyze Twitter data.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

Another package that is available for Twitter data is tweet_preprocessor. However, pytextprep offers more functionality in terms of feature engineering and data visualization.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements

The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[ ] Badges for continuous integration and test coverage, a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the badge for pyOpenSci peer-review will be provided upon acceptance.)
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[ ] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

[x] The documentation is easy to find and understand
[x] The need for the package is clear
[x] All functions have documentation and associated examples for use

Functionality

OS: Windows 10

Could not perform primary method of installation - encountered the following error:

ERROR: Failed building wheel for wordcloud

Could not perform secondary method of installation - encountered the following error:

git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

[ ] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software have been confirmed.
[ ] Performance: Any performance claims of the software have been confirmed.
[ ] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[ ] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[ ] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

30 minutes.

Review Comments

Summary: Looks like a very cool package - unfortunately wasn't able to test it out due to some issues with installation. Will circle back to this once resolved!

- It looks as though there is an unused/empty Python script that could be removed to clean up the project source code: https://github.com/UBC-MDS/pytextprep/blob/main/src/pytextprep/pytextprep.py
- Installation failed on Windows 10 following the primary installation instructions (i.e. `pip install pytextprep`) due to the wordcloud dependency.
- Installation failed on Windows 10 following the secondary installation instructions. The public key was not accepted by the git repo.
- Perhaps add a badge to the top of the README indicating the state of code coverage.
- I think it could be beneficial to group similar functions (e.g. extract_hashtags, extract_ngrams) into single Python modules. This would reduce the number of import statements required when using this package.
- Based on the example provided in the README, it's unclear to me whether the manual data processing steps (extract hashtags, remove punctuation, etc.) are required in order to generate a word cloud. If so, would it be possible to handle this text processing in the word-cloud-generating function itself, so that all the user would need to do to produce a word cloud is call the word-cloud function?
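
To illustrate the module-grouping suggestion above, here is a minimal sketch of what a single combined module might look like. Only the names extract_hashtags and extract_ngrams come from the package; the module layout, signatures, and regex-based bodies are assumptions made for illustration and are not pytextprep's actual implementation.

```python
import re
from typing import List

# Hypothetical combined module (for example, a single "extract" module).
# The bodies below are illustrative stand-ins, not the package's real code.


def extract_hashtags(tweet: str) -> List[str]:
    """Return the hashtags in a tweet, without the leading '#'."""
    return re.findall(r"#(\w+)", tweet)


def extract_ngrams(tweet: str, n: int = 2) -> List[str]:
    """Return whitespace-delimited word n-grams from a tweet."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    words = tweet.split()
    # Slide a window of n words across the tweet.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

With a layout along these lines, both functions could be pulled in with one import statement instead of one import per function.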

@LukeAC Thanks for your feedback!

Regarding the installation failure, we've updated our README to include clearer instructions on how to install our package. We suspect the failure occurred because `pip install wordcloud` could not build the wordcloud wheel on your machine. As a workaround, you can install wordcloud using `conda install -c conda-forge wordcloud -y` in the virtual environment and then run `pip install pytextprep`.
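
If it helps with debugging, the short check below is one way to confirm that wordcloud is visible in the active environment before retrying the pytextprep install. It is a sketch that assumes it is run with the same Python interpreter as the virtual environment; the helper name is_installed is ours and is not part of pytextprep.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("wordcloud"):
        print("wordcloud found; retry `pip install pytextprep`.")
    else:
        print("wordcloud not found; try `conda install -c conda-forge wordcloud -y` first.")
```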

Let us know if this works for you!

Also, to generate a word cloud, you do not need to pass the tweets through the other functions first. It's all handled in the word cloud function itself.
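
As a rough illustration of what "handled in the function itself" means, the sketch below cleans raw tweets and counts words in a single call. It is not the package's actual code: the name tweet_word_frequencies and the specific cleaning steps (dropping URLs, stripping punctuation, lowercasing) are assumptions chosen for the example.

```python
import re
import string
from collections import Counter
from typing import Dict, Iterable


def tweet_word_frequencies(tweets: Iterable[str]) -> Dict[str, int]:
    """Clean raw tweets and count word occurrences in one call (hypothetical helper)."""
    counts: Counter = Counter()
    strip_punct = str.maketrans("", "", string.punctuation)
    for tweet in tweets:
        text = re.sub(r"https?://\S+", "", tweet)   # drop URLs
        text = text.translate(strip_punct).lower()  # drop punctuation, lowercase
        counts.update(text.split())
    return dict(counts)
```

A word-to-count mapping like this is the kind of input that wordcloud's `WordCloud.generate_from_frequencies` accepts, so the caller would only ever pass in raw tweets.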

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).