UBC-MDS / software-review-2021


Submission: tweepyclean (Python) #3


calsvein commented 3 years ago

Submitting Author: Nash Makhija (@nashmakh), Matt (@MattTPin), Syad Khan (@syadk), Cal Schafer (@calsvein)
Package Name: tweepyclean
One-Line Description of Package: add-on functions to the tweepy package for Twitter data processing, word counts, and sentiment analysis
Repository Link: https://github.com/UBC-MDS/tweepyclean/tree/0.3
Version submitted: 0.3
Editor: Tiffany Timbers (@ttimbers)
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD


Description

tweepyclean is a Python package that processes data generated by the existing Tweepy package, producing clean data frames, data summaries, and new features.

Tweepy is a package built around Twitter's API and is used to scrape tweet information from their servers.

Our package provides functions that process the raw data from Tweepy into a more understandable format by extracting and organizing the contents of a user's tweets. tweepyclean is specifically built for analyzing a particular user's timeline (generated using tweepy's `api.user_timeline` function). Users can visualize average engagement by time of day posted, view basic summary statistics of word contents and tweet sentiment, and obtain a processed dataset for use in machine learning models.
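For concreteness, a minimal sketch of that workflow (the credentials are placeholders, and the `raw_df` function name is taken from the review comments further down this thread):

```python
import tweepy
import tweepyclean

# Authenticate with the Twitter API (placeholder credentials).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# Pull a user's timeline with tweepy, then hand the iterator to
# tweepyclean for processing into an analysis-ready dataframe.
tweets = tweepy.Cursor(
    api.user_timeline, id="some_user", tweet_mode="extended"
).items()
clean_dataframe = tweepyclean.raw_df(tweets)
```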

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see the notes on categories in our guidebook.

The tweepy package extracts tweet data, but not in a format that is ready for analysis. tweepyclean provides functions to convert tweepy-extracted data into a machine-readable dataframe, perform feature engineering, and create summary statistics and basic visualizations.

The package is intended strictly for those who are already using the tweepy package and have a Twitter API key.

Not that we are aware of.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option allows reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser text-based review. It also allows you to demonstrate addressing each issue via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here.

Editor and Review Templates

Editor and review templates can be found here

anodaini commented 3 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements

The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 2


Review Comments

yuyanguo commented 3 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements

The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 2.5


Review Comments

Hi team,

I enjoyed reviewing your amazing package. tweepyclean is very creative and interesting! Please find my comments below:

Installation

By running the installation command in the README, I got the following error:

ERROR: Could not find a version that satisfies the requirement textstat<0.8.0,>=0.7.0 (from tweepyclean) (from versions: 0.4.1, 0.5.0, 0.5.1, 0.5.2)
ERROR: No matching distribution found for textstat<0.8.0,>=0.7.0 (from tweepyclean)

This is because some of your package dependencies are not on TestPyPI, so I suggest adding the `--extra-index-url` argument to pull the dependencies from PyPI as follows: `pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple tweepyclean`

Features

Users of your package might not be completely clear on how some functions in tweepy work (I was not). It would be better to add a link to the documentation for `tweepy.Cursor()` to explain the following sentence from the Features section of the README: "The ability to generate a dataframe from the `tweepy.cursor.ItemIterator` object returned by calling `tweepy.Cursor(api.user_timeline, id=username, tweet_mode='extended').items()` with the tweepy package."

More importantly, I found that the README description of the function `sentiment_total()` is inconsistent with the source code and docstring: the README says it returns a line chart, but the function actually returns a dataframe.

Usage

I found the Usage section of the README a little too general and hard to follow. It would be better to include an actual example for each function along with its output, so that users can try it themselves and get a complete picture of the package. First, I recommend including `import tweepy` to make the section self-contained. Second, I recommend defining the objects `tweets`, `data`, and `clean_dataframe`. Finally, as mentioned above, I would highly recommend showing some example output plots and dataframes; see the sketch below.
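For instance, a Usage section along these lines would make the expected objects concrete (a sketch only; the credentials are placeholders, and the exact form of the `lexicon` argument to `sentiment_total` is assumed, since this review does not specify it):

```python
import tweepy
import tweepyclean

# Authenticate (placeholder credentials) and pull a user's timeline.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)
tweets = tweepy.Cursor(
    api.user_timeline, id="some_user", tweet_mode="extended"
).items()

# Process the raw timeline into a clean, analysis-ready dataframe.
clean_dataframe = tweepyclean.raw_df(tweets)

# Summarize sentiment. Per the source code and docstring, this returns
# a dataframe, not a line chart. `lexicon` stands in for whatever
# sentiment lexicon the function expects.
lexicon = ...
sentiment_df = tweepyclean.sentiment_total(clean_dataframe, lexicon)
```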

Similarly, the Usage section's description of the function `sentiment_total(data, lexicon)` is inconsistent with the source code and docstrings.

Docstring & Documentation

Most of the docstrings are not rendered properly on readthedocs. For example, the table in the `sentiment_total()` docstring comes out flattened, something like:

3 x 5 sentiment word_count total_words <chr> <int> <dbl> anger 1 4 disgust 2 4 fear 1 4 negative 2 4 sadness 1 4

This is because the table in the docstring cannot be rendered properly as written. Also, for many of the functions, the examples are rendered as sub-bullet points of the returns.

Moreover, the examples in your docstrings seem incomplete. For example, instead of just using `>>> raw_df(tweets)`, I suggest defining the object `tweets` in the example as well.
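One way to address both the rendering and the completeness issues is to put the table behind a reST literal block (a line ending in `::` followed by indented text) and leave blank lines between numpydoc sections so the examples don't nest under the returns. A sketch, with the table contents copied from the flattened rendering above:

```python
def sentiment_total(data, lexicon):
    """Summarize sentiment word counts for a processed tweet dataframe.

    Returns
    -------
    pandas.DataFrame
        One row per sentiment, rendered verbatim thanks to the
        literal block introduced by the double colon::

            sentiment  word_count  total_words
            anger               1            4
            disgust             2            4
            fear                1            4
            negative            2            4
            sadness             1            4

    Examples
    --------
    >>> tweets = tweepy.Cursor(api.user_timeline, id="some_user",
    ...                        tweet_mode="extended").items()
    >>> data = raw_df(tweets)
    >>> sentiment_total(data, lexicon)
    """
```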

Tests

I suggest writing more tests for the function `raw_df()`. For example, you could test whether the output is a dataframe object. Also, you might call `tweepy.Cursor()` to create a `tweepy.cursor.ItemIterator` object for use in your tests.

Similarly, I suggest adding one more output test for the function `sentiment_total()`. The expected output is a dataframe, so it is entirely feasible to test whether the output is exactly what we want.
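A sketch of what these two tests might look like (pytest-style; the import path is assumed, `sample_timeline`, `sample_data`, and `sample_lexicon` are hypothetical fixtures, and the expected frame is an illustrative placeholder rather than the package's real output):

```python
import pandas as pd

from tweepyclean import raw_df, sentiment_total


def test_raw_df_returns_dataframe(sample_timeline):
    # Type check suggested above; `sample_timeline` would be a recorded
    # or mocked tweepy.cursor.ItemIterator provided by a fixture.
    result = raw_df(sample_timeline)
    assert isinstance(result, pd.DataFrame)


def test_sentiment_total_output(sample_data, sample_lexicon):
    # Because sentiment_total returns a dataframe, the output can be
    # compared exactly. The expected values here are placeholders.
    expected = pd.DataFrame(
        {
            "sentiment": ["anger", "fear"],
            "word_count": [1, 1],
            "total_words": [4, 4],
        }
    )
    result = sentiment_total(sample_data, sample_lexicon)
    pd.testing.assert_frame_equal(
        result.reset_index(drop=True), expected
    )
```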

Functionality

I tried to run each of the functions based on the test file. Except for the function `raw_df()` (I am not sure how I should construct the input for this function), everything works as expected. This is great!

Overall, you did a great job putting all of this together. Thanks for all the hard work. I hope my suggestions help improve your package in the future.