UBC-MDS / software-review-2021


Submission: tweepyclean (Python) #3


calsvein commented 3 years ago

Submitting Author: Nash Makhija (@nashmakh), Matt (@MattTPin), Syad Khan (@syadk), Cal Schafer (@calsvein)
Package Name: tweepyclean
One-Line Description of Package: add-on functions to the tweepy package for Twitter data processing, word counts, and sentiment analysis
Repository Link: https://github.com/UBC-MDS/tweepyclean/tree/0.3
Version submitted: 0.3
Editor: Tiffany Timbers (@ttimbers)
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD


Description

tweepyclean is a Python package that processes data generated by the existing Tweepy package, producing clean data frames, data summaries, and new features.

Tweepy is a package built around Twitter's API and is used to scrape tweet information from their servers.

Our package provides functions that process the raw data from Tweepy into a more understandable format by extracting and organizing the contents of a user's tweets. tweepyclean is specifically built for analyzing a particular user's timeline (generated using tweepy's `api.user_timeline` function). Users can visualize average engagement by time of day posted, view basic summary statistics of word contents and tweet sentiment, and obtain a processed dataset for use in machine learning models.
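For concreteness, a minimal sketch of that workflow (the credentials are placeholders, and the `raw_df` function name is taken from the review comments further down this thread):

```python
import tweepy
import tweepyclean

# Authenticate with the Twitter API (placeholder credentials).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# Pull a user's timeline with tweepy, then hand the iterator to
# tweepyclean for processing into an analysis-ready dataframe.
tweets = tweepy.Cursor(
    api.user_timeline, id="some_user", tweet_mode="extended"
).items()
clean_dataframe = tweepyclean.raw_df(tweets)
```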

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see the notes on categories in our guidebook.

The tweepy package extracts tweet data, but not in a format that is ready for analysis. tweepyclean provides functions to convert tweepy-extracted data into a machine-readable dataframe, perform feature engineering, and create summary statistics and basic visualizations.

The package is intended strictly for those who are already using the tweepy package and have a Twitter API key.

Not that we are aware of.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option allows reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser text-based review. It also allows you to demonstrate addressing each issue via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here.

Editor and Review Templates

Editor and review templates can be found here

anodaini commented 3 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements

The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 2


Review Comments

yuyanguo commented 3 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements

The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 2.5


Review Comments

Hi team,

I enjoyed reviewing your amazing package. tweepyclean is very creative and interesting! Please find my comments below:

Installation

By running the installation command in the README, I got the following error:

ERROR: Could not find a version that satisfies the requirement textstat<0.8.0,>=0.7.0 (from tweepyclean) (from versions: 0.4.1, 0.5.0, 0.5.1, 0.5.2)
ERROR: No matching distribution found for textstat<0.8.0,>=0.7.0 (from tweepyclean)

This is because some of your package dependencies are not on TestPyPI, so I suggest adding the `--extra-index-url` argument to pull the dependencies from PyPI as follows: `pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple tweepyclean`

Features

Users of your package might not be completely clear on how some functions in tweepy work (I was not). It would be better to add a link to the documentation for `tweepy.Cursor()` to explain the following sentence from the Features section of the README: "The ability to generate a dataframe from the `tweepy.cursor.ItemIterator` object returned by calling `tweepy.Cursor(api.user_timeline, id=username, tweet_mode='extended').items()` with the tweepy package."

More importantly, I found that the README description of the function `sentiment_total()` is inconsistent with the source code and docstring: the README says it returns a line chart, but the function actually returns a dataframe.

Usage

I found the Usage section of the README a little too general and hard to follow. It would be better to include an actual example for each function along with its output, so that users can try it themselves and get a complete picture of the package. First, I recommend including `import tweepy` to make the section self-contained. Second, I recommend defining the objects `tweets`, `data`, and `clean_dataframe`. Finally, as mentioned above, I would highly recommend showing some example output plots and dataframes; see the sketch below.
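For instance, a Usage section along these lines would make the expected objects concrete (a sketch only; the credentials are placeholders, and the exact form of the `lexicon` argument to `sentiment_total` is assumed, since this review does not specify it):

```python
import tweepy
import tweepyclean

# Authenticate (placeholder credentials) and pull a user's timeline.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)
tweets = tweepy.Cursor(
    api.user_timeline, id="some_user", tweet_mode="extended"
).items()

# Process the raw timeline into a clean, analysis-ready dataframe.
clean_dataframe = tweepyclean.raw_df(tweets)

# Summarize sentiment. Per the source code and docstring, this returns
# a dataframe, not a line chart. `lexicon` stands in for whatever
# sentiment lexicon the function expects.
lexicon = ...
sentiment_df = tweepyclean.sentiment_total(clean_dataframe, lexicon)
```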

Similarly, the Usage section's description of the function `sentiment_total(data, lexicon)` is inconsistent with the source code and docstrings.

Docstring & Documentation

Most of the docstrings are not rendered properly on readthedocs. For example, the table in the `sentiment_total()` docstring comes out flattened, something like:

3 x 5 sentiment word_count total_words <chr> <int> <dbl> anger 1 4 disgust 2 4 fear 1 4 negative 2 4 sadness 1 4

This is because the table in the docstring cannot be rendered properly as written. Also, for many of the functions, the examples are rendered as sub-bullet points of the returns.

Moreover, the examples in your docstrings seem incomplete. For example, instead of just using `>>> raw_df(tweets)`, I suggest defining the object `tweets` in the example as well.
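One way to address both the rendering and the completeness issues is to put the table behind a reST literal block (a line ending in `::` followed by indented text) and leave blank lines between numpydoc sections so the examples don't nest under the returns. A sketch, with the table contents copied from the flattened rendering above:

```python
def sentiment_total(data, lexicon):
    """Summarize sentiment word counts for a processed tweet dataframe.

    Returns
    -------
    pandas.DataFrame
        One row per sentiment, rendered verbatim thanks to the
        literal block introduced by the double colon::

            sentiment  word_count  total_words
            anger               1            4
            disgust             2            4
            fear                1            4
            negative            2            4
            sadness             1            4

    Examples
    --------
    >>> tweets = tweepy.Cursor(api.user_timeline, id="some_user",
    ...                        tweet_mode="extended").items()
    >>> data = raw_df(tweets)
    >>> sentiment_total(data, lexicon)
    """
```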

Tests

I suggest writing more tests for the function `raw_df()`. For example, you could test whether the output is a dataframe object. Also, you might call `tweepy.Cursor()` to create a `tweepy.cursor.ItemIterator` object for use in your tests.

Similarly, I suggest adding one more output test for the function `sentiment_total()`. The expected output is a dataframe, so it is entirely feasible to test whether the output is exactly what we want.
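A sketch of what these two tests might look like (pytest-style; the import path is assumed, `sample_timeline`, `sample_data`, and `sample_lexicon` are hypothetical fixtures, and the expected frame is an illustrative placeholder rather than the package's real output):

```python
import pandas as pd

from tweepyclean import raw_df, sentiment_total


def test_raw_df_returns_dataframe(sample_timeline):
    # Type check suggested above; `sample_timeline` would be a recorded
    # or mocked tweepy.cursor.ItemIterator provided by a fixture.
    result = raw_df(sample_timeline)
    assert isinstance(result, pd.DataFrame)


def test_sentiment_total_output(sample_data, sample_lexicon):
    # Because sentiment_total returns a dataframe, the output can be
    # compared exactly. The expected values here are placeholders.
    expected = pd.DataFrame(
        {
            "sentiment": ["anger", "fear"],
            "word_count": [1, 1],
            "total_words": [4, 4],
        }
    )
    result = sentiment_total(sample_data, sample_lexicon)
    pd.testing.assert_frame_equal(
        result.reset_index(drop=True), expected
    )
```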

Functionality

I tried to run each of the functions based on the test file. Except for the function `raw_df()` (I am not sure how I should construct the input for this function), everything works as expected. This is great!

Overall, you did a great job putting all of this together. Thanks for all the hard work. I hope my suggestions help improve your package in the future.