UBC-MDS / software-review-2022

0 stars 0 forks source link

Submission Group 21: pytextprep (Python) #36

Open arijeetchatterjee opened 2 years ago

arijeetchatterjee commented 2 years ago

Submitting Authors: Philson Chan (@PhilsChan)
Melisa Maidana (@mmaidana24318) Arijeet Chatterjee (@arijc76) Joshua Sia (@joshsia)

Package Name: pytextprep One-Line Description of Package: Python package that offers additional text preprocessing functionality specifically designed for tweets Repository Link: https://github.com/UBC-MDS/pytextprep Version submitted: v1.0.5 Editor: TBD

Reviewer 1: Luke Collins (@LukeAC) Reviewer 2: Kyle Ahn (@AraiYuno) Reviewer 3: Tianwei WANG () Reviewer 4: Linhan Cai (@lipcai)


Description

This is a Python package that offers additional text preprocessing functionality specifically designed for tweets. The package bundles functions to help with cleaning and gaining insight into tweet data, providing additional resources for EDA or enabling feature engineering.

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks - [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

P.S. *Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

LukeAC commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

OS: Windows 10

Could not perform primary method of installation - encountered the following error ERROR: Failed building wheel for wordcloud

Could not perform secondary method of installation - encountered the following error

git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Final approval (post-review)

Estimated hours spent reviewing:

30 minutes.

Review Comments

Summary: Looks like a very cool package - unfortunately wasn't able to test it out due to some issues with installation. Will circle back to this once resolved!

  1. It looks as though there is an unused/empty python script that could be removed to clean up the project source code. https://github.com/UBC-MDS/pytextprep/blob/main/src/pytextprep/pytextprep.py
  2. Installation failed on Windows 10 following primary installation instructions (i.e. pip install pytextprep) due to dependency wordcloud.
  3. Installation failed on Windows 10 following secondary installation instructions. Public key to git repo not accepted.
  4. Perhaps include additional badge to the top of README indicating state of code coverage image
  5. I think it could be beneficial to group similar functions (e.g. extract_hashtags, extract_ngrams) into single python modules. This would reduce the number of import statements required when using this package.
  6. Based on the example provided in the README, it's unclear to me whether the manual data processing steps (extract hashtag, remove punctuation, etc) are required in order to generate a word cloud. If so, would it be possible to handle this text processing in the word-cloud-generating function itself, so that all the user would need to do to produce a word cloud is simply call the word-cloud function?
joshsia commented 2 years ago

@LukeAC Thanks for your feedback!

Regarding the installation failure, we've updated our README to include clearer instructions on how to use our package. We think that the failure could be because pip install wordcloud was not working for some reason. As a workaround, you can install wordcloud using conda install -c conda-forge wordcloud -y in the virtual environment and then run pip install pytextprep.

Let us know if this works for you!

Also, to generate a word cloud, you do not need to pass the tweets through the other functions first. It's all handled in the word cloud function itself.

Davidwang11 commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 2 hours


Review Comments

A cool and useful package. I followed your suggestion to install the package in windows platform, and it did work.

Here are some suggestions:

  1. The extract_ngram() function does not check whether the n parameter is an integer.
  2. If something like "#s#AusOpen" contained in the tweets, the extract_hashtags() function should return s#AusOpen or sAusOpen?
  3. It will be better to show some more badges.
  4. It will be better to add more explaining comments in the code block of the remove_punct() function.
  5. It is a little bit confusing about the installation and usage in the README file. Both containing installation instructions.
lipcai commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1 hour and 25 minutes

Review Comments

First of all congratulations on creating a wonderful and useful package. The team has done a great job and I really found the documentation, docstrings and examples to be very good. They easily guided me through the process of working with your package. I have added my recommendations below as minor changes I feel could make this already very good package slightly better:

  1. It would be useful to have a visualization tools for the plots, as you say gaining insight into tweet data, a math plot could be much helpful. This would require significantly more work to build and test, so its understandable to keep things simple, but it's certainly an opportunity for improvement.
  2. Your docstrings are great, however the functions could probably use a few more comments to increase readability. Not a big issue but would be nicer to read to fully understand whats going on. For example, tweets : array_like, List of tweets; n : int,Length of n-grams to be created, this could be more specific.
  3. Your ReadMe does not contain a code coverage badge, which would be nice to display since its obvious you put in a lot of work to write your tests and your coverage is quite good!
  4. I think it could be beneficial to group similar functions (e.g. extract_hashtags, extract_ngrams) into single python modules. This one I also agree and it could be more easy to run and reduce the number of import statements.
  5. Since the package was built on NumPy, they need to be referred to in the citation section in README.
  6. References: It would be good to include attribution/references to the packages, research papers or MDS material used for implementing this project.

In general, this is great work and I enjoyed using your package!

AraiYuno commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 1 hour 10 mins


Review Comments

The package looks well-structured and it was relatively easy for me to follow the installation instructions on Windows 11. I have some proposals to make the package even more appealing!

  1. What about separating the test files into 4 different files as you did in the src folder for the implementation? I was expecting 4 different testing files intuitively as you have separated the implementation files into four. Any thoughts there? image image
  2. extract_hashtags() function does not seem to be handling special characters or multiple #s. tweets_list = ["Make America Great Again! @DonalTrump", "It's a new day in ####^America"] Using the input above gave me this output: ['', '', '', '']
  3. What about adding more edge cases in the unit tests? When I look at test_extract_hashtags.py, I only see some simple basic test cases. I believe adding some common edge cases such as special characters after #, empty string, or None type would be really beneficial.
  4. empty Python file was listed as API Reference in Documentation. image image
  5. Return variables in DocString for generate_cloud.py is incorrect. In the docstring, only figure is specified as a return variable, but WordCloud object is also a return variable. image image

Great work team 21!!!