Submission Group 21: textprepr (R)

Package Name: textprepr

One-Line Description of Package: Text preprocessing functions specifically designed for tweet data.

Submitting Author Name/ Github Handle:

Arijeet Chatterjee @arijc76,
Joshua Sia @joshsia,
Melisa Maidana @mmaidana24318,
Philson Chan @PhilsChan

Repository: https://github.com/UBC-MDS/textprepr

Version submitted: v1.0.0

Submission type: Standard

Editor: @arijc76, @joshsia, @mmaidana24318, @PhilsChan

Reviewers:

Luke Collins (@LukeAC)
Kyle Ahn (@AraiYuno)
Tianwei Wang ()
Linhan Cai (@lipcai)
Language: en

Package: textprepr
Title: Performs Pre-Processing of Tweets
Version: 0.0.0.9000
Authors@R: 
    person(given = "Arijeet",
           family = "Chatterjee",
           role = c("aut", "cre"),
           email = "arijc@student.ubc.ca")
    person(given = "Joshua",
           family = "Sia",
           role = c("aut", "cre"),
           email = "joshuasia2000@gmail.com")
    person(given = "Melisa",
           family = "Maidana",
           role = c("aut", "cre"),
           email = "placeholder@student.ubc.ca")
    person(given = "Philson",
           family = "Chan",
           role = c("aut", "cre"),
           email = "philsonchan@gmail.com")
Description: Functions which offer additional text preprocessing functionality
    specifically designed for tweets.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.2
Suggests: 
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Imports: 
    wordcloud,
    stringr,
    RColorBrewer,
    purrr,
    stopwords

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [X] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [X] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences): The package bundles functions to help with cleaning and gaining insight into tweet data, providing additional resources for EDA and enabling feature engineering.
Who is the target audience and what are scientific applications of this package? This package is for people interested in performing data analysis on Twitter data.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? There are no similar R packages available.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research? N/A
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted. N/A
Explain reasons for any pkgcheck items which your package is unable to pass. N/A

Technical checks

Confirm each of the following by checking the box.

[X] I have read the guide for authors and rOpenSci packaging guide.

This package:

[X] does not violate the Terms of Service of any service it interacts with.
[ ] has a CRAN and OSI accepted license.
[X] contains a README with instructions for installing the development version.
[X] includes documentation with examples for all functions, created with roxygen2.
[ ] contains a vignette with examples of its essential functions and uses.
[X] has a test suite.
[ ] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[X] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing:

1 hour

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Unit tests run/verified via local instance of package repo.

Summary: Really cool package! I did not run into any issues with installation.

I noticed the extract_ngram function might be missing some possible n-grams. For example:

    > textprepr::extract_ngram(c("one", "two", "three", "four"), n=2)
    [1] "one two" "two three" "three four"

Is "four one" also expected to be a valid n-gram returned by this function?
Could it potentially be worthwhile stripping numbers (in addition to punctuation) from tweet/text data?
It would be cool to integrate this package with one of the MDS groups whose package focuses on querying Twitter for tweet data. That way we could get an idea for how a 'real' wordcloud would look with real data.
Excellent function documentation and examples given for how to use the functions!
Could be worthwhile to delete unused/old branches which are no longer active ¯\_(ツ)_/¯. Not required for any functional reason, it's just good practice to keep project branching well organized.
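
On the n-gram question above: a conventional n-gram extractor slides a window of n consecutive tokens over the input, so a wrap-around pair such as "four one" would not normally be produced. A minimal base R sketch of that convention follows; `sliding_ngrams()` is a hypothetical helper written for illustration and is not textprepr's implementation.

```r
# Hypothetical helper, not part of textprepr: builds n-grams from a
# character vector of tokens using a sliding window (no wrap-around).
sliding_ngrams <- function(tokens, n = 2) {
  stopifnot(is.character(tokens), n >= 1)
  # Fewer tokens than n means no complete window exists.
  if (length(tokens) < n) {
    return(character(0))
  }
  # One window starts at each position that still has n tokens ahead of it.
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

bigrams <- sliding_ngrams(c("one", "two", "three", "four"), n = 2)
print(bigrams)
```

With four tokens and n = 2 this yields three bigrams, matching the output quoted above. Whether a wrap-around pair is wanted is a design decision for the authors.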

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 2 hours

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

A cool and useful package!

Here are some suggestions:

There are repeated if statements in the extract_ngram.R file:

    if (length(tweets) < n) {
      stop("length of ngrams should be less than number of words in vector of tweets")
    }
    if (!is.character(tweets)) {
      stop("input should be a character vector")
    }

It would be better to add more explanatory comments inside the function bodies.
If a tweet contains something like "#s#AusOpen", should the extract_hashtags() function return s#AusOpen or sAusOpen?
It would be better to provide test cases covering more input data types in test-extract_ngram.R.
It would be better to show some badges.
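
The remarks on the repeated if statements and on testing more input types can be read together: keep a single validation step that checks the type before the length, then exercise it with several kinds of input. The sketch below uses only base R. `check_ngram_input()` and `error_message()` are hypothetical names, the messages are copied from the snippet quoted above, and none of this is taken from the package source.

```r
# Hypothetical validator, not from textprepr: the type check runs first so
# that non-character input fails with the more specific message.
check_ngram_input <- function(tweets, n) {
  if (!is.character(tweets)) {
    stop("input should be a character vector")
  }
  if (length(tweets) < n) {
    stop("length of ngrams should be less than number of words in vector of tweets")
  }
  invisible(TRUE)
}

# Returns the error message raised while evaluating expr, or NA if none.
error_message <- function(expr) {
  tryCatch({
    expr
    NA_character_
  }, error = function(e) conditionMessage(e))
}

# Try several input types, as the review suggests for the test file.
results <- c(
  numeric   = error_message(check_ngram_input(c(1, 2, 3), 2)),
  list      = error_message(check_ngram_input(list("a", "b"), 2)),
  null      = error_message(check_ngram_input(NULL, 2)),
  too_short = error_message(check_ngram_input("one", 2)),
  valid     = error_message(check_ngram_input(c("one", "two"), 2))
)
print(results)
```

Each case maps naturally onto one expectation in a test file, so the same inputs could be reused in test-extract_ngram.R.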

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 35 minutes

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

First of all congratulations on creating a wonderful and useful package. The team has done a great job and I really found the documentation, docstrings and examples to be very good. They easily guided me through the process of working with your package. I have added my recommendations below as minor changes I feel could make this already very good package slightly better:

It would be better to include the code maintainer's email in the Contributing.md file. People who are interested in contributing need to know who to contact.
It would be useful to have visualization tools for plotting. Since the package aims to help users gain insight into tweet data, a plot could be very helpful. This would require significantly more work to build and test, so it's understandable to keep things simple, but it's certainly an opportunity for improvement.
Your README does not contain a code coverage badge, which would be nice to display since it's obvious you put in a lot of work to write your tests and your coverage is quite good!
Having brief examples in the README.md file can be helpful for people who want to quickly get an idea of what the package does. It would be better to show example output in the README as well.
The names of the functions are very informative.

In general, this is great work and I enjoyed using your package!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors: NONE
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[ ] Function Documentation: for all exported functions
[ ] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing:

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

The extract_hashtags() function does not seem to handle special characters or multiple #s.
No badges are shown in README.md. I believe showing the CI & CD badges would give users more confidence in the package, because missing badges could imply a lack of maintenance.
Great work separating the 4 different functions into 4 different files for both the implementation and unit tests. It does not matter whether you have one file or 4, but it is important to follow the same patterns.
I am not able to find the example usage of some functions. I can only find the example usage for remove_punct() function in the documentation website.
Great unit tests in test-generate_cloud.R. They test not only the input parameters but also the behaviour of the function. This would prevent someone from accidentally changing the behaviour of the function in the wrong way!
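
The hashtag concern raised above can be made concrete with a small base R sketch. It assumes a hashtag is `#` followed by one or more word characters. That definition and the helper name `find_hashtags()` are assumptions for illustration, not the behaviour of `extract_hashtags()`.

```r
# Hypothetical helper, not textprepr's extract_hashtags(): treats a hashtag
# as "#" followed by one or more word characters (letters, digits, "_").
find_hashtags <- function(tweet) {
  stopifnot(is.character(tweet), length(tweet) == 1)
  matches <- regmatches(tweet, gregexpr("#\\w+", tweet, perl = TRUE))[[1]]
  # Drop the leading "#" so only the tag text is returned.
  sub("^#", "", matches)
}

multiple_hashes <- find_hashtags("Great match today #s#AusOpen")
special_chars <- find_hashtags("Semi-final!! #Aus-Open ##tennis")
print(multiple_hashes)
print(special_chars)
```

Under this definition "#s#AusOpen" splits into two tags, s and AusOpen, a hyphen ends a tag early, and a doubled "##" keeps only the tag after the second "#". The authors may prefer a different rule; documenting the chosen one and adding tests for such inputs would remove the ambiguity.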

Team 21 is rocking! Great work :>

UBC-MDS / software-review-2022