Group 19 Submission: textfeatureinfor (R)

name: textfeatureinfor
about: This R package extracts information from text features, which can be useful for feature engineering or in other data science projects

Submitting Authors:

Jacqueline Chong (@Jacq4nn)
Kiran Phaterpekar (@kphaterp)
Lynn Wu (@lynnwbl)
Paniz Fazlali (@paradise1260)

Repository: textfeatureinfor
Version submitted: 0.0.0.9
Submission type: Standard
Editor: RB
Reviewers:

Khalid Abdilahi (@khalidcawl)
Mao Lisheng (@nickmao1994)
Joshua Sia (@joshsia)
Nico Van den Hooff (@nicovandenhooff)

Archive: TBD
Version accepted: TBD

Package: textfeatureinfor
Title: Text Features
Version: 0.0.0.9000
Authors@R: 
    c(person(given = "Lynn",
           family = "Wu",
           role = c("aut", "cre"),
           email = "lynnwbl@gmail.com"),
    person(given = "Kiran",
           family = "Phaterpekar",
           role = "aut",
           email = "kphaterp@student.ubc.ca"),
    person(given = "Jacqueline",
           family = "Chong",
           role = "aut",
           email = "jacqann@student.ubc.ca"),
    person(given = "Paniz",
           family = "Fazlali",
           role = "aut",
           email = "paniz.fazlali@gmail.com"))
Description: Package to extract interesting details about text.
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.2
Imports: 
    rapportools,
    stopwords,
    stringr,
    stringi
Suggests: 
    testthat (>= 3.0.0)
Config/testthat/edition: 3

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [x] data extraction
- [ ] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):

Our package lets users count the punctuation marks in a text, calculate the average word length, compute the percentage of fully capitalised words, and remove stopwords.

Who is the target audience and what are scientific applications of this package?

Data scientists and casual programmers who would like to perform basic text feature engineering with fewer lines of code.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

Yes. textfeatures, qdap and stopwords are some of the well-established packages. Our package aims to combine and simplify common text feature engineering steps into single functions, reducing the number of lines of code.

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
Explain reasons for any pkgcheck items which your package is unable to pass.

Technical checks

Confirm each of the following by checking the box.

[x] I have read the guide for authors and rOpenSci packaging guide.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[ ] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[ ] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[ ] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[ ] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s): demonstrating major functionality that runs successfully locally
[ ] Function Documentation: for all exported functions
[ ] Examples: (that run successfully locally) for all exported functions
[ ] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[ ] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software have been confirmed.
[ ] Performance: Any performance claims of the software have been confirmed.
[ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing:

[ ] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[ ] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 1.5 hours

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Running avg_word_len("it's me") returns 3, which is unexpected, and I am not sure what is causing it. I would have expected either about 1.67 if the three "words" are "it", "s" and "me", or 2.5 if the words are "its" and "me".
Running perc_cap_words("I") returns 0, which is unexpected behaviour. This could be because the function uses stringr::str_count(text, "\\b[A-Z]{2,}\\b"), which only matches words of at least 2 characters. As a result, perc_cap_words("I AM A BOY") returns 50 instead of 100.
It would be great to add automated testing, which I'm sure you will include soon!
It would be nice to add the Contributing and License sections to the README so that it is clear how you want other people to work on the package.
It would be nice to have a vignette which demonstrates use of the functions in a single file and maybe even host it online.
In the future, I think it would be nice to include functions that compute the percentage of words in all lower case, and maybe the median word length.
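
The two unexpected results above can be reproduced outside the package with a minimal base-R sketch. The helper names `avg_len_sketch` and `perc_cap_sketch`, and their arguments `join_apostrophe` and `min_len`, are hypothetical and are not the package's implementation; they only illustrate how the choice of tokenisation and of regex quantifier changes the result.

```r
# Hypothetical helpers for illustration only; not the package's code.

# Average word length, where a "word" is a run of ASCII letters.
# With join_apostrophe = TRUE, apostrophes are dropped first,
# so "it's" is treated as the single word "its".
avg_len_sketch <- function(text, join_apostrophe = FALSE) {
  if (join_apostrophe) {
    text <- gsub("'", "", text, fixed = TRUE)
  }
  words <- regmatches(text, gregexpr("[A-Za-z]+", text, perl = TRUE))[[1]]
  if (length(words) == 0) {
    return(0)
  }
  mean(nchar(words))
}

# Percentage of words written entirely in capitals.
# min_len = 2 mirrors the {2,} quantifier quoted in the review;
# min_len = 1 also counts one-letter words such as "I" and "A".
perc_cap_sketch <- function(text, min_len = 2) {
  words <- regmatches(text, gregexpr("[A-Za-z]+", text, perl = TRUE))[[1]]
  if (length(words) == 0) {
    return(0)
  }
  pattern <- paste0("^[A-Z]{", min_len, ",}$")
  100 * sum(grepl(pattern, words, perl = TRUE)) / length(words)
}

avg_len_sketch("it's me")                          # about 1.67
avg_len_sketch("it's me", join_apostrophe = TRUE)  # 2.5
perc_cap_sketch("I AM A BOY")                      # 50
perc_cap_sketch("I AM A BOY", min_len = 1)         # 100
```

Neither tokenisation yields 3 for "it's me", which suggests the original result came from something other than these two splitting rules.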

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing:

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Well done team! I am sure the project is in progress and you are improving it as I write this review. Here are my comments:

Consider adding customizability to the list of punctuation characters, allowing the user to specify their own (or additional) characters in https://github.com/UBC-MDS/textfeatureinfor/blob/b863b56bab2c00b82c8a05bd8545096d9958b5bd/R/textfeatureinfor.R#L22
Instead of duplicating the list of punctuation characters, maybe read it from a central place such as a function or a global variable. This will make the code less error-prone, since nobody can forget to update one of the copies when adding a character.
I don't think else here is necessary since you're returning from the function when the previous if condition is true: https://github.com/UBC-MDS/textfeatureinfor/blob/b863b56bab2c00b82c8a05bd8545096d9958b5bd/R/textfeatureinfor.R#L72
Include CONTRIBUTING and LICENSE in the README
Perhaps in the future you can add a function that counts the percentage of words that have exclamation marks. This could be useful for sentiment analysis projects.
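
The first three suggestions above could be combined along the following lines. This is a sketch under assumptions, not the package's code: the constant `DEFAULT_PUNCTUATION`, the function `count_punc_sketch` and its `extra` argument are hypothetical names, and the default character set shown is illustrative only.

```r
# Hypothetical names throughout; illustrative only.

# Single source of truth for the default punctuation characters,
# so every function reads the same list.
DEFAULT_PUNCTUATION <- c("!", "?", ".", ",", ";", ":", "'", "\"", "-", "(", ")")

count_punc_sketch <- function(text, extra = character(0)) {
  if (!is.character(text) || length(text) != 1 || is.na(text)) {
    stop("`text` must be a single, non-missing string")
  }
  # Early return: no `else` branch is needed after it.
  if (nchar(text) == 0) {
    return(0L)
  }
  # Users can extend the defaults without editing the package.
  punctuation <- union(DEFAULT_PUNCTUATION, extra)
  chars <- strsplit(text, "", fixed = TRUE)[[1]]
  sum(chars %in% punctuation)
}

count_punc_sketch("Hello, world!")         # 2
count_punc_sketch("a #tag!")               # 1
count_punc_sketch("a #tag!", extra = "#")  # 2
```

Keeping the defaults in one constant means a newly added character is picked up everywhere at once, and the `extra` argument covers the customizability request without changing the default behaviour.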

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors. None
[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s): demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all exported functions
[X] Examples: (that run successfully locally) for all exported functions
[X] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[X] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software have been confirmed.
[X] Performance: Any performance claims of the software have been confirmed.
[X] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[X] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 30 minutes

[X] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Overall well done, this package definitely seems useful for NLP tasks and could assist with feature engineering. I understand that the package is in progress so you may already be working on some of the comments below.

The Codecov badge says "unknown".
Rather than commenting out install.packages("devtools") in the installation section of the README, you could say that the user needs to have devtools installed in order to install your package this way.
I like the type checks in each of your functions; these are well done.
As noted by Khalid and Josh, it would be good to add Contributing and License sections to your README.
You could add your names to your README as authors (similar to your Python package).
You could consider combining all of your tests into one file if your package were to grow in size, since having a separate test file for each function would then create a lot of files.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 1hr

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Overall well done team! R does not seem to have a library for extracting punctuation, so building the R package might have been a little harder. Here are my comments:

In the remove_stop_words function, you are using "stopwords-iso" as the source of stop words. It would be good to mention this in the documentation.
count_punc returns an error when dealing with "\". For example, count_punc("\") returns an error and count_punc("\\") returns zero.
As mentioned by Khalid, repeating the list of punctuation characters in your code violates the DRY principle.
I submitted the review quite late and I found avg_word_len("it's me.") now correctly returns 1.66667. Good job!
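
One note on the backslash case above: in R source code a backslash inside a string literal starts an escape sequence, so "\" is an unterminated string that the parser rejects before any function is called, while "\\" is a one-character string holding a single backslash. The base-R sketch below illustrates both points; `count_backslashes` is a hypothetical helper, not part of the package, and whether count_punc should treat a backslash as punctuation is for the authors to decide.

```r
# "\\" in R source is a one-character string: a single backslash.
single_backslash <- "\\"
nchar(single_backslash)  # 1

# The source text count_punc("\") cannot even be parsed, because \"
# is read as an escaped quote and the string never ends.
parse_fails <- tryCatch(
  {
    parse(text = 'count_punc("\\")')
    FALSE
  },
  error = function(e) TRUE
)
parse_fails  # TRUE

# Hypothetical helper: count literal backslashes in a string.
# fixed = TRUE treats the pattern as a literal character, not a regex.
count_backslashes <- function(text) {
  hits <- gregexpr("\\", text, fixed = TRUE)[[1]]
  sum(hits > 0)
}

count_backslashes("\\")      # 1
count_backslashes("a\\b\\c") # 2
count_backslashes("abc")     # 0
```

So the error for "\" comes from the R parser rather than from the package, whereas the zero for "\\" indicates that the backslash is not in the current punctuation list.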

UBC-MDS / software-review-2022