Submission: nlpsummarizer (R)

name: nlpsummarizer
about: An R package that generates summaries of textual data frames.

Submitting Author: Vignesh Chandrasekaran (@vigchandra), Karlos Muradyan (@KarlosMuradyan), Karanpal Singh (@singh-karanpal), Sam Chepal (@schepal)
Repository: https://github.com/UBC-MDS/nlpsummarizer
Version submitted: 1.1.1
Editor: Varada (@kvarada)
Reviewer 1: Jarome Leslie (@jsleslie)
Reviewer 2: Chun Hin Trevor Kwan (@trevor77)
Archive: TBD
Version accepted: TBD

Paste the full DESCRIPTION file inside a code block below:


### Overview:

One of the most relevant applications of machine learning for corporations globally is natural language processing (NLP). Whether it is parsing business documents to extract key word segments or detecting Twitter sentiment about a certain product, NLP use cases are prevalent in virtually every business environment.

Our library makes extensive use of existing packages in the R ecosystem. We will use the textcat and openNLP libraries to build most of the sentiment analysis functions, while also leveraging well-known packages such as tidyverse to aid in the presentation of the final output.

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [ ] data retrieval
- [ ] data extraction
- [X] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] database software bindings
- [ ] geospatial data
- [X] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):

Few tools today provide summary statistics on textual data that a user may want to analyze. Our goal with this package is to provide users with a simple and flexible tool for gathering key insights during the exploratory data analysis phase of the data science workflow.

Who is the target audience and what are scientific applications of this package?

Any data scientist who works with textual data could use this package to get quick summaries of that data.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

To the best of our knowledge, no other package combines all of the functionality described below in one place.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Technical checks

Confirm each of the following by checking the box.

[X] I have read the guide for authors and rOpenSci packaging guide.

This package:

[X] does not violate the Terms of Service of any service it interacts with.
[X] has a CRAN and OSI accepted license.
[X] contains a README with instructions for installing the development version.
[X] includes documentation with examples for all functions, created with roxygen2.
[X] contains a vignette with examples of its essential functions and uses.
[X] has a test suite.
[X] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[X] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Options

- [ ] The package has an **obvious research application** according to [JOSS's definition](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements).
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:
- (*Do not submit your package separately to JOSS*)

[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[X] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 3 hours including installation

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

nlpsummarizer does a great job of addressing a real need in machine learning applications, particularly in natural language processing, because there are currently few packages that provide summary statistics on textual data. The package has four main functions: detecting the language of the text, breaking the text into counts of parts of speech, breaking the text into polarity sentiment categories, and reporting the proportion of sentences and stopwords.

Some user interface and documentation improvements could be made here. After installing the package, I followed the instructions in the README but was only able to run 2 of the 4 functions. I would suggest updating the README so that it reflects the actual function names. For example, the "get_language" function is really the "detect_language" function, and the "get_polarity" function is really the "polarity" function. After contacting one of the team members, I was able to run all 4 functions using the example code provided. This was very well done! The examples in the README were clear and easy to understand.

Furthermore, the code does comply with general principles in the Mozilla reviewing guide in that the functions are as simple as possible, the code is efficient, the usage of each function is clear, and edge cases have been considered. I did not catch any code duplication in the package that should be reduced.

Overall, the functions worked relatively well across a variety of inputs; however, some improvements could be made. For example, when an integer is passed to the detect_language function, an NA value is returned, when it would be better to raise an error message telling the user what the input problem was. When given an acronym as input, the detect_language function was unable to detect English acronyms as English and instead detected them as Spanish.
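To illustrate the kind of input check suggested above, here is a minimal sketch. The helper name `assert_text_input` is hypothetical and is not part of nlpsummarizer; it only shows one way a function could fail early with an informative message instead of returning NA.

```r
# Hypothetical helper for illustration only; not part of nlpsummarizer.
# Stops with a descriptive error unless `text` is a non-empty character vector.
assert_text_input <- function(text) {
  if (!is.character(text) || length(text) == 0) {
    stop(
      "`text` must be a non-empty character vector, but got an object of class '",
      class(text)[1], "'.",
      call. = FALSE
    )
  }
  invisible(text)
}

# A valid string passes through silently.
assert_text_input("This is English text.")

# An integer input raises an error; here the message is captured and printed.
result <- tryCatch(assert_text_input(5L), error = function(e) conditionMessage(e))
print(result)
```

A check like this could be called at the top of each exported function so that all four functions report bad input in the same way.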

In conclusion, really well done on creating a package that is genuinely useful for natural language processing. Although the README documentation could be improved and edge case coverage could be better for certain functions, nlpsummarizer was a pleasure to work with and has a lot of potential to become a widely used package!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[x] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1 hour

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

I think the idea of creating the NLPSummarizer package to support the analysis of text data is a good one. To my knowledge, each of the provided functions offers a specific utility not currently available in R. This tool would therefore be quite useful for NLP-related analyses, and I will keep it in mind when such a task arises.

In terms of improvements, some attention could be given to the README.md file. A few proofreading errors made it through to the current version. For example: (1) the package name in line 1 “NLPSummmarizer” should be “NLPSummarizer”; (2) the one-line description of the package is not a complete sentence; and (3) the first line in the installation section is not a complete sentence.

Another area for improvement is the consistency of function names. The actual function names do not match the function calls in the function and example sections of the README.md file. Specifically, get_language() could be changed to detect_language(), get_polarity() could be changed to polarity(), and summary_4() could be changed to sentence_stopwords_freq().

In terms of general functionality, all of the functions ran without errors and produced helpful output. The polarity() function provides the number of positive and negative words but does not include the number of neutral words mentioned in the example.

The get_part_of_speech() function actually does a lot more than the example would suggest. It returns twice as many parts of speech as the five shown in the example, so this part of the documentation could be refined. From testing, it appears that the number of columns returned depends on the diversity of word types in the input text.

Lastly, the output of summary_4() or sentence_stopwords_freq() could be amended to exclude the high-frequency words column, which has not been implemented.
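If a stable set of output columns is preferred regardless of the input text, one option is to count tags against a fixed list of categories. The sketch below is only an illustration: `count_pos`, `pos_levels`, and the five-tag set are hypothetical and do not reflect how get_part_of_speech() is implemented.

```r
# Hypothetical, simplified tag set for illustration only.
pos_levels <- c("noun", "verb", "adjective", "adverb", "pronoun")

# Count how often each tag in `levels` occurs in `tags`.
# Tags absent from the input get a count of 0, so the result always has
# one column per level; tags not listed in `levels` are ignored.
count_pos <- function(tags, levels = pos_levels) {
  counts <- vapply(levels, function(lv) sum(tags == lv), integer(1))
  as.data.frame(as.list(counts))
}

# Both calls return a one-row data frame with the same five columns.
print(count_pos(c("noun", "verb", "noun")))
print(count_pos(character(0)))
```

With this approach the column layout is the same for every input, which can make the output easier to document and to combine across documents.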

The above changes would add some polish to this package which already does a decent job in tackling the analysis of text data. Good job guys!

Thank you for the feedback, everyone. As per your suggestions, we edited the documentation in our DESCRIPTION file to clarify the intended purpose of our functions. We also renamed summary_4() to sentence_stopword(), as this is a more intuitive name for the function. Other minor typos and syntax errors were addressed to ensure the code works properly. Please see the final release here.
