Submitting Author: Anita Li (@AnitaLi-0371), Elanor Boyle-Stanley (@eboylestanley), Junghoo Kim (@jkim222383), Ivy Zhang (@ssyayayy)
Repository: https://github.com/UBC-MDS/coRPysprofiling-R/tree/v0.3.1
Version submitted: v0.3.1
Editor: Tiffany Timbers (@ttimbers)
Reviewers: TBD

Archive: TBD
Version accepted: TBD

Paste the full DESCRIPTION file inside a code block below (to be updated once the version is finalized):

Package: coRPysprofiling
Title: R Package for EDA and EDV on text
Version: 0.0.0.9000
Authors@R: 
    c(person(given = "Anita",
             family = "Li",
             role = c("aut"),
             email = "anita.li.ubc@gmail.com"),
      person(given = "Elanor",
             family = "Boyle-Stanley",
             role = c("aut"),
             email = "elanor.boyle.stanley@gmail.com"),
      person(given = "Junghoo",
             family = "Kim",
             role = c("aut"),
             email = "jkim.222383@gmail.com"),
      person(given = "Ivy",
             family = "Zhang",
             role = c("aut", "cre"),
             email = "ivyzhang1017@hotmail.com"))
Description: coRPysprofiling is an open-source library designed to bring exploratory 
    data analysis and visualization to the domain of natural language processing. 
    Functions in the package will be used to provide some elementary statistics and 
    visualizations for a single text corpus or provide functions to compare multiple 
    corpora with each other.
License:  GPL-3
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.1.1
Suggests: 
    testthat (>= 3.0.0),
    covr,
    knitr,
    rmarkdown
Config/testthat/edition: 3
Imports: 
    stopwords,
    tokenizers,
    word2vec,
    stringr,
    here,
    stringi,
    ggplot2,
    ggwordcloud,
    dplyr
VignetteBuilder: knitr

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [X] data retrieval
- [ ] data extraction
- [ ] data munging
- [ ] data deposition
- [ ] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [X] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The core functionality of coRPysprofiling is to provide elementary statistics and visualizations for a single text corpus, and to compare multiple corpora with each other. It can also download and load pretrained word2vec models from a GitHub repository.
Who is the target audience and what are scientific applications of this package?

The target audience can be talent acquisition specialists who want to quickly retrieve valuable information from resume texts or compare text from two resumes.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?

To our knowledge, while the wordcloud library generates word cloud visualizations for a given corpus, there is no general-purpose library for exploratory analysis and visualization of a text corpus in the R ecosystem. There are several advanced libraries for comparing similarities between different corpora: most notably, quanteda provides similarity comparison between large corpora using word embeddings. We believe that coRPysprofiling will provide useful functionality for exploratory analysis and visualization and help bridge the gap between elementary text analysis and more sophisticated approaches that use word embeddings.
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Not applicable
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Not applicable

Technical checks

Confirm each of the following by checking the box.

[X] I have read the guide for authors and rOpenSci packaging guide.

This package:

[X] does not violate the Terms of Service of any service it interacts with.
[X] has a CRAN and OSI accepted license.
[X] contains a README with instructions for installing the development version.
[X] includes documentation with examples for all functions, created with roxygen2.
[X] contains a vignette with examples of its essential functions and uses.
[X] has a test suite.
[X] has continuous integration, including reporting of test coverage using services such as Travis CI, Coveralls and/or CodeCov.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[X] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Hello! This is a great package and is easy to use. I believe it covers an audience that would really benefit from such a tool. Below is my review.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples (that run successfully locally) for all exported functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
- was unable to complete check(), see reviewer notes below
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines
- see reviewer notes below for items to improve

Estimated hours spent reviewing: 1.25 hours

[ ] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

The package is well documented and well organized. Its simple visualizations and summary outputs suit the target audience, who want quick, uncomplicated results.

Additional commentary:

For the functions corpora_compare and corpora_best_match, it would be helpful to note in the function documentation (and in the examples provided in the README) that a pre-trained model will be downloaded if one is not already available, since the download takes a considerable amount of time (20 minutes). One option would be to ask the user to confirm that they want to download the large file.
For the function corpora_compare, the optional argument metric selects how distance is calculated, but the documentation does not list any options other than the default. If only the default works, this should not be an argument, since any other value entered by a user would fail. If other metrics are supported, they should be noted in the documentation: on further examination of the functions, "euclidean" appears to be an option but is missing from the documentation.
For the function corpora_best_match, the model is loaded once for each word being compared. It would be more efficient to load the model only once regardless of the number of comparisons, in case a user provides a large number of words for comparison. Example provided below:

> corpora_best_match('flower',c('television','candy','plant'))
Downloaded model found. Loading downloaded model...
Downloaded model found. Loading downloaded model...
Downloaded model found. Loading downloaded model...
# A tibble: 3 x 2
  corpora    metric
  <chr>       <dbl>
1 candy       0.818
2 plant       0.846
3 television  0.972

> corpora_best_match('flower',c('television','candy'))
Downloaded model found. Loading downloaded model...
Downloaded model found. Loading downloaded model...
# A tibble: 2 x 2
  corpora    metric
  <chr>       <dbl>
1 candy       0.818
2 television  0.972
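One way to load the model only once is to cache it after the first load. The following is a minimal sketch, not the package's actual code: `load_model`, `get_model`, `model_cache`, and the path "model.bin" are hypothetical names introduced for illustration, and the loader returns a placeholder object rather than a real word2vec model.

```r
# Counter used only to show how often the (hypothetical) loader runs.
load_count <- 0

# Hypothetical stand-in for the package's model loader; it returns a
# placeholder list instead of a real word2vec model.
load_model <- function(path) {
  load_count <<- load_count + 1
  message("Loading model from ", path)
  list(path = path)
}

# Environment that holds models that have already been loaded.
model_cache <- new.env(parent = emptyenv())

# Return the cached model if present; otherwise load it and cache it.
get_model <- function(path) {
  if (!exists(path, envir = model_cache, inherits = FALSE)) {
    assign(path, load_model(path), envir = model_cache)
  }
  get(path, envir = model_cache, inherits = FALSE)
}

# Three comparisons request the model, but the loader runs only once.
words <- c("television", "candy", "plant")
models <- lapply(words, function(word) get_model("model.bin"))
```

A simpler alternative with the same effect would be to load the model once before looping over the words and pass it to each comparison.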

As part of the Packaging guidelines, there is an opportunity to clean up the code to improve readability. This includes:
- moving long strings to subsequent lines, as some of the lines are wider than 100 characters (reported when running check())
- giving variables more descriptive names (e.g. df and df_30 in the corpus_viz function)
I attempted to run check(), but after 20 minutes my RStudio session aborted, so I was unable to complete the check. As far as I got, no errors were encountered, and the GitHub Actions runs show that the tests pass and coverage is high, so I have no reservations here. It may be worthwhile to look for opportunities to speed up check().
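On the metric argument raised above, base R's `match.arg()` is one way to restrict an argument to a documented set of choices and fail early on anything else. This is only a sketch under assumptions: `vector_distance` is a hypothetical helper that works on plain numeric vectors, and the choice name "cosine" is assumed for illustration (only "euclidean" is confirmed by the review).

```r
# Hypothetical helper: distance between two numeric vectors.
# The allowed metrics are listed in the signature, so they show up in
# the documentation and any other value is rejected by match.arg().
vector_distance <- function(a, b, metric = c("cosine", "euclidean")) {
  metric <- match.arg(metric)  # first choice is used when metric is missing
  switch(metric,
    cosine = 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))),
    euclidean = sqrt(sum((a - b)^2))
  )
}

# Explicit metric.
d_euclid <- vector_distance(c(3, 0), c(0, 4), metric = "euclidean")

# Default metric (the first choice, "cosine" in this sketch).
d_cosine <- vector_distance(c(1, 0), c(1, 0))

# An unsupported metric raises an error instead of failing later.
bad <- try(vector_distance(c(1, 0), c(0, 1), metric = "manhattan"),
           silent = TRUE)
```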

Hi coRPysprofiling (R)! This is a very neat idea for data retrieval and preliminary EDA!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples (that run successfully locally) for all exported functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software have been confirmed.
[x] Performance: Any performance claims of the software have been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Estimated hours spent reviewing: 1h

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

General comments:

Hi Team coRPysprofiling!

I really enjoyed reviewing your R package, and this is a very neat and compelling idea! The package is easy to use and intriguing to play around with. I tried all the functions and they all worked nicely.

On a side note, the automated tests ran at an acceptable speed for me. I understand that some pretrained models must be downloaded before the tests can run properly, so I am happy with your unit tests as they are.

Specific parts:

Considering the target audience the package intends to help (talent acquisition specialists who want to quickly retrieve valuable information from resume texts or compare text from two resumes), here are some suggestions you might find helpful:

For the corpora_best_match function, the output would be easier to read if it showed the variable name in addition to the text associated with that variable, especially for very long texts such as a paragraph from a resume. Below is an example of output with very long texts and no variable names to indicate which text belongs to which job description or resume.

#> # A tibble: 6 x 2
#>   corpora                                                                 metric
#>   <chr>                                                                    <dbl>
#> 1 You have spent countless hours over the years solving hard problems in~ 0.0818
#> 2 Direct and oversee an organization's sales policies, objectives and in~ 0.0980
#> 3 Apply statistical and machine learning knowledge to specific business ~ 0.104
#> 4 Administrative assistant duties and responsibilities include providing~ 0.128
#> 5 Prepare balance sheets, profit and loss statements and other financial~ 0.170
#> 6 Support Investment Advisors in providing superior customer service and~ 0.174
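One possible way to add labels is to read them from the names of the input vector. The sketch below is illustrative only and does not reflect the package's internals: `label_results` is a hypothetical helper, the labels and metric values are placeholders, and a base data frame stands in for the tibble the package returns.

```r
# Hypothetical helper: attach a label to each corpus and sort by metric.
# Labels come from the names of the input vector; unnamed input falls
# back to generated labels such as "corpus_1".
label_results <- function(corpora, metric) {
  labels <- names(corpora)
  if (is.null(labels)) {
    labels <- paste0("corpus_", seq_along(corpora))
  }
  out <- data.frame(
    name = labels,
    corpora = unname(corpora),
    metric = metric,
    stringsAsFactors = FALSE
  )
  out[order(out$metric), ]
}

# Placeholder labels and metric values, for illustration only.
jobs <- c(
  accountant = "Prepare balance sheets, profit and loss statements",
  data_scientist = "Apply statistical and machine learning knowledge"
)
labelled <- label_results(jobs, metric = c(0.2, 0.1))
```

With a short name column in front, long texts can be truncated in print without losing track of which row is which.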

In the corpora_best_match function, it would be less distracting if the function printed no warning or message output, especially when users compare more than 20 resume sentences (the function would first print more than 20 "Downloaded model found. Loading downloaded model..." messages). Below is an example of the messages.

corpora_best_match(mds, job_list, metric = "euclidean")
#> Downloaded model found. Loading downloaded model...
#> Downloaded model found. Loading downloaded model...
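If the status text is emitted with `message()`, as the output format suggests but the review does not confirm, a caller can already silence it with base R's `suppressMessages()`. A minimal sketch, where `noisy_step` is a hypothetical stand-in for one comparison inside corpora_best_match:

```r
# Hypothetical stand-in for a single comparison that reports its
# progress through message() before returning a value.
noisy_step <- function(text) {
  message("Downloaded model found. Loading downloaded model...")
  nchar(text)
}

texts <- c("a", "bb", "ccc")

# Wrapping the call hides the repeated messages but keeps the result.
quiet <- suppressMessages(vapply(texts, noisy_step, integer(1)))
```

Inside the package, a `verbose = FALSE` argument (also hypothetical) that guards the `message()` call would give users the same control without the wrapper.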

This suggestion might be a little beyond the scope of what we have learned in the program so far, but I believe it would be a valuable addition for the intended audience. In addition to accepting strings as input in corpus_viz, corpora_compare, and corpora_best_match, it would be even more beneficial and easier to use if users could directly input a .csv, Word, or PDF file, as resumes are usually in these forms.

Overall you have done a phenomenal job! Rachel Xu

UBC-MDS / software-review-2021

coRPysprofiling (R) #39
