Group 14 - BrokkR - Githubissues

name: BrokkR about: This package allows users to provide a list of URLs for webpages of interest and creates a dataframe with Bag of Words representation that can then later be fed into a machine learning model of their choice. Users also have the option to produce a dataframe with just the raw text of their target webpages to apply the text representation of their choice instead.

Submitting Author Name: Elena Ganacheva, Mike Guron, Daniel Merigo, Mehdi Naji Submitting Author Github Handle: !--author1-->@elenagan<!--end-author1-- Other Package Authors Github handles: (comma separated, delete if none) @mikeguron, @DMerigo, @mehdi-naji Repository: https://github.com/UBC-MDS/BrokkR Version submitted: 0.2.0 Submission type: Standard Editor: @flor14 Reviewers: TBD

Archive: TBD Version accepted: TBD Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: BrokkR
Title: Webscrape to DataFrame for Bag of Words
Version: 0.0.0.9000
Authors@R: 
    c(person("Elena", "Ganacheva", , "elena.ganacheva@gmail.com", role = c("aut", "cre")),
    person("Mike", "Guron", , "mike.guron21@gmail.com", role = c("ctb")),
    person("Daniel", "Merigo", , "dmerigos@gmail.com", role = c("ctb")),
    person("Mehdi", "Naji", , "mehdinaji@gmail.com", role = c("ctb")))
Description: This package allows users to provide a list of URLs for webpages of interest and creates a dataframe with Bag of Words representation that can then later be fed into a machine learning model of their choice. Users also have the option to produce a dataframe with just the raw text of their target webpages to apply the text representation of their choice instead. 
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
Config/testthat/edition: 3
Imports: 
    dplyr,
    polite,
    rvest,
    stringr,
    superml,
    testthat,
    tibble

Scope

Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
- [x] data retrieval
- [x] data extraction
- [x] data munging
- [ ] data deposition
- [ ] data validation and testing
- [x] workflow automation
- [ ] version control
- [ ] citation management and bibliometrics
- [ ] scientific software wrappers
- [ ] field and lab reproducibility tools
- [ ] database software bindings
- [ ] geospatial data
- [ ] text analysis
Explain how and why the package falls under these categories (briefly, 1-2 sentences): The package retrieves data from urls provided by the user and extracts the text data from the webpage and then cleans the data and formats it as a dataframe with an option to have bag of words representation.
Who is the target audience and what are scientific applications of this package? Those who are new to webscraping and want a simple tool to collect text data from the internet for data analysis or machine learning purposes.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category? There are some libraries and packages that can facilitate this job, from scraping text from a URL to returning it to a bag of words (BOW). However, to the extent of our knowledge, there is no sufficiently handy and straightforward package for this purpose. This package is a tailored combination of Rvest and CountVectorizer. Rvest widely used to pull different sources of data from HTML and XML pages, and CountVectorizer is a well-known package to convert a collection of texts to a matrix of token counts
(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?
If you made a pre-submission inquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
Explain reasons for any pkgcheck items which your package is unable to pass.

Technical checks

Confirm each of the following by checking the box.

[x] I have read the rOpenSci packaging guide.
[x] I have read the author guide and I expect to maintain this package for at least 2 years or to find a replacement.

This package:

[x] does not violate the Terms of Service of any service it interacts with.
[ ] has a CRAN and OSI accepted license.
[x] contains a README with instructions for installing the development version.
[x] includes documentation with examples for all functions, created with roxygen2.
[ ] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[ ] has continuous integration, including reporting of test coverage.

Publication options

[ ] Do you intend for this package to go on CRAN?
[ ] Do you intend for this package to go on Bioconductor?
[ ] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal. - [ ] The manuscript describing the package is no longer than 3000 words. - [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html)) - (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*) - (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*) - (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

[x] I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 1.5 hours

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

The package was easy to understand and install and is very symmetrical to the Python package, overall, the team did a good job with the quality of the code and the documentation. I have a few minor points that the team can consider in the future

I did not see any workflow badges in the README.md, but I think having them when you make it part of your portfolio is helpful to see the state of the repository when they check it out.
The README.md is light on details, and links to other helpful documents would be helpful, like links to the license file.
It would have been useful (but optional) to describe the architecture of the package and data flow for the contributors who are willing to help and add more features.
(Minor) I noticed you have many open branches but no active pull requests. Finally, great job putting this together; also, it's awesome to see that you have published the vignette on GH (https://ubc-mds.github.io/BrokkR/), that's going above and beyond.
Although the documentations are awesome, if you can a bit more about when would someone use the package (in the workflow of a data science project), that would be great.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[X] Vignette(s): demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all exported functions
[X] Examples: (that run successfully locally) for all exported functions
[X] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[X] Installation: Installation succeeds as documented.
[X] Functionality: Any functional claims of the software been confirmed.
[X] Performance: Any performance claims of the software been confirmed.
[ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 1.5 hours

[X] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

The README.md does explain how to access the vignette, though including the link to the Github Pages website in this section might be helpful for users to view the examples directly without having to search for them.
brok_scrape() seems to be missing its associated test file under the tests/testthat folder.
There are no badges on the README.md indicating the CI/CD, (R-CMD-check), test coverage or docs building status.
Each functions is very straightforward to use, though I am a bit confused by some of their outputs as they seem to be running tests whenever the functions are called? I do not think this is the intended behaviour as the functions should be returning the tibbles/vectors directly. If it is intentional, I would explain what these tests mean in your documentation so that users can know what will pass or fail.
The following output seems unexpected/differs from your Python package:

I would be cautious of including these outputs directly with bow() which is responsible for parsing the text for word counts. You may not be interested in capturing these instances of "function", "date", "event" or "start" for example. I would try to parse the strings that are scraped so that these elements can be removed before further processing.

Overall, well done. I was not aware of a web scraping package within R so this seems like it could have cool applications combining NLP and statistical analysis. I'm interested in seeing how your package evolves from our feedback!

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 1h

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

Overall, well done on the R package! I ran check() and your package returns 0 warnings and 0 errors. I also like the detailed examples in your function manuals.
You might want to add badges as a quick overview of the status of your package: whether R-CMD-check passed and the coverage of your tests.
I noticed that there are some testthat calls inside the function (e.g. brok_scrape) for exception handling. It will be better if you could suppress the test passed output when no error occurred.
Include more information in README.md like contributing guidelines and license information to encourage community involvement.
After checking the test coverage report, I think you could increase your package's test coverage by writing some tests for the function brok_scrape.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Briefly describe any working relationship you have (had) with the package authors.
[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need: clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s): demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions
[x] Examples: (that run successfully locally) for all exported functions
[x] Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 1 hr

[x] Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

As the python counterpart, this package is very interesting and would fit well within the R ecosystem.
I appreciate the package concept, and the documentation and repository were well-structured and intuitive.
I can see that the badges were added in the python project in the README file, but are missing in the R project. It would be nice to add the badges to the R project as well.
There are multiple stale branches in the repository. It would be a good idea to remove them once the pull requests are merged.
The documentation was easy to understand. However, It would be better to explicitly link the documentation(https://ubc-mds.github.io/BrokkR/) in the README as well.
Files like .Rhistory could be added to .gitignore so that it is not committed to the repository.
Similarly .DS_Store could be removed from the repo and added to .gitignore
A few of the sections are missing in the README file like contributing, license, etc.
One last thing would be that the CONTRIBUTING.md is not rendered properly in the top part. This could be fixed.

Overall, excellent work! It was a pleasure to review and learn from you!

UBC-MDS / software-review-2023

Group 14 - BrokkR #22

Archive: TBD Version accepted: TBD Language: en

Scope

Technical checks

Publication options

Code of conduct

Package Review

Documentation

Functionality

Review Comments

Package Review

Documentation

Functionality

Review Comments

Package Review

Documentation

Functionality

Review Comments

Package Review

Documentation

Functionality

Review Comments