UBC-MDS / software-review-2023

DSCI 524

Group 14 - BrokkR #22

Open elenagan opened 1 year ago

elenagan commented 1 year ago

name: BrokkR
about: This package allows users to provide a list of URLs for webpages of interest and creates a dataframe with a Bag of Words representation that can later be fed into a machine learning model of their choice. Users also have the option to produce a dataframe with just the raw text of their target webpages, so they can apply the text representation of their choice instead.


Submitting Author Name: Elena Ganacheva, Mike Guron, Daniel Merigo, Mehdi Naji
Submitting Author Github Handle: @elenagan
Other Package Authors Github handles: @mikeguron, @DMerigo, @mehdi-naji
Repository: https://github.com/UBC-MDS/BrokkR
Version submitted: 0.2.0
Submission type: Standard
Editor: @flor14
Reviewers: TBD

Archive: TBD
Version accepted: TBD
Language: en

Package: BrokkR
Title: Webscrape to DataFrame for Bag of Words
Version: 0.0.0.9000
Authors@R: 
    c(person("Elena", "Ganacheva", , "elena.ganacheva@gmail.com", role = c("aut", "cre")),
    person("Mike", "Guron", , "mike.guron21@gmail.com", role = c("ctb")),
    person("Daniel", "Merigo", , "dmerigos@gmail.com", role = c("ctb")),
    person("Mehdi", "Naji", , "mehdinaji@gmail.com", role = c("ctb")))
Description: This package allows users to provide a list of URLs for webpages of interest and creates a dataframe with Bag of Words representation that can then later be fed into a machine learning model of their choice. Users also have the option to produce a dataframe with just the raw text of their target webpages to apply the text representation of their choice instead. 
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
Config/testthat/edition: 3
Imports: 
    dplyr,
    polite,
    rvest,
    stringr,
    superml,
    testthat,
    tibble

Scope

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

MEE Options

- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see [MEE's Policy on Publishing Code](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/journal-resources/policy-on-publishing-code.html))
- (*Scope: Do consider MEE's [Aims and Scope](http://besjournals.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)2041-210X/aims-and-scope/read-full-aims-and-scope.html) for your manuscript. We make no guarantee that your manuscript will be within MEE scope.*)
- (*Although not required, we strongly recommend having a full manuscript prepared when you submit here.*)
- (*Please do not submit your package separately to Methods in Ecology and Evolution*)

Code of conduct

tzoght commented 1 year ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 1.5 hours


Review Comments

The package was easy to understand and install, and it closely mirrors the Python package. Overall, the team did a good job with the quality of the code and the documentation. I have a few minor points that the team can consider in the future:

  1. I did not see any workflow badges in the README.md. If you make this package part of your portfolio, badges would help visitors see the state of the repository at a glance.
  2. The README.md is light on details; links to other useful documents, such as the license file, would help.
  3. It would have been useful (but optional) to describe the architecture of the package and its data flow for contributors who want to help and add more features.
  4. (Minor) I noticed you have many open branches but no active pull requests. Finally, great job putting this together; it's also awesome to see that you have published the vignette on GH (https://ubc-mds.github.io/BrokkR/), which goes above and beyond.
  5. Although the documentation is awesome, it would be great if you could say a bit more about when someone would use the package (in the workflow of a data science project).

lzung commented 1 year ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 1.5 hours


Review Comments

I would be cautious of including these outputs directly with bow() which is responsible for parsing the text for word counts. You may not be interested in capturing these instances of "function", "date", "event" or "start" for example. I would try to parse the strings that are scraped so that these elements can be removed before further processing.
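
A minimal sketch of the kind of pre-cleaning suggested here, assuming the scraped page is available as an rvest/xml2 document. The helper name `extract_visible_text` is hypothetical and is not part of BrokkR, and xml2 is an assumption (it is not listed in the package's Imports, although rvest depends on it). The idea is to drop `<script>`, `<style>` and `<noscript>` nodes before the text is handed to a word counter.

```r
library(rvest)
library(xml2)

# Hypothetical helper (not part of BrokkR): return the page text with
# <script>, <style> and <noscript> content removed.
extract_visible_text <- function(page) {
  # Remove nodes whose contents are code or styling rather than prose.
  # Note: xml_remove() modifies `page` in place.
  xml_remove(html_elements(page, "script, style, noscript"))
  # html_text2() normalises whitespace roughly the way a browser would
  html_text2(html_element(page, "body"))
}

# Toy page standing in for a scraped URL
page <- read_html(
  "<html><body><p>Hello world</p>
   <script>function start(event) { return new Date(); }</script>
   </body></html>"
)
text <- extract_visible_text(page)
```

With this kind of filter in front of the counting step, tokens such as "function" or "event" that only occur inside embedded scripts should no longer reach the Bag of Words output.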

Overall, well done. I was not aware of a web scraping package within R so this seems like it could have cool applications combining NLP and statistical analysis. I'm interested in seeing how your package evolves from our feedback!

Yurui-Feng commented 1 year ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 1h


Review Comments

  1. Overall, well done on the R package! I ran check() and your package returns 0 warnings and 0 errors. I also like the detailed examples in your function manuals.
  2. You might want to add badges as a quick overview of the status of your package: whether R-CMD-check passed and the coverage of your tests.
  3. I noticed that there are some testthat calls inside the functions (e.g. brok_scrape) for exception handling. It would be better to suppress the "test passed" output when no error occurs.
  4. Include more information in README.md, such as contributing guidelines and license information, to encourage community involvement.
  5. After checking the test coverage report, I think you could increase your package's test coverage by writing some tests for the function brok_scrape.
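
As a sketch of point 3, input checks can be written with base R conditions instead of testthat expectations, so nothing is printed when the input is valid. The function name `validate_urls` and its argument are assumptions made for illustration; they are not taken from BrokkR's source.

```r
# Hypothetical validator (name and argument are assumptions, not BrokkR's API).
# It is silent on success and raises a regular R error on bad input.
validate_urls <- function(urls) {
  if (!is.character(urls) || length(urls) == 0) {
    stop("`urls` must be a non-empty character vector.", call. = FALSE)
  }
  if (anyNA(urls) || !all(grepl("^https?://", urls))) {
    stop("Every element of `urls` must start with http:// or https://.",
         call. = FALSE)
  }
  invisible(urls)
}

# Valid input: no output, the vector is returned invisibly
ok <- validate_urls(c("https://example.com", "http://example.org"))

# Invalid input: the error message can be captured and inspected
err <- tryCatch(validate_urls(42), error = function(e) conditionMessage(e))
```

Checks written this way need no network access, so they can be covered from tests/testthat with `testthat::expect_error()`, which would also help with the coverage gap mentioned in point 5. If testthat is then only used in the test suite, it could move from Imports to Suggests in DESCRIPTION.
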
snesunil commented 1 year ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 1 hr


Review Comments

Overall, excellent work! It was a pleasure to review and learn from you!