UBC-MDS / software-review-2023

DSCI 524

Group 14 - pyBrokk #23

Open mikeguron opened 1 year ago

mikeguron commented 1 year ago

Submitting Author: Daniel Merigo (@DMerigo), Elena Ganacheva (@elenagan), Mike Guron (@mikeguron), Mehdi Naji (@mehdi-naji)
All current maintainers: (@DMerigo, @elenagan, @mikeguron, @mehdi-naji)
Package Name: pyBrokk
One-Line Description of Package: A package for web-scraping a list of webpages and extracting text data into a dataframe
Repository Link: https://github.com/UBC-MDS/pyBrokk
Version submitted: v1.0.0
Editor: @flor14
Reviewer 1: Yurui Feng
Reviewer 2: SNEHA
Reviewer 3: Tony Zoght
Reviewer 4: Shaun Hutchinson
Archive: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD


This package lets users provide a list of URLs for webpages of interest and creates a dataframe with a Bag of Words representation that can then be fed into a machine learning model of their choice. Users can instead produce a dataframe containing only the raw text of the target webpages and apply a text representation of their own choosing.


Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

The package retrieves data from URLs provided by the user, extracts the text from each webpage, cleans it, and formats it as a dataframe, with an option to use a bag of words representation.

Those who are new to web scraping and want a simple tool to collect text data from the internet for use in data analysis or machine learning processes.

There are some libraries and packages that can facilitate this job, from scraping text from a URL to converting it to a bag of words (BOW). However, to the best of our knowledge, there is no sufficiently handy and straightforward package for this purpose. This package is a tailored combination of BeautifulSoup and CountVectorizer. BeautifulSoup is widely used to pull data out of HTML and XML pages, and CountVectorizer is a well-known scikit-learn class that converts a collection of texts to a matrix of token counts.
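To illustrate the idea of going from HTML to a matrix of token counts, here is a minimal sketch that uses only the Python standard library. It is not pyBrokk's code: `html.parser` stands in for BeautifulSoup and `collections.Counter` stands in for CountVectorizer, and the names `TextExtractor`, `html_to_text`, `bag_of_words`, and the sample pages are hypothetical. The tokenizer here is a simple assumption and does not reproduce CountVectorizer's default token pattern.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def bag_of_words(texts):
    """Return (vocabulary, rows) with one row of token counts per text."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    counts = [Counter(re.findall(r"[a-z0-9]+", t.lower())) for t in texts]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


# Hypothetical in-memory pages, so the sketch runs without network access.
pages = {
    "page_a": "<html><body><h1>Cats</h1><p>Cats chase mice.</p>"
              "<script>var x = 1;</script></body></html>",
    "page_b": "<html><body><p>Dogs chase cats.</p></body></html>",
}

texts = [html_to_text(html) for html in pages.values()]
vocabulary, rows = bag_of_words(texts)

print(vocabulary)
for page_id, row in zip(pages, rows):
    print(page_id, row)
```

In a real workflow the HTML would be fetched from the user's URLs, and the vocabulary and rows would become the columns and rows of a dataframe.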

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

tzoght commented 1 year ago

Review of pyBrokk

Template from : https://www.pyopensci.org/software-peer-review/how-to/reviewer-guide.html

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide


The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)


Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:


For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing:

Review Comments

It was a pleasure to look through your project and repository; it is straightforward and simple to use, and it provides helpful information on how to get started. Below are some critiques and questions that I hope you will take into consideration:

  1. It would be very helpful if you could add explicit links in the README.md file to the artifacts that you have made, for example a link to the Jupyter Notebook sample (which is quite useful) and a link to the documentation. I am aware that the documentation can be reached through the badge; however, not all developers know this.
  2. I was unable to determine the location to which this package is published; therefore, a badge or link to PyPI would be of great assistance.
  3. With regard to the architecture of the API, I have just one comment. I don't understand the purpose of the `create_id()` function, and I don't see why a developer would use it by itself. It would be useful if you could describe it in the documentation.
  4. The version number is still 0.0.10; perhaps this was done on purpose.
  5. Although the documentation is awesome, it would be great if you could say a bit more about when someone would use the package (in the workflow of a data science project).

In general, I believe that the repository is straightforward to navigate and simple to extract information from.

lzung commented 1 year ago

Note: Shaun and I have swapped peer review groups!

Package Review


Estimated hours spent reviewing: 1.5 hours

Review Comments

Overall, nicely done! I think your package has many different applications and has room to expand to adopt supplementary features. Great work.

Yurui-Feng commented 1 year ago

Package Review


Estimated hours spent reviewing: 1h

Review Comments

snesunil commented 1 year ago

Package Review


Estimated hours spent reviewing: 1.5 hr

Review Comments

Overall, excellent work! It was a pleasure to review and learn from you!