mikeguron opened this issue 1 year ago
Template from : https://www.pyopensci.org/software-peer-review/how-to/reviewer-guide.html
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
`pyproject.toml` file or elsewhere.

Readme file requirements: The package meets the readme requirements below:
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is tall. (Note that a badge for pyOpenSci peer review will be provided upon acceptance.)
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a `paper.md` matching JOSS's requirements with:
Estimated hours spent reviewing:
It was a pleasure to look through your project and repository; not only is it straightforward and simple to use, but it also provides helpful information on how to get started. Below are some comments and questions that I hope you will take into consideration:

In general, I believe that the repository is straightforward to navigate and simple to extract information from.
Note: Shaun and I have swapped peer review groups!
Estimated hours spent reviewing: 1.5 hours
I think either including the output of each of your functions in the `README.md` (or linking to the vignette `example.ipynb`) would help to illustrate the overall design of your package more clearly and succinctly.

While there is a link to the Read the Docs page for the package in the repository description, it would be nice to include this in the `README.md`, since users will likely be referring to this document for installation/usage instructions.
I like that a disclaimer was included to ensure that users only apply this package on suitable websites, though it could be nice to include tests that verify whether a website can be scraped (checking for specific error messages), or to provide some guidelines/examples of sites that do not allow web scraping.
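One way to implement the "can this site be scraped?" check suggested above is via the site's robots.txt. The sketch below uses the standard library's `urllib.robotparser`; the sample rules and the `can_scrape` helper are illustrative only, not part of the package (in practice you would fetch the real file with `RobotFileParser.set_url(...).read()`).

```python
# Sketch: consult robots.txt rules before scraping a URL. The rules are
# parsed from an inline sample string so the example runs offline.
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def can_scrape(url: str, robots_txt: str = SAMPLE_ROBOTS_TXT, agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```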
I think `create_id()` seems more suitable as a helper function (it is not that useful for an analysis on its own). In that sense, I feel that it could be packaged together with `duster()` and `bow()`, unless it has another application? It's also not clear to me why the ID should contain metadata (maybe show some examples of grouping by the ID, or using it for something broader than just extracting the site name and order of appearance in the URL list).
I also liked that you created your own test sites/HTML files for testing. It could be good to include some tests on real examples and/or edge cases (`text_from_url()` and `duster()` are currently checked for output types only, not whether the outputs are actually as expected).
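A value-based test of the kind suggested above might look like the following sketch. `extract_text` is a hypothetical stand-in (built on the standard library's `html.parser`) for the package's own extraction code, not pyBrokk's actual implementation; the point is that the test fixes a small HTML input and asserts the exact expected text, rather than only the output type.

```python
# Sketch of a value-based test: assert the extracted text itself,
# not just its type.
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect the visible text chunks of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Hypothetical helper: return the visible text of an HTML snippet."""
    collector = _TextCollector()
    collector.feed(html)
    return " ".join(collector.chunks)

def test_extract_text_returns_expected_content():
    html = "<html><body><h1>Title</h1><p>Hello, world.</p></body></html>"
    # Compare against the exact expected value, not just isinstance(...).
    assert extract_text(html) == "Title Hello, world."
```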
Overall, nicely done! I think your package has many different applications and has room to grow with supplementary features. Great work.
- The vignette `example.ipynb` could be rendered in the documentation. Alternatively, you can add a link in `README.md` to the example notebook in Read the Docs.
- In `pyBrokk/src/pybrokk/`, you might want to delete the extra `.py` files for the individual functions (e.g. `create_id.py`), since you already put all the functions in `pybrokk.py`; this avoids redundant code in your package.
- Some imports reference an individual function's `.py` module rather than the functions in `pybrokk.py`. For example, `from pybrokk.bow import bow` should be `from pybrokk.pybrokk import bow`.
- One suggestion for the `bow()` function is adding the option to return the sparse representation from the count vectorizer. Since the raw text of a single webpage could be very long, the result will have many columns; having the option to return either the data frame or a sparse matrix is beneficial, since a sparse matrix is more memory efficient.
Estimated hours spent reviewing: 1.5 hr
One last thing: the top part of `CONTRIBUTING.md` is not rendered properly. This could be fixed.
Overall, excellent work! It was a pleasure to review and learn from you!
Submitting Author: Daniel Merigo (@DMerigo), Elena Ganacheva (@elenagan), Mike Guron (@mikeguron), Mehdi Naji (@mehdi-naji)
All current maintainers: (@DMerigo, @elenagan, @mikeguron, @mehdi-naji)
Package Name: pyBrokk
One-Line Description of Package: A package for web-scraping a list of webpages and extracting text data into a dataframe
Repository Link: https://github.com/UBC-MDS/pyBrokk
Version submitted: v1.0.0
Editor: @flor14
Reviewer 1: Yurui Feng
Reviewer 2: SNEHA
Reviewer 3: Tony Zoght
Reviewer 4: Shaun Hutchinson
Archive: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD
Description
This package allows users to provide a list of URLs for webpages of interest and creates a dataframe with a bag-of-words representation that can later be fed into a machine learning model of their choice. Users also have the option to produce a dataframe with just the raw text of their target webpages, so they can apply the text representation of their choice instead.
Scope
The package retrieves data from URLs provided by the user, extracts the text data from each webpage, then cleans the data and formats it as a dataframe, with an option for a bag-of-words representation.
Those who are new to web scraping and want a simple tool to collect text data from the internet for use in data analysis or machine learning processes
There are some libraries and packages that can facilitate parts of this job, from scraping text from a URL to turning it into a bag of words (BOW). However, to the best of our knowledge, there is no sufficiently handy and straightforward package for this purpose. This package is a tailored combination of BeautifulSoup and CountVectorizer. BeautifulSoup is widely used to pull data from HTML and XML pages, and CountVectorizer is a well-known tool for converting a collection of texts to a matrix of token counts.
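As a rough illustration of that combination, the sketch below pulls the visible text out of an HTML snippet with BeautifulSoup and counts word tokens, with `collections.Counter` standing in for CountVectorizer to keep the example small. `html_to_bow` is a hypothetical helper, not pyBrokk's API.

```python
# Sketch of the scrape-then-count pipeline: BeautifulSoup extracts the
# visible text, Counter tallies lowercase word tokens (a stand-in for
# scikit-learn's CountVectorizer, which the package actually uses).
from collections import Counter
from bs4 import BeautifulSoup

def html_to_bow(html: str) -> Counter:
    """Extract visible text from an HTML snippet and count word tokens."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    return Counter(text.lower().split())
```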
@tag the editor you contacted:

Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Code of conduct
Please fill out our survey
P.S. Have feedback/comments about our review process? Leave a comment here.
Editor and Review Templates
The editor template can be found here.
The review template can be found here.