mikeguron opened this issue 1 year ago
Template from : https://www.pyopensci.org/software-peer-review/how-to/reviewer-guide.html
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
`pyproject.toml` file or elsewhere.

Readme file requirements: The package meets the readme requirements below:
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is tall. (Note that a badge for pyOpenSci peer review will be provided upon acceptance.)
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a `paper.md` matching JOSS's requirements with:
Estimated hours spent reviewing:
It was a pleasure to look through your project and repository; not only is it straightforward and simple to use, but it also provides helpful information on how to get started. Below are some comments and questions that I hope you will take into consideration:

In general, I believe that the repository is straightforward to navigate and simple to extract information from.
Note: Shaun and I have swapped peer review groups!
Estimated hours spent reviewing: 1.5 hours
I think either including the output of each of your functions in the `README.md` (or linking to the vignette `example.ipynb`) would help to illustrate the overall design of your package more clearly and succinctly.

While there is a link to the Read the Docs page for the package in the repository description, it would be nice to include this in the `README.md`, since users will likely be referring to this document for installation/usage instructions.
I like that a disclaimer was included to ensure that users only apply this package on suitable websites, though it could be nice to include tests that verify whether a website can be scraped (checking for specific error messages), or to provide some guidelines/examples of sites that do not allow web scraping.
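One way to implement the "can this site be scraped?" check suggested above is via the site's robots.txt. The sketch below uses the standard library's `urllib.robotparser`; the sample rules and the `can_scrape` helper are illustrative only, not part of the package (in practice you would fetch the real file with `RobotFileParser.set_url(...).read()`).

```python
# Sketch: consult robots.txt rules before scraping a URL. The rules are
# parsed from an inline sample string so the example runs offline.
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def can_scrape(url: str, robots_txt: str = SAMPLE_ROBOTS_TXT, agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```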
I think `create_id()` seems more suitable as a helper function (it is not that useful for an analysis on its own). In that sense, I feel that it could be packaged together with `duster()` and `bow()`, unless it has another application? It's also not clear to me why the ID should contain metadata (maybe show some examples of grouping by the ID, or using it for something broader than just extracting the site name and order of appearance in the URL list).
I also liked that you created your own test sites/HTML files for testing. It could be good to include some tests on real examples and/or edge cases (`text_from_url()` and `duster()` are currently checked for output types only, not whether the outputs are actually as expected).
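A value-based test of the kind suggested above might look like the following sketch. `extract_text` is a hypothetical stand-in (built on the standard library's `html.parser`) for the package's own extraction code, not pyBrokk's actual implementation; the point is that the test fixes a small HTML input and asserts the exact expected text, rather than only the output type.

```python
# Sketch of a value-based test: assert the extracted text itself,
# not just its type.
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect the visible text chunks of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Hypothetical helper: return the visible text of an HTML snippet."""
    collector = _TextCollector()
    collector.feed(html)
    return " ".join(collector.chunks)

def test_extract_text_returns_expected_content():
    html = "<html><body><h1>Title</h1><p>Hello, world.</p></body></html>"
    # Compare against the exact expected value, not just isinstance(...).
    assert extract_text(html) == "Title Hello, world."
```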
Overall, nicely done! I think your package has many different applications and has room to grow with supplementary features. Great work.
- The vignette `example.ipynb` could be rendered in the documentation. Alternatively, you can add a link in `README.md` to the example notebook in Read the Docs.
- In `pyBrokk/src/pybrokk/`, you might want to delete the extra `.py` files for the individual functions (e.g. `create_id.py`), since you already put all the functions in `pybrokk.py`; this avoids redundant code in your package.
- Some imports reference an individual function's `.py` module rather than the functions in `pybrokk.py`. For example, `from pybrokk.bow import bow` should be `from pybrokk.pybrokk import bow`.
- One suggestion for the `bow()` function is adding the option to return the sparse representation from the count vectorizer. Since the raw text of a single webpage could be very long, the result will have many columns; having the option to return either the data frame or a sparse matrix is beneficial, since a sparse matrix is more memory efficient.
Estimated hours spent reviewing: 1.5 hr
One last thing: the top part of `CONTRIBUTING.md` is not rendered properly. This could be fixed.
Overall, excellent work! It was a pleasure to review and learn from you!
Submitting Author: Daniel Merigo (@DMerigo), Elena Ganacheva (@elenagan), Mike Guron (@mikeguron), Mehdi Naji (@mehdi-naji)
All current maintainers: (@DMerigo, @elenagan, @mikeguron, @mehdi-naji)
Package Name: pyBrokk
One-Line Description of Package: A package for web-scraping a list of webpages and extracting text data into a dataframe
Repository Link: https://github.com/UBC-MDS/pyBrokk
Version submitted: v1.0.0
Editor: @flor14
Reviewer 1: Yurui Feng
Reviewer 2: SNEHA
Reviewer 3: Tony Zoght
Reviewer 4: Shaun Hutchinson
Archive: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD
Description
This package allows users to provide a list of URLs for webpages of interest and creates a dataframe with a bag-of-words representation that can later be fed into a machine learning model of their choice. Users also have the option to produce a dataframe with just the raw text of their target webpages, so they can apply the text representation of their choice instead.
Scope
The package retrieves data from URLs provided by the user, extracts the text data from each webpage, then cleans the data and formats it as a dataframe, with an option for a bag-of-words representation.
Those who are new to web scraping and want a simple tool to collect text data from the internet for use in data analysis or machine learning processes
There are some libraries and packages that can facilitate parts of this job, from scraping text from a URL to turning it into a bag of words (BOW). However, to the best of our knowledge, there is no sufficiently handy and straightforward package for this purpose. This package is a tailored combination of BeautifulSoup and CountVectorizer. BeautifulSoup is widely used to pull data from HTML and XML pages, and CountVectorizer is a well-known tool for converting a collection of texts to a matrix of token counts.
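As a rough illustration of that combination, the sketch below pulls the visible text out of an HTML snippet with BeautifulSoup and counts word tokens, with `collections.Counter` standing in for CountVectorizer to keep the example small. `html_to_bow` is a hypothetical helper, not pyBrokk's API.

```python
# Sketch of the scrape-then-count pipeline: BeautifulSoup extracts the
# visible text, Counter tallies lowercase word tokens (a stand-in for
# scikit-learn's CountVectorizer, which the package actually uses).
from collections import Counter
from bs4 import BeautifulSoup

def html_to_bow(html: str) -> Counter:
    """Extract visible text from an HTML snippet and count word tokens."""
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    return Counter(text.lower().split())
```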
@tag the editor you contacted:

Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Code of conduct
Please fill out our survey
P.S. Have feedback/comments about our review process? Leave a comment here.
Editor and Review Templates
The editor template can be found here.
The review template can be found here.