UBC-MDS / software-review-2022

Submission Group 19: textfeatureinfo (Python) #3

Open Jacq4nn opened 2 years ago

Jacq4nn commented 2 years ago

Submitting Authors:

Package Name: textfeatureinfo
One-Line Description of Package: Extract information from text features which can be useful for feature engineering, or in other data science projects
Repository Link: textfeatureinfo
Version submitted: 2.0.0
Editor: Florencia D'Andrea (@flor14)

Reviewers:


Description

Our package, textfeatureinfo, gathers summary information from plain text, such as the number of punctuation marks, the average word length, and the percentage of fully capitalised words, all of which can be useful for feature engineering. The package can also manipulate text data by removing stopwords to ease later processing steps.

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Our package allows users to count the punctuation marks in a text, calculate its average word length, compute the percentage of fully capitalised words, and remove stopwords.

Data scientists and casual programmers who would like to perform basic text feature engineering with fewer lines of code.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

Yes. spaCy, NLTK, and Gensim are some of the well-established packages. Our package aims to simplify common text feature engineering steps by wrapping each one in a single function, reducing the number of lines of code.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser text-based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Jacq4nn commented 2 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

joshsia commented 2 years ago

Package Review

Estimated hours spent reviewing: 1.5 hours


Review Comments

  1. The README file looks good, but I think it would be better to be a bit more explicit about where to find the documentation online rather than just including the docs badge.

  2. I think it would be nice to see the test coverage of your package by including the codecov badge.

  3. I found the function perc_cap_words interesting, so I played around with it a bit and noticed some output I did not expect. Running perc_cap_words("W-O-R-L-D") returns 20.0 and running perc_cap_words("WOR-LD") returns 50.0. I would have expected 100 in both cases. This could be because the text is split on whitespace when counting count_cap_words, but that count is then divided by the total number of words produced by tokenizer.tokenize(text). It might be better to tokenize the text with the tokenizer first and then count how many of those tokens are in all caps.

  4. This may be a design choice, but when I run avg_word_len("it's me"), the function returns 1.666 because punctuation is replaced by a space, which turns the input into three words ("it", "s", "me"). In this case, however, I would expect the output to be 2.5, treating "it's" as a single three-letter word.

  5. I think it would be nice to tell users explicitly which characters are considered punctuation, perhaps in the function documentation.
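To make comments 3 and 4 concrete, here is a minimal sketch of the behaviour being suggested. It is not the package's implementation: the function names ending in `_sketch`, the regex-based `tokenize` helper, and the use of `string.punctuation` as the punctuation set are all assumptions made for illustration.

```python
import re
import string

# Assumption: a simple regex stands in for the package's tokenizer.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Split text into alphanumeric tokens (hypothetical stand-in tokenizer)."""
    return TOKEN_PATTERN.findall(text)


def perc_cap_words_sketch(text):
    """Percentage of tokens that are fully capitalised.

    The numerator and denominator are both computed from the same
    token list, so the two counts cannot disagree.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    cap_count = sum(1 for tok in tokens if tok.isupper())
    return 100 * cap_count / len(tokens)


def avg_word_len_sketch(text):
    """Average word length after deleting punctuation.

    Punctuation is removed rather than replaced by a space, so "it's"
    becomes the single word "its". Assumption: string.punctuation is
    the punctuation set.
    """
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    words = stripped.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


print(perc_cap_words_sketch("W-O-R-L-D"))  # 100.0
print(perc_cap_words_sketch("WOR-LD"))     # 100.0
print(avg_word_len_sketch("it's me"))      # 2.5
```

One trade-off of deleting punctuation is that hyphenated input such as "WOR-LD" is merged into a single word when computing the average length, so whichever rule is chosen is worth stating in the docstring alongside the punctuation set.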

khalidcawl commented 2 years ago

Package Review

Estimated hours spent reviewing:


Review Comments

Great work team! I like the package idea and the motivation you provided for it in the README. Here is what I liked about your project:

Here are a few suggestions I would like to add:

Good work, and keep going!

nicovandenhooff commented 2 years ago

Package Review

For packages co-submitting to JOSS

Reviewer note: Section not applicable for this package

Estimated hours spent reviewing: 1 hour


Review Comments

Overall well done, this package definitely seems useful for NLP tasks and could assist with feature engineering.

Some comments:

None of the above comments are significant, and you guys have done a great job so far!

nickmao1994 commented 2 years ago

Package Review

Estimated hours spent reviewing: 1 hr


Review Comments

I find this package very easy to understand and use. It automates many tedious NLP tasks, especially the kind I need for feature engineering. Here are my comments after testing your package: