Submission: nlpsummarize (Python)

name: nlpsummarize about: Python package that provides a nice summary of all character columns in a pandas dataframe.

Submitting Author: Vignesh Chandrasekaran (@vigchandra), Karlos Muradyan (@KarlosMuradyan), Karanpal Singh (@singh-karanpal) , Sam Chepal (@schepal) Repository: https://github.com/UBC-MDS/nlpsummarize Version submitted: 1.1.0 Editor: Varada (@kvarada ) Reviewer 1: Shivam Verma (@vermashivam679) Reviewer 2: George Thio (@gptzjs ) Archive: TBD Version accepted: TBD

Description

One of the most relevant applications of machine learning for corporations globally is the use of natural language processing (NLP). Whether it be parsing through business documents to gather key word segments or detecting Twitter sentiment of a certain product, NLP’s use case is prevalent in virtually every business environment.

Our library specifically will make extensive use of pre-existing packages in the Python eco-system. We will use the nltk library to build most of the sentiment analysis functions while also leveraging well-known packages such as pandas to aid in the overall presentation of our final output results.

Scope

Please indicate which category or categories this package falls under:
- [ ] Data retrieval
- [ ] Data extraction
- [X] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [X] Data visualization*

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Explain how the and why the package falls under these categories (briefly, 1-2 sentences):

Any data scientist who deals with textual data would be likely using this package to get quick summaries of the data that they would be dealing with.

Who is the target audience and what are scientific applications of this package?

Unfortunately, there are few tools today which provide summary statistics on textual data that a user may want to analyze. Our goal with this package is to provide users with a simple and flexible tool to gather key insights that would be useful during the exploratory data analysis phase of the data science workflow.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

To the best of our knowledge, there is no any other package that combines all the below mentioned functionality in one.

If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[X ] does not violate the Terms of Service of any service it interacts with.
[X] has an OSI approved license
[X] contains a README with instructions for installing the development version.
[X] includes documentation with examples for all functions.
[X] contains a vignette with examples of its essential functions and uses.
[X] has a test suite.
[X ] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[ ] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[X] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[x] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[x] Direction to more detailed documentation (e.g. your documentation files or website).
[ ] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 2.5 hours

Review Comments

Hi Vignesh, Karanpal, Samneet and Karlos!

First of all, I think this is a great package idea which has a lot of utility in a field which I am really interested in to the point that I contemplated doing MDS-CL.

Please find below my review and feedback for your consideration.

Function documentation was concise, informative and effective at explaining the utility of the package which immediately resonated with me as I just had a conversation with my best friend this morning who is managing partner of a major law firm in Vancouver about how to incorporate ML to improve business intelligence and streamline lawyer workflow and productivity. So immediately thought of nlpsummarize!
Package Installation was quick and easy and I was able to run your good mix of tests and tests of my own in English, French, and Simplified Chinese (didn't expect that to work!) once I was able to install all the dependencies, but encountered some minor issues which are noted below.
Suggest amending the README instruction under Dependencies as shown below to specify that the file needs to be saved in the model folder which one of you kindly informed me of after I got the No module named ‘fasttext’ error：

Please note that the fasttext library must use the specific pre-trained model called lid.176.bin which should be saved to the model folder and can be downloaded here.

Your test case in README needs to include line breaks / when passed to nlp.NLPFrame as the dictionary value
I had to run pip install pycountry after receiving an error message so you should also list that under Dependencies
I was able to run tests in a Jupyter notebook despite receiving the following message after importing nlp from nlpsummarize which you may want to investigate:

There seems to be a bug which resulted in the stop word count being 179 for all the short test sentence(s) I tried including the one provided in your README and the ones in nlp.py
The performance of the functions is better for English which I assume is because of a larger corpus in fastText? For example, the polarity function was able to correctly identify the presence of one positive word and zero negative words in your test case "I love travelling to Japan and eating Mexican food but I can only speak English!". But it was not able to detect any positive or negative words in the following French or Chinese sentences:

French: "Bonjour mon ami! J'aime manger du poisson mais je deteste la viande" (Translation: "Hello my friend! I love eating fish, but I hate meat")
Chinese: “你好！我爱睡觉但是我不喜欢学习!" (Translation: "Hello! I love sleeping but I don't like studying!"

Great work on this. It was a pleasure reviewing your package. I look forward to your replies to address my feedback and to using nlpsummarize in the future!

George

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all user-facing functions
[x] Examples for all user-facing functions
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING.
[x] Metadata including author(s), author e-mail(s), a url, and any other relevant metadata e.g., in a setup.py file or elsewhere.

Readme requirements The package meets the readme requirements below:

[x] Package has a README.md file in the root directory.

The README should include, from top to bottom:

[x] The package name
[x] Badges for continuous integration and test coverage, the badge for pyOpenSci peer-review once it has started (see below), a repostatus.org badge, and any other badges. If the README has many more badges, you might want to consider using a table for badges, see this example, that one and that one. Such a table should be more wide than high.
[x] Short description of goals of package, with descriptive links to all vignettes (rendered, i.e. readable, cf the documentation website section) unless the package is small and there’s only one vignette repeating the README.
[x] Installation instructions
[ ] Any additional setup required (authentication tokens, etc)
[x] Brief demonstration usage
[ ] Direction to more detailed documentation (e.g. your documentation files or website).
[x] If applicable, how the package compares to other similar packages and/or how it relates to other packages
[x] Citation information

Functionality

[x] Installation: Installation succeeds as documented.
[x] Functionality: Any functional claims of the software been confirmed.
[x] Performance: Any performance claims of the software been confirmed.
[x] Automated tests: Tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[x] Continuous Integration: Has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines.

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition in their submission requirements.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software
[ ] Authors: A list of authors with their affiliations
[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.
[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing:

Review Comments

Hi, Team nlpsummarize,

It was great going through your package. I really liked the motivation behind this package, and you guys have done a wonderful job in implementing your ideas. The code was easy to read because of proper comments & print statements. I also liked the automatic definition & references that pop up in the README, which makes it easier to browse through the repo. Capture2

Given the time constraint, I appreciate your efforts in creating this package with all its functionality. Here is my feedback of your package, that can help you refine most of the rough edges:

I would suggest modifying the description of sentence_stopwords_freq to make it more clear because I am not sure how you are checking 'the proportion on the number of sentences'.
It would be nice to include a link to the documentation (readthedocs.org & test.pypi.org).
In the usage section, I think there is a typo, it should be nlpsummarize instead of nlpsummarizer.
I had some trouble downloading the fasttext package in Windows but worked fine on Linux, not sure if it has something to do with my system. Apart from fasttext, I had to install pycountry & nltk.download('universal_tagset') was required. Also, when I try to import fasttext from a location other than the root of the repo, I get a message saying 'Something went wrong when downloading!!', I would suggest to modify it and say 'Try importing from the root of the package repo'.
Maybe put a warning at line 59 of nlp.py saying that 'the data frame has no text column'.
There is no clear error message when I use this data for testing functions of the package df = nlp.NLPFrame({'text_col' : [4,5,6]}, column='text_col'). It should give a warning like in other cases, that no text column detected. Please see the snapshot attached.
In the function docstring, maybe use nlpsummarize.NLPFrame instead of nlp.NLPFrame because the user doesn't know about the alias used for nlpsummarize.
At line 418 & 442 of nlp.py, it might be good to write the filenames present in the repo or write it like this or .
If possible, try to increase code coverage to 90%.

Overall, I enjoyed going through your package and I hope that my review will help make your package better. Looking forward to having a productive conversation with all of you.

Thanks & Regards, Shivam Verma

Thank you for the feedback everyone. As per your suggestions, we edited the documentation in the README file and fixed up the sentence_stopwords_freq method documentation for greater clarity. Other minor typos and syntax errors were addressed to ensure the code works properly. Please see the final release here. Thanks again!

Hi @gptzj and @vermashivam679 . Just wanted to give more detailed information what we changed and what issues have been addressed by us.

@gptzjs here are what we addressed from your comments:

3) Necessity of model folder: Now user can specify the path where the downloaded model will be stored. It is not hardcoded in model folder, but has default value pointing to the current directory.

4) Example of usage typo: Done!!

5) Fixed!!

6) The following messages say that the fasttext pretrained model haven't downloaded because of either internet issue or the issue mentioned in your 3rd commend. The other lines are not error messages, rather they inform you that one of the columns has been automatically picked for summary as you explicitely didn't mention any when creating the instance. For manually select a column you can use thecolumn argument.

7) Fixed!!

8) Yes you are correct. We haven't written any ML model for the tasks, but used ones provided by nltk and fasttext packages. Therefore, the performance of the summary generated when using our package is highly dependent on the the performance of above-mentioned packages.

Thanks for your great feedback, we really appreciate that.

@vermashivam679 here are what we addressed from your comments:

1) Done

2) Done

3) Fixed!!

4) All the mentioned issues should be solved. nltk.download('universal_tagset') is made automatically so that user doesn't bother running all these manually.

5) We have message saying "Either column parameter is not defined or it is not present in the NLPFrame." Then our package tries to select column automatically. If a column (or columns) exists, we pick the first column with text. If column is not present, we don't do anything. The reason is that NLPFrame inherits from pd.DataFrame, thus any operation, like concatenation or column addition is possible. User can add column with text and specify that column later when running any of our functions by giving column arguemt a value. Otherwise, if user wants to run summary on NLPFrame without any text column, we raise ValueError with valid error message.

7) Our class is implemented in nlp.py file, so the default import statement should be from nlpsummarize import nlp as mentioned in the README.md. Without any alias, our class is implemented in nlp not nlpsummarize file.

8) Done!!

9) We have two main files, one of which is the implementation of our class, while the other handles the dependency issues. The dependency issues may raise if a user doesn't have pre-trained models downloaded. The functions there automatically handle that case and download the weights of deep learning models that we use and store in mentioned locations.

The problem with dependency file is that we are not able (or at least may be very difficult) to recreate the situation when a user has or doesn't have the files downloaded. If you take a look into the Codeconv report, you can see that our main nlp.py file has code coverage more than 92%, while dependency file has code coverage around 50% because of the issues described above. We can remove that file and write all instructions in the README.md file for post-installation guides, but it will make our users unhappy, because nobody wants to do some additional work. This is the reason that at the end we have overall code coverage 87%.

Moreover, in the nlp.py file, we also cannot increase our coverage more, because uncovered lines (given from the report of Codecov) are ones that depend on dependency.py file's functions. Therefore, we can increase this code coverage only in the case when we delete that file, which is not the decision that we want to make.

Not addressed:

6) In this case, you explicitly say that the column with text is 'text_column', which is not true. The error message could have been better and generated by us, but we rely on the message generated by underlying function. It is not always a good choice, but sometimes because of some reasons, we can't handle all cases that user may enter and address them. Sometimes, functions should rely on explicit values given by users. For example, when a user uses pd.read_csv(.) file where columns are separated by ';' and not ',', the function just reads the file with the default symbol, i.e. ','. Additionaly information given by a user explicitly, we think should be correct, otherwise user may expect that something will go wrong.

@vermashivam679 thanks you as well for very informative feedback.

Hope this will clarify every issues seen by you!

UBC-MDS / software-review