UBC-MDS / software-review

MDS Software Peer Review of MDS-created packages

Submission: nlpsummarize (Python) #24

Closed by vigchandra 4 years ago

vigchandra commented 4 years ago

name: nlpsummarize
about: Python package that provides a nice summary of all character columns in a pandas dataframe.

Submitting Author: Vignesh Chandrasekaran (@vigchandra), Karlos Muradyan (@KarlosMuradyan), Karanpal Singh (@singh-karanpal), Sam Chepal (@schepal)
Repository: https://github.com/UBC-MDS/nlpsummarize
Version submitted: 1.1.0
Editor: Varada (@kvarada)
Reviewer 1: Shivam Verma (@vermashivam679)
Reviewer 2: George Thio (@gptzjs)
Archive: TBD
Version accepted: TBD


Description

One of the most relevant applications of machine learning for corporations globally is the use of natural language processing (NLP). Whether it be parsing through business documents to gather key word segments or detecting Twitter sentiment of a certain product, NLP’s use case is prevalent in virtually every business environment.

Our library specifically will make extensive use of pre-existing packages in the Python eco-system. We will use the nltk library to build most of the sentiment analysis functions while also leveraging well-known packages such as pandas to aid in the overall presentation of our final output results.

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Any data scientist who works with textual data is likely to use this package to get quick summaries of the data they are dealing with.

Unfortunately, there are few tools today which provide summary statistics on textual data that a user may want to analyze. Our goal with this package is to provide users with a simple and flexible tool to gather key insights that would be useful during the exploratory data analysis phase of the data science workflow.

To the best of our knowledge, there is no other package that combines all of the functionality mentioned below in one place.

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option allows reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser text-based review. It also allows you to demonstrate addressing each issue via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

gptzjs commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing: 2.5 hours


Review Comments

Hi Vignesh, Karanpal, Samneet and Karlos!

First of all, I think this is a great package idea with a lot of utility in a field I am really interested in, to the point that I contemplated doing MDS-CL.

Please find below my review and feedback for your consideration.

  1. Function documentation was concise, informative, and effective at explaining the utility of the package. It immediately resonated with me: just this morning I had a conversation with my best friend, the managing partner of a major law firm in Vancouver, about how to incorporate ML to improve business intelligence and streamline lawyer workflow and productivity, and I immediately thought of nlpsummarize!

  2. Package installation was quick and easy. Once I had installed all the dependencies, I was able to run your good mix of tests, plus tests of my own in English, French, and Simplified Chinese (didn't expect that to work!), though I encountered some minor issues which are noted below.

  3. Suggest amending the README instruction under Dependencies, as shown below, to specify that the file needs to be saved in the model folder, which one of you kindly informed me of after I got the `No module named 'fasttext'` error:

Please note that the fasttext library must use the specific pre-trained model called lid.176.bin which should be saved to the model folder and can be downloaded here.

  4. Your test case in the README needs to include line continuations (`\`) or implicit string concatenation when the long sentence is passed to nlp.NLPFrame as the dictionary value
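To illustrate what I mean (a hypothetical minimal snippet, not your actual README example): a long test sentence used as a dictionary value can be split across lines with implicit string concatenation so it stays a single valid Python string:

```python
# Hypothetical README snippet: adjacent string literals are concatenated
# implicitly, so the dictionary value remains one string even when the
# sentence is split across lines.
data = {
    "text": [
        "I love travelling to Japan and eating Mexican food "
        "but I can only speak English!"
    ]
}
print(data["text"][0])
```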

  5. I had to run `pip install pycountry` after receiving an error message, so you should also list that under Dependencies

  6. I was able to run tests in a Jupyter notebook despite receiving the following message after importing nlp from nlpsummarize, which you may want to investigate:

*(screenshot: nlpsummarize download error message)*
  7. There seems to be a bug that results in a stop-word count of 179 for all the short test sentences I tried, including the one provided in your README and the ones in nlp.py
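For what it's worth, a constant count across different inputs suggests the count is being taken over the stop-word list itself rather than over the matching tokens. A minimal sketch of that suspected bug and a fix, using a stand-in stop-word set since I don't know the actual implementation:

```python
# Hypothetical sketch -- a small stand-in for nltk's stop-word list.
STOPWORDS = {"i", "the", "a", "is", "and", "to", "but", "can", "only", "in"}

def count_stopwords_buggy(sentence):
    # suspected bug: measures the stop-word list itself, so every
    # input yields the same count
    return len(STOPWORDS)

def count_stopwords_fixed(sentence):
    # count the sentence tokens that appear in the stop-word set
    return sum(1 for word in sentence.lower().split() if word in STOPWORDS)
```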

  8. The performance of the functions is better for English, which I assume is because of a larger corpus in fastText? For example, the polarity function was able to correctly identify the presence of one positive word and zero negative words in your test case "I love travelling to Japan and eating Mexican food but I can only speak English!". But it was not able to detect any positive or negative words in the following French or Chinese sentences:

Great work on this. It was a pleasure reviewing your package. I look forward to your replies to address my feedback and to using nlpsummarize in the future!

George

vermashivam679 commented 4 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Documentation

The package includes all the following forms of documentation:

Readme requirements The package meets the readme requirements below:

The README should include, from top to bottom:

Functionality

For packages co-submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)

Estimated hours spent reviewing:


Review Comments

Hi, Team nlpsummarize,

It was great going through your package. I really liked the motivation behind this package, and you guys have done a wonderful job in implementing your ideas. The code was easy to read because of proper comments & print statements. I also liked the automatic definition & references that pop up in the README, which makes it easier to browse through the repo.

Given the time constraint, I appreciate your efforts in creating this package with all its functionality. Here is my feedback on your package, which I hope helps you smooth out most of the rough edges:

Overall, I enjoyed going through your package and I hope that my review will help make your package better. Looking forward to having a productive conversation with all of you.

Thanks & Regards, Shivam Verma

schepal commented 4 years ago

Thank you for the feedback everyone. As per your suggestions, we edited the documentation in the README file and fixed up the `sentence_stopwords_freq` method documentation for greater clarity. Other minor typos and syntax errors were addressed to ensure the code works properly. Please see the final release here. Thanks again!

KarlosMuradyan commented 4 years ago

Hi @gptzjs and @vermashivam679. Just wanted to give more detail on what we changed and which issues we have addressed.

@gptzjs here are what we addressed from your comments:

3) Necessity of the model folder: the user can now specify the path where the downloaded model will be stored. It is no longer hardcoded to the model folder; the default value points to the current directory.
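A sketch of how such a configurable path might look (hypothetical function and parameter names; the package's actual signature may differ):

```python
import os

def get_model_path(path=".", name="lid.176.bin"):
    """Hypothetical sketch of the configurable-path change: `path`
    defaults to the current directory instead of a hardcoded `model/`
    folder. The download step (omitted here) would fetch the model
    weights if the file is missing."""
    return os.path.join(path, name)
```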

4) Example of usage typo: Done!!

5) Fixed!!

6) The following messages say that the fasttext pre-trained model hasn't been downloaded, because of either an internet issue or the issue mentioned in your 3rd comment. The other lines are not error messages; rather, they inform you that one of the columns has been picked automatically for the summary because you didn't explicitly specify one when creating the instance. To select a column manually, you can use the `column` argument.
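The selection logic described above might look roughly like this (a hypothetical sketch with pandas, not the package's actual code):

```python
import pandas as pd

def pick_text_column(df, column=None):
    # honour an explicitly requested column first
    if column is not None:
        if column not in df.columns:
            raise ValueError(f"Column {column!r} is not in the frame.")
        return column
    # otherwise fall back to the first object (string) column
    for col in df.columns:
        if df[col].dtype == object:
            return col
    raise ValueError("No text column found to summarize.")
```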

7) Fixed!!

8) Yes, you are correct. We haven't written any ML models for these tasks, but used the ones provided by the nltk and fasttext packages. Therefore, the performance of the summary generated by our package is highly dependent on the performance of the above-mentioned packages.

Thanks for your great feedback, we really appreciate that.

@vermashivam679 here are what we addressed from your comments:

1) Done

2) Done

3) Fixed!!

4) All the mentioned issues should be solved. `nltk.download('universal_tagset')` is now run automatically so that the user doesn't have to run all of these commands manually.

5) We show a message saying "Either column parameter is not defined or it is not present in the NLPFrame." Then our package tries to select a column automatically: if a text column (or columns) exists, we pick the first one; if no column is present, we don't do anything. The reason is that NLPFrame inherits from pd.DataFrame, so any operation, such as concatenation or column addition, is possible. The user can add a column with text and specify that column later when running any of our functions by giving the `column` argument a value. Otherwise, if the user tries to run a summary on an NLPFrame without any text column, we raise a ValueError with a clear error message.
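The inheritance pattern described above can be sketched as follows (a hypothetical simplification, not the real class):

```python
import pandas as pd

class NLPFrame(pd.DataFrame):
    """Sketch of the inheritance pattern: subclassing pd.DataFrame keeps
    every DataFrame operation (concatenation, column assignment, ...)
    available on an NLPFrame."""

    @property
    def _constructor(self):
        # make pandas operations return NLPFrame rather than DataFrame
        return NLPFrame

nf = NLPFrame({"id": [1, 2]})
nf["text"] = ["hello there", "general sentence"]  # plain DataFrame op works
```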

7) Our class is implemented in the nlp.py file, so the default import statement should be `from nlpsummarize import nlp`, as mentioned in the README.md. Without an alias, our class lives in the nlp module, not directly in the nlpsummarize package.

8) Done!!

9) We have two main files: one is the implementation of our class, while the other handles dependency issues. Dependency issues may arise if a user doesn't have the pre-trained models downloaded. The functions there handle that case automatically, downloading the weights of the deep-learning models that we use and storing them in the mentioned locations.

The problem with the dependency file is that it is not possible (or at least very difficult) to recreate in tests the situation where a user has or hasn't downloaded the files. If you take a look at the Codecov report, you can see that our main nlp.py file has code coverage above 92%, while the dependency file has coverage of around 50% because of the issues described above. We could remove that file and write all the instructions in the README.md as a post-installation guide, but that would make our users unhappy, because nobody wants to do extra manual work. This is why we end up with an overall code coverage of 87%.

Moreover, in the nlp.py file we also cannot increase our coverage further, because the uncovered lines (per the Codecov report) are the ones that depend on dependency.py's functions. Therefore, we could only increase this coverage by deleting that file, which is not a decision we want to make.

Not addressed:

6) In this case, you explicitly say that the column with text is 'text_column', which is not true. The error message could have been better and generated by us, but we rely on the message produced by the underlying function. That is not always the best choice, but we cannot anticipate and handle everything a user may enter; sometimes functions have to rely on the explicit values given by users. For example, when a user calls pd.read_csv() on a file whose columns are separated by ';' rather than ',', the function simply reads the file with the default separator, ','. We think information given explicitly by the user should be correct; otherwise, the user should expect that something may go wrong.
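The pd.read_csv analogy above, spelled out (standard pandas behaviour, nothing package-specific):

```python
import io
import pandas as pd

raw = "a;b\n1;2\n3;4\n"

# with the default sep="," everything lands in a single column
wrong = pd.read_csv(io.StringIO(raw))

# an explicit sep=";" parses the file as intended
right = pd.read_csv(io.StringIO(raw), sep=";")
```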

@vermashivam679, thank you as well for the very informative feedback.

Hope this clarifies all the issues you raised!