Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

**Documentation:** The package includes all the following forms of documentation, with metadata in the `setup.py` file or elsewhere.

**Readme requirements:** The package meets the readme requirements below; the README should include the required items, from top to bottom.

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted. The package contains a `paper.md` matching JOSS's requirements.
Estimated hours spent reviewing: 2.5 hours
Hi Vignesh, Karanpal, Samneet and Karlos!
First of all, I think this is a great package idea which has a lot of utility in a field which I am really interested in to the point that I contemplated doing MDS-CL.
Please find below my review and feedback for your consideration.
Function documentation was concise, informative, and effective at explaining the utility of the package. It immediately resonated with me: just this morning I had a conversation with my best friend, managing partner of a major law firm in Vancouver, about how to incorporate ML to improve business intelligence and streamline lawyer workflow and productivity, and I immediately thought of `nlpsummarize`!
Package installation was quick and easy, and I was able to run your good mix of tests, plus tests of my own in English, French, and Simplified Chinese (didn't expect that to work!), once I had installed all the dependencies. I did encounter some minor issues, which are noted below.
Suggest amending the README instruction under Dependencies as shown below, to specify that the file needs to be saved in the `model` folder, which one of you kindly informed me of after I got the `No module named 'fasttext'` error:

> Please note that the `fasttext` library must use the specific pre-trained model called `lid.176.bin`, which should be saved to the `model` folder and can be downloaded here.
Your test case in the README needs to include line breaks when passed to `nlp.NLPFrame` as the dictionary value.
I had to run `pip install pycountry` after receiving an error message, so you should also list that under Dependencies.
I was able to run tests in a Jupyter notebook despite receiving the following message after importing `nlp` from `nlpsummarize`, which you may want to investigate:
There seems to be a bug which resulted in the stop word count being 179 for all the short test sentences I tried, including the one provided in your README and the ones in `nlp.py`.
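For context on why a constant count is suspicious: a per-sentence stop-word count should vary with the input sentence (179 happens to be roughly the size of nltk's full English stop-word list, which may be what is being counted instead). A minimal sketch of the expected behaviour, using a tiny hypothetical stop-word set rather than the package's real nltk list:

```python
# Hypothetical illustration: a stop-word count should depend on the input.
# STOPWORDS is a tiny stand-in for nltk's real English list.
STOPWORDS = {"i", "the", "a", "to", "and", "but", "can", "only"}

def stopword_count(sentence):
    """Count stop words in one sentence (naive lowercase split)."""
    words = sentence.lower().replace("!", " ").replace(".", " ").split()
    return sum(w in STOPWORDS for w in words)

print(stopword_count("I love travelling to Japan"))  # 2, not 179
```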
The performance of the functions is better for English, which I assume is because of a larger corpus in fastText. For example, the `polarity` function was able to correctly identify the presence of one positive word and zero negative words in your test case "I love travelling to Japan and eating Mexican food but I can only speak English!", but it was not able to detect any positive or negative words in the following French or Chinese sentences:
French: "Bonjour mon ami! J'aime manger du poisson mais je deteste la viande" (Translation: "Hello my friend! I love eating fish, but I hate meat")
Chinese: "你好!我爱睡觉但是我不喜欢学习!" (Translation: "Hello! I love sleeping but I don't like studying!")
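To illustrate why a lexicon-based polarity count misses the French and Chinese examples: such a function tallies words against fixed, largely English, positive/negative word lists, so non-English tokens simply never match. A toy sketch with hypothetical word lists (not the package's actual implementation):

```python
# Toy polarity counter with hypothetical lexicons; real implementations
# rely on sentiment resources that are predominantly English.
POSITIVE = {"love", "like", "great", "good"}
NEGATIVE = {"hate", "bad", "terrible"}

def polarity_counts(text):
    words = text.lower().replace("!", " ").replace(",", " ").split()
    return (sum(w in POSITIVE for w in words),
            sum(w in NEGATIVE for w in words))

print(polarity_counts("I love travelling to Japan"))  # (1, 0)
print(polarity_counts("J'aime manger du poisson"))    # (0, 0): no match
```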
Great work on this. It was a pleasure reviewing your package. I look forward to your replies to address my feedback, and to using `nlpsummarize` in the future!
George
Estimated hours spent reviewing:
Hi, Team `nlpsummarize`,
It was great going through your package. I really liked the motivation behind this package, and you guys have done a wonderful job in implementing your ideas. The code was easy to read because of proper comments & print statements. I also liked the automatic definition & references that pop up in the README, which makes it easier to browse through the repo.
Given the time constraint, I appreciate your efforts in creating this package with all its functionality. Here is my feedback on your package, which can help you refine most of the rough edges:
I would suggest modifying the description of `sentence_stopwords_freq` to make it clearer, because I am not sure how you are checking 'the proportion on the number of sentences'.
It would be nice to include a link to the documentation (readthedocs.org & test.pypi.org).
In the usage section, I think there is a typo: it should be `nlpsummarize` instead of `nlpsummarizer`.
I had some trouble downloading the `fasttext` package on Windows, but it worked fine on Linux; not sure if it has something to do with my system. Apart from `fasttext`, I had to install `pycountry`, and `nltk.download('universal_tagset')` was required. Also, when I try to import `fasttext` from a location other than the root of the repo, I get a message saying 'Something went wrong when downloading!!'; I would suggest modifying it to say 'Try importing from the root of the package repo'.
Maybe put a warning at line 59 of `nlp.py` saying that 'the data frame has no text column'.
There is no clear error message when I use this data for testing functions of the package: `df = nlp.NLPFrame({'text_col': [4, 5, 6]}, column='text_col')`. It should give a warning, like in the other cases, that no text column was detected. Please see the snapshot attached.
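The kind of guard I have in mind might look like this (the function name and message are hypothetical, not part of the package):

```python
import warnings

# Hypothetical guard: warn clearly when the user-specified column holds
# no strings, instead of failing later with an opaque error.
def check_text_column(data, column):
    values = list(data.get(column, []))
    if not values or not all(isinstance(v, str) for v in values):
        warnings.warn(f"No text detected in column '{column}'.")
        return False
    return True

check_text_column({'text_col': [4, 5, 6]}, 'text_col')  # warns
```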
In the function docstrings, maybe use `nlpsummarize.NLPFrame` instead of `nlp.NLPFrame`, because the user doesn't know about the alias used for `nlpsummarize`.
At lines 418 & 442 of `nlp.py`, it might be good to write the filenames present in the repo, or write it like this
If possible, try to increase code coverage to 90%.
Overall, I enjoyed going through your package and I hope that my review will help make your package better. Looking forward to having a productive conversation with all of you.
Thanks & Regards, Shivam Verma
Thank you for the feedback, everyone. As per your suggestions, we edited the documentation in the README file and fixed up the `sentence_stopwords_freq` method documentation for greater clarity. Other minor typos and syntax errors were addressed to ensure the code works properly. Please see the final release here. Thanks again!
Hi @gptzjs and @vermashivam679. Just wanted to give more detailed information about what we changed and which issues we addressed.
@gptzjs here are what we addressed from your comments:
3) Necessity of the `model` folder: the user can now specify the path where the downloaded model will be stored. It is no longer hardcoded to the `model` folder, but defaults to the current directory.
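A sketch of that path-resolution behaviour, with illustrative names (not necessarily the package's exact function signatures):

```python
from pathlib import Path

# Illustrative sketch: the model location is a parameter defaulting to
# the current directory instead of a hardcoded "model" folder.
def resolve_model_path(model_dir=None, filename="lid.176.bin"):
    base = Path(model_dir) if model_dir is not None else Path.cwd()
    return base / filename

print(resolve_model_path())         # <cwd>/lid.176.bin by default
print(resolve_model_path("model"))  # model/lid.176.bin when requested
```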
4) Example of usage typo: Done!!
5) Fixed!!
6) The following messages say that the fasttext pre-trained model hasn't been downloaded, because of either an internet issue or the issue mentioned in your 3rd comment. The other lines are not error messages; rather, they inform you that one of the columns has been automatically picked for the summary because you didn't explicitly mention any when creating the instance. To manually select a column, you can use the `column` argument.
7) Fixed!!
8) Yes, you are correct. We haven't written any ML models for the tasks, but used the ones provided by the `nltk` and `fasttext` packages. Therefore, the performance of the summary generated by our package is highly dependent on the performance of the above-mentioned packages.
Thanks for your great feedback, we really appreciate that.
@vermashivam679 here are what we addressed from your comments:
1) Done
2) Done
3) Fixed!!
4) All the mentioned issues should be solved. `nltk.download('universal_tagset')` is now run automatically so that users don't have to run these steps manually.
5) We have a message saying "Either the column parameter is not defined or it is not present in the NLPFrame." Our package then tries to select a column automatically: if a text column (or columns) exists, we pick the first one; if none is present, we don't do anything. The reason is that `NLPFrame` inherits from `pd.DataFrame`, so any operation, like concatenation or column addition, is possible. A user can add a column with text and specify that column later when running any of our functions by giving the `column` argument a value. Otherwise, if a user tries to run a summary on an `NLPFrame` without any text column, we raise a `ValueError` with a valid error message.
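The selection logic described above can be sketched roughly like this (the function name is assumed; the real logic lives inside `NLPFrame`):

```python
import pandas as pd

# Rough sketch of the described behaviour: honour an explicit column if it
# exists, otherwise fall back to the first all-string column, otherwise
# raise ValueError.
def pick_text_column(df, column=None):
    if column is not None and column in df.columns:
        return column
    for col in df.columns:
        if df[col].map(lambda v: isinstance(v, str)).all():
            return col
    raise ValueError("No text column present in the NLPFrame.")

df = pd.DataFrame({'n': [1, 2], 'text': ['hello', 'world']})
print(pick_text_column(df))  # text
```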
7) Our class is implemented in the `nlp.py` file, so the default import statement should be `from nlpsummarize import nlp`, as mentioned in the `README.md`. Without an alias, our class lives in `nlp`, not in `nlpsummarize`.
8) Done!!
9) We have two main files: one is the implementation of our class, while the other handles dependency issues. The dependency issues may arise if a user doesn't have the pre-trained models downloaded; the functions there automatically handle that case, downloading the weights of the deep learning models we use and storing them in the mentioned locations.
The problem with the dependency file is that we are not able (or at least it is very difficult) to recreate the situation where a user has or doesn't have the files downloaded. If you take a look at the Codecov report, you can see that our main `nlp.py` file has code coverage of more than 92%, while the dependency file has coverage of around 50% because of the issues described above. We could remove that file and write all the instructions in the README.md as post-installation guides, but that would make our users unhappy, because nobody wants to do additional work. This is the reason that in the end we have overall code coverage of 87%.
Moreover, in the `nlp.py` file we also cannot increase our coverage further, because the uncovered lines (from the Codecov report) are the ones that depend on the functions in `dependency.py`. Therefore, we could increase this coverage only by deleting that file, which is not a decision we want to make.
Not addressed:
6) In this case, you explicitly say that the column with text is 'text_col', which is not true. The error message could have been better and generated by us, but we rely on the message generated by the underlying function. That is not always a good choice, but for various reasons we can't handle every case a user may enter. Sometimes functions should rely on the explicit values given by users. For example, when a user passes `pd.read_csv(.)` a file where columns are separated by ';' and not ',', the function just reads the file with the default symbol, i.e. ','. We think additional information given explicitly by a user should be correct; otherwise, the user should expect that something may go wrong.
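The `read_csv` analogy above, as a small self-contained illustration:

```python
import io
import pandas as pd

# pandas assumes ',' as the separator unless told otherwise, so a
# ';'-separated file must be read with an explicit sep argument.
raw = "a;b\n1;2\n"
wrong = pd.read_csv(io.StringIO(raw))           # single column named 'a;b'
right = pd.read_csv(io.StringIO(raw), sep=";")  # columns 'a' and 'b'
print(list(wrong.columns), list(right.columns))
```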
@vermashivam679, thank you as well for the very informative feedback.
Hope this clarifies all the issues you raised!
name: nlpsummarize
about: Python package that provides a nice summary of all character columns in a pandas dataframe.

Submitting Author: Vignesh Chandrasekaran (@vigchandra), Karlos Muradyan (@KarlosMuradyan), Karanpal Singh (@singh-karanpal), Sam Chepal (@schepal)
Repository: https://github.com/UBC-MDS/nlpsummarize
Version submitted: 1.1.0
Editor: Varada (@kvarada)
Reviewer 1: Shivam Verma (@vermashivam679)
Reviewer 2: George Thio (@gptzjs)
Archive: TBD
Version accepted: TBD
Description
One of the most relevant applications of machine learning for corporations globally is the use of natural language processing (NLP). Whether it be parsing through business documents to gather key word segments or detecting Twitter sentiment of a certain product, NLP’s use case is prevalent in virtually every business environment.
Our library specifically will make extensive use of pre-existing packages in the Python eco-system. We will use the nltk library to build most of the sentiment analysis functions while also leveraging well-known packages such as pandas to aid in the overall presentation of our final output results.
Scope
* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
Any data scientist who deals with textual data would likely use this package to get quick summaries of the data they are dealing with.
Unfortunately, there are few tools today which provide summary statistics on textual data that a user may want to analyze. Our goal with this package is to provide users with a simple and flexible tool to gather key insights that would be useful during the exploratory data analysis phase of the data science workflow.
To the best of our knowledge, there is no other package that combines all the below-mentioned functionality in one.
@tag the editor you contacted:

Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements). Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements](https://joss.readthedocs.io/en/latest/submitting.html#submission-requirements): "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements](https://joss.readthedocs.io/en/latest/submitting.html#what-should-my-paper-contain) with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Code of conduct
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
Editor and review templates can be found here