chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Issues with stopwords when working with make_doc_from_text_chunks #230

Open stuartspotlight opened 5 years ago

stuartspotlight commented 5 years ago

I'm having numerous issues with stopwords when working with textacy's make_doc_from_text_chunks functionality.

Expected Behavior

I want to be able to load a model and then fire documents at it in order to find keywords. I want to do this in a way that lets me reset the stopwords I'm using from document to document.

Current Behavior

Setting stopwords for the first document works fine, but when I attempt to reset the stopwords for the next document, textacy appears to revert to the default stopwords and does not let me use a new, custom set. It also seems to miss some stopwords on the first pass.

Possible Solution

I think a flag is being set somewhere in textacy when I call make_doc_from_text_chunks to set the stopwords, and I can't for the life of me find a way to unset it. I would say this is a bug somewhere.
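For what it's worth, my best guess (unverified) is that this is actually spaCy's lexeme cache rather than a textacy flag: in spaCy 2.x, is_stop appears to be computed when a lexeme is first created and then cached in the vocab, so editing Defaults.stop_words after a word has already been seen has no effect. A minimal sketch of what I mean, with no textacy involved at all:

import spacy

nlp = spacy.load('en_core_web_sm')

#parse once so the lexeme for "document" is created and cached, with
#is_stop computed from the stop_words set as it was at that point
nlp("a document")

#adding to the defaults now seems to be too late for the cached lexeme
nlp.Defaults.stop_words |= {'document'}
print(nlp("another document")[1].is_stop)  #expected True, comes out False

#setting the flag on the vocab directly does take effect
nlp.vocab['document'].is_stop = True
print(nlp("yet another document")[2].is_stop)  #True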

Steps to Reproduce (for bugs)

In order to ensure reproducibility I have provided both some example Python code showing the bug and a Dockerfile (in the environment section) which should make it easy to reproduce the problem. Example code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Feb  1 10:40:22 2019

@author: stuart
"""

#reporting textacy's stopwords bugs

import spacy
import textacy

#This is an example document I've made to show this issue happening
t = '''Here is an example document. It has a number of words. It is a good document.

Documents are good. Document for Documents.

Apple is a company worth over $1tr. We have to ask how many documents can a person write in a week.
The word documents is being deliberately overused. Just document it! Apples are a fruit I'm interested in.
How do you feel about apples, I'm a big fan of Apples. Is it all Apples or just the ones at the end of a sentence?'''

#problem 1: unable to change the stopwords of an already-loaded model

#load our model
model = spacy.load('en_core_web_sm')

#set the first set of stopwords
example_stops1 = ['Apples', 'Apple', 'apples', 'apple']

#add the stopwords to the model
model.Defaults.stop_words |= set(example_stops1)

#create a document using make_doc_from_text_chunks in order to avoid problems
#with massive documents
doc = textacy.spacier.utils.make_doc_from_text_chunks(t, lang=model)

#check that the stopwords have been correctly identified
for word in doc:
    if word.is_stop:
        print(word)

print("=====================")

#remove the first set of stopwords from the list of stopwords; this seems to work ok
model.Defaults.stop_words -= set(example_stops1)

#now set another set of stopwords 
example_stops2 = ['Document', 'Documents', 'document', 'documents']

#add the new set of stopwords to the model
model.Defaults.stop_words |= set(example_stops2)

#demonstrate that all the stopwords we want are present in the defaults
print("+++++++++++++++++++++++++++++++++++++")
print(model.Defaults.stop_words)
print("+++++++++++++++++++++++++++++++++++++")

#create a second document; the new stopwords should apply here, but they don't
doc = textacy.spacier.utils.make_doc_from_text_chunks(t, lang=model) 

#check that stopwords set 2 have been correctly identified
for word in doc:
    if word.is_stop:
        print(word)
    elif word.text in example_stops2:
        print(word, word.is_stop)

print("+++++++++++++++++++++++")

Details of the Docker container are given in the environment section.

Context

I want to create a tool which produces keywords from arbitrarily large documents, with stopwords set based on the documents' context, and which does not require a restart when processing a different set of documents. For example, a series of financial reports should not return "fiscal" or "financial" among their keywords, and the tool should not have to restart in order to process a series of performance reviews with "performance" set as a stopword.
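Concretely, the loop I'm after looks something like this (a sketch only; the texts are made up, the real keyword step is omitted, and the stopword swapping is exactly the part that currently doesn't work):

import spacy
import textacy

model = spacy.load('en_core_web_sm')

batches = [
    #(texts, words that are noise in this context)
    (["Revenue grew and the fiscal outlook is strong."], ['fiscal', 'financial']),
    (["Her performance exceeded all of her goals."], ['performance']),
]

for texts, stops in batches:
    model.Defaults.stop_words |= set(stops)  #per-context stopwords
    for text in texts:
        doc = textacy.spacier.utils.make_doc_from_text_chunks(text, lang=model)
        #keyword extraction over non-stopword tokens would go here
        print([word.text for word in doc if not word.is_stop])
    model.Defaults.stop_words -= set(stops)  #reset for the next batch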

Your Environment

Run in a Docker container built from the following Dockerfile:

FROM python:3.6

RUN apt-get update
RUN apt-get install build-essential -y

#install basic requirements
ADD ./requirements.txt /
RUN pip install -r /requirements.txt

#add nltk model
RUN python -m nltk.downloader 'punkt'

#Install spacy model
RUN python -m spacy download en

ADD reporting_textacys_stopwords_bug.py /

CMD ["python", "reporting_textacys_stopwords_bug.py"]

and the requirements file is:

numpy==1.16.1
nltk==3.3
rake-nltk==1.0.1
scipy==1.0.0
spacy==2.0.18
textacy==0.6.2
stuartspotlight commented 5 years ago

I've been experimenting with this and I think I've found a workaround, although it may be very computationally inefficient. I initiate the document, then set the stopwords, then re-initiate the document. This not only seems to solve the issue of not being able to reset stopwords, it also seems to fix the issue of some stopwords not being picked up on the first pass. The need to do this is very odd behavior, however:

import spacy
import textacy

def add_stopwords_in(doc, stopwords):
    #flag each stopword directly on the shared vocab
    for word in stopwords:
        doc.vocab[word].is_stop = True
    #re-create the doc so the flags are picked up (relies on the global model)
    doc = textacy.spacier.utils.make_doc_from_text_chunks(doc.text, lang=model)
    return doc

#This is an example document I've made to show this issue happening
t = '''Here is an example document. It has a number of words. It is a good document.

Documents are good. Document for Documents.

Apple is a company worth over $1tr. We have to ask how many documents can a person write in a week.
The word documents is being deliberately overused. Just document it! Apples are a fruit I'm interested in.
How do you feel about apples, I'm a big fan of Apples. Is it all Apples or just the ones at the end of a sentence?'''

#load our model
model = spacy.load('en_core_web_sm')

#set the first set of stopwords
example_stops1 = ['Apples', 'Apple', 'apples', 'apple']

#deliberately NOT adding the stopwords via the defaults this time
#model.Defaults.stop_words |= set(example_stops1)

#create a document using make_doc_from_text_chunks in order to avoid problems
#with massive documents
doc = textacy.spacier.utils.make_doc_from_text_chunks(t, lang=model)

doc = add_stopwords_in(doc, example_stops1)
#check that the stopwords have been correctly identified
for word in doc:
    if word.is_stop:
        print(word)
    elif word.text in example_stops1:
        print(word, word.is_stop)

print("=====================")

del doc

doc = textacy.spacier.utils.make_doc_from_text_chunks(t, lang=model)

#remove the old stopwords
for word in example_stops1:
    doc.vocab[word].is_stop = False

#now set another set of stopwords 
example_stops2 = ['Document', 'Documents', 'document', 'documents']

#add the new stopwords in
doc = add_stopwords_in(doc, example_stops2)

#check that the stopwords have been correctly identified
for word in doc:
    if word.is_stop:
        print(word)
    elif word.text in example_stops2:
        print(word, word.is_stop)

print("=====================")