Open stuartspotlight opened 5 years ago
I've been experimenting with this and I think I've found a workaround, although it may be very computationally inefficient. I initialize the document, set the stopwords, then re-create the document. This not only seems to solve the problem of not being able to reset stopwords, it also seems to fix the issue of some stopwords not being picked up on the first pass. Needing to do this is very odd behavior, however:
import spacy
import textacy
def add_stopwords_in(doc, stopwords):
    for word in stopwords:
        doc.vocab[word].is_stop = True
    doc = textacy.spacier.utils.make_doc_from_text_chunks(doc.text, lang=model)
    return doc
#This is an example document I've made to show this issue happening
t = '''Here is an example document. It has a number of words. It is a good document.
Documents are good. Document for Documents.
Apple is a company worth over $1tr. We have to ask how many documents can a person write in a week.
The word documents is being deliberately overused. Just document it! Apples are a fruit I'm interested in.
How do you feel about apples, I'm a big fan of Apples. Is it all Apples or just the ones at the end of a sentence?'''
#initiate our model
model = spacy.load('en_core_web_sm')
#set the first set of stopwords
example_stops1 = ['Apples', 'Apple', 'apples', 'apple']
#add the stopwords to the model
#model.Defaults.stop_words |= set(example_stops1)
#create a document using the make doc from text tool in order to avoid problems
#with massive documents
doc = textacy.spacier.utils.make_doc_from_text_chunks(t, lang=model)
doc = add_stopwords_in(doc, example_stops1)
#check that the stopwords have been correctly identified
for word in doc:
    if word.is_stop:
        print(word)
    elif word.text in example_stops1:
        print(word, word.is_stop)
print("=====================")
del doc
doc = textacy.spacier.utils.make_doc_from_text_chunks(t, lang=model)
#remove the old stopwords
for word in example_stops1:
    doc.vocab[word].is_stop = False
#now set another set of stopwords
example_stops2 = ['Document', 'Documents', 'document', 'documents']
#add the new stopwords in
doc = add_stopwords_in(doc, example_stops2)
#check that the stopwords have been correctly identified
for word in doc:
    if word.is_stop:
        print(word)
    elif word.text in example_stops2:
        print(word, word.is_stop)
print("=====================")
I'm having numerous issues with stopwords when working with textacy's make_doc_from_text_chunks functionality.
Expected Behavior
I want to be able to load a model and then fire documents at it in order to find keywords. I want to do this in a way that lets me reset the stopwords from document to document.
Current Behavior
Setting stopwords for the first document works fine, but when I attempt to reset the stopwords for the next document, they appear to revert to the defaults and I cannot apply a new, custom set. Some stopwords also seem to be missed on the first pass.
Possible Solution
I think a flag is being set somewhere in textacy when I call make_doc_from_text_chunks to set the stopwords, and I can't for the life of me find a way to unset it. I would say this is a bug somewhere.
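The code in this thread sets the flag through `doc.vocab[word].is_stop`, so one way to keep the setting and unsetting in a single place is a small context manager that records the previous flags and restores them on exit. This is only a sketch: `temporary_stopwords`, `FakeLexeme` and `FakeVocab` are hypothetical names, and the fake vocab is a stand-in so the example runs without a model. With spaCy you would pass `model.vocab` instead, on the assumption that its lexeme flags behave the same way. Whether this avoids the behavior described above in spaCy 2.0.18 is untested.

```python
from contextlib import contextmanager


@contextmanager
def temporary_stopwords(vocab, words):
    """Flag `words` as stopwords on `vocab`, then restore the previous flags."""
    # Remember each word's current flag so it can be put back afterwards.
    previous = {word: vocab[word].is_stop for word in words}
    try:
        for word in words:
            vocab[word].is_stop = True
        yield vocab
    finally:
        for word, was_stop in previous.items():
            vocab[word].is_stop = was_stop


# Stand-in for a spaCy vocab (assumption: only `vocab[word].is_stop` is needed).
class FakeLexeme:
    def __init__(self):
        self.is_stop = False


class FakeVocab(dict):
    def __missing__(self, word):
        # Create an entry on first lookup.
        self[word] = FakeLexeme()
        return self[word]


vocab = FakeVocab()
with temporary_stopwords(vocab, ["Apple", "apple"]):
    inside = vocab["Apple"].is_stop
after = vocab["Apple"].is_stop
print(inside, after)
```

Any document that needs the custom stopwords would be created and inspected inside the `with` block, so the next set of documents starts from the original flags.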
Steps to Reproduce (for bugs)
In order to ensure reproducibility, I have provided both some example Python code showing the bug and a Dockerfile (in the environment section), which should make it easy to reproduce the problem. Example code:-
Details of the docker container are given in the environment section.
Context
I want to create a tool which produces keywords from arbitrarily large documents, with stopwords set based on their context, and which does not require a restart when processing a different set of documents. For example, a series of financial reports should not return "fiscal" or "financial" in their keywords, and the tool should not have to restart to process a series of performance reviews with "performance" set as a stopword.
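A possible alternative for this use case, which sidesteps the model's stopword flags entirely, is to keep each context's stopword set outside the model and filter candidate keywords against it. The sketch below assumes the candidates are plain strings (for spaCy tokens you could pass each token's `text`); `drop_stopwords` is a hypothetical helper, not part of spaCy or textacy.

```python
def drop_stopwords(tokens, extra_stops):
    """Return the tokens whose lowercased text is not in `extra_stops`."""
    # Compare case-insensitively so "Fiscal" and "fiscal" are treated alike.
    stops = {stop.lower() for stop in extra_stops}
    return [token for token in tokens if token.lower() not in stops]


# Each set of documents gets its own stopword list; nothing global is changed.
financial = drop_stopwords(
    ["Fiscal", "revenue", "financial", "growth"], ["fiscal", "financial"]
)
reviews = drop_stopwords(["performance", "teamwork"], ["performance"])
print(financial, reviews)
```

Because the stopword set is passed in on each call, switching from financial reports to performance reviews needs no restart and no reset step.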
Your Environment
Run in a docker container using the following code:-
and the requirements file is:-
spacy version: 2.0.18
spacy models: en, en_core_web_sm
textacy version: 0.6.2