Closed Ankush-Chander closed 3 years ago
The stop words should be filtered out from the graph because of https://github.com/DerwenAI/pytextrank/blob/main/pytextrank/base.py#L465 , which you can verify with the `doc._.textrank.lemma_graph` property, as well as, like you point out, by the centrality scores, which don't account for stop words.
What you're looking for is to filter out stop words from the candidate phrase generation. Two bits of the code are relevant:

- the `span`, which can contain stop words (which is fine)
- `self.scrubber`
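As a standalone sketch of what a scrubber does, independent of spaCy or pytextrank (the stop word list and example phrases here are purely illustrative):

```python
# A minimal sketch of a scrubber: a function that takes the text of a
# candidate phrase and strips stop words from it. The stop word list and
# the example phrases below are illustrative only, not from pytextrank.
STOP_WORDS = {"a", "each", "every", "one", "the", "their"}

def scrub(text: str) -> str:
    # drop any token that is a stop word, keep the rest in order
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

variants = ["the ranked phrases", "their ranked phrases", "ranked phrases"]
scrubbed = {scrub(v) for v in variants}
print(scrubbed)  # all three variants collapse to {'ranked phrases'}
```

Because the scrubber only rewrites the phrase text for presentation, redundant variants that differ only by a leading article collapse to a single entry.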
So maybe you can try something like:

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

def stop_words_scrubber(text: str) -> str:
    return " ".join([w for w in text.split() if w not in STOP_WORDS])

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank", config={"stopwords": {"word": ["NOUN"]}, "scrubber": stop_words_scrubber})
```
The code for the scrubber is very much pseudocode, so bear with me @Ankush-Chander
Thanks @louisguitton. These pointers were quite helpful. The scrubber is very convenient for manipulating presentation-related output, while the stop word list takes care of controlling what goes into the textrank calculation.
While trying out the scrubber pseudocode, I encountered the following error:
```
Traceback (most recent call last):
  File "/home/ankush/.config/JetBrains/PyCharm2021.1/scratches/scratch_5.py", line 26, in <module>
    spacy_nlp.add_pipe("textrank", config={"stopwords": {"test": ["NOUN"]}, "scrubber": scrubber_func}, last=True)
  File "/home/ankush/workplace/.virtualenv/pytextrank/lib/python3.8/site-packages/spacy/language.py", line 767, in add_pipe
    pipe_component = self.create_pipe(
  File "/home/ankush/workplace/.virtualenv/pytextrank/lib/python3.8/site-packages/spacy/language.py", line 629, in create_pipe
    raise ValueError(Errors.E961.format(config=config))
ValueError: [E961] Found non-serializable Python object in config. Configs should only include values that can be serialized to JSON. If you need to pass models or other objects to your component, use a reference to a registered function or initialize the object in your component.
{'stopwords': {'test': ['NOUN']}, 'scrubber': <function scrubber_func at 0x7fd6a4835280>}
```
As found here:

> Config values you pass to components need to be JSON-serializable and can't be arbitrary Python objects. Otherwise, the settings you provide can't be represented in the config.cfg and spaCy has no way of knowing how to re-create your component with the same settings when you load the pipeline back in. If you need to pass arbitrary objects to a component, use a registered function:
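To see why spaCy rejects the first attempt, note that a bare Python function simply cannot be represented in JSON. A quick standalone check, independent of spaCy:

```python
import json

# a plain config of strings and lists serializes fine
print(json.dumps({"stopwords": {"test": ["NOUN"]}}))

# but a Python function object has no JSON representation,
# which is the constraint behind spaCy's E961 error
def scrubber_func(text: str) -> str:
    return text

try:
    json.dumps({"scrubber": scrubber_func})
except TypeError as err:
    print("not serializable:", err)
```

Hence the config can only carry a *reference* to a registered function, not the function itself.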
Adding the scrubber the following way worked for me:

```python
import spacy
import pytextrank

spacy_model = "en_core_web_sm"
spacy_nlp = spacy.load(name=spacy_model)

# create a registered scrubber function
@spacy.registry.misc("stop_words_scrubber")
def stop_words_scrubber():
    def scrubber_func(text: str) -> str:
        articles = ["a", "the", "their"]
        return " ".join([w for w in text.split() if w not in articles])
    return scrubber_func

spacy_nlp.add_pipe(
    "textrank",
    config={"stopwords": {"test": ["NOUN"]}, "scrubber": {"@misc": "stop_words_scrubber"}},
    last=True,
)
```
@ceteri I intend to add the scrubber usage to the documentation after this section.
Is there a way I can prevent stop words from being part of the ranked phrases?
For example, I am getting the following variations in the ranked list:
desired list:
I was trying to avoid the redundant variations by setting
a, each, every, one, the, their
in the stop word config, but I figured that the purpose of the current stopword config is just to prevent those words from participating as textrank nodes.
Thanks in advance.