DerwenAI / pytextrank

Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
https://derwen.ai/docs/ptr/
MIT License

Is there a way we can prevent stop words from being part of ranked phrases? #172

Closed Ankush-Chander closed 3 years ago

Ankush-Chander commented 3 years ago

Is there a way I can prevent stop words from being part of ranked phrases?

For example,

I am getting the following variations in the ranked list:

the semantic similarity|0.04607325065321578|8
their semantic similarity|0.04607325065321578|2

a document|0.041466949367606365|6
each document|0.041466949367606365|4
every document|0.041466949367606365|2
one document|0.041466949367606365|2
the document|0.041466949367606365|10

desired list

semantic similarity
document

I was trying to avoid the redundant variations by adding a, each, every, one, the, their to the stop word config, but I figured out that the purpose of the current stopword config is just to prevent those words from participating as textrank nodes.

Thanks in advance.

louisguitton commented 3 years ago

The stopwords should be filtered out of the graph by https://github.com/DerwenAI/pytextrank/blob/main/pytextrank/base.py#L465 , which you can verify with the doc._.textrank.lemma_graph property, as well as, like you point out, via the centrality scores, which don't account for stop words.

What you're looking for is to filter stop words out of the candidate phrase generation. Two bits of the code are relevant here.

So maybe you can try something like

import spacy

stop_words = ["a", "each", "every", "one", "the", "their"]

def stop_words_scrubber(text: str) -> str:
    return " ".join([w for w in text.split() if w not in stop_words])

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank", config={"stopwords": {"word": ["NOUN"]}, "scrubber": stop_words_scrubber})

the code for the scrubber is very much pseudo-code, so bear with me @Ankush-Chander
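For what it's worth, the filtering idea itself can be sketched standalone, without spaCy (the stop word list here is just an illustration, not pytextrank's actual list):

```python
# Standalone sketch of the scrubber logic; the stop word set is illustrative.
stop_words = {"a", "each", "every", "one", "the", "their"}

def stop_words_scrubber(text: str) -> str:
    # Drop any token found in the stop word set, keeping the rest in order.
    return " ".join(w for w in text.split() if w not in stop_words)

print(stop_words_scrubber("the semantic similarity"))  # semantic similarity
print(stop_words_scrubber("each document"))            # document
```

This collapses the variations from the example above ("the semantic similarity", "their semantic similarity") onto the same scrubbed phrase.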

Ankush-Chander commented 3 years ago

Thanks @louisguitton. These pointers were quite helpful. The scrubber is very convenient for manipulating presentation-related output, while the stop word list controls what goes into the textrank calculation.

While trying out the scrubber pseudocode I got the following error:

Traceback (most recent call last):
  File "/home/ankush/.config/JetBrains/PyCharm2021.1/scratches/scratch_5.py", line 26, in <module>
    spacy_nlp.add_pipe("textrank", config={"stopwords": {"test": ["NOUN"]}, "scrubber":scrubber_func},last=True)
  File "/home/ankush/workplace/.virtualenv/pytextrank/lib/python3.8/site-packages/spacy/language.py", line 767, in add_pipe
    pipe_component = self.create_pipe(
  File "/home/ankush/workplace/.virtualenv/pytextrank/lib/python3.8/site-packages/spacy/language.py", line 629, in create_pipe
    raise ValueError(Errors.E961.format(config=config))
ValueError: [E961] Found non-serializable Python object in config. Configs should only include values that can be serialized to JSON. If you need to pass models or other objects to your component, use a reference to a registered function or initialize the object in your component.

{'stopwords': {'test': ['NOUN']}, 'scrubber': <function scrubber_func at 0x7fd6a4835280>}

As found here

Config values you pass to components need to be JSON-serializable and can’t be arbitrary Python objects. Otherwise, the settings you provide can’t be represented in the config.cfg and spaCy has no way of knowing how to re-create your component with the same settings when you load the pipeline back in. If you need to pass arbitrary objects to a component, use a registered function:
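The underlying constraint is easy to demonstrate outside of spaCy: a plain function object simply cannot be serialized to JSON, which is why passing one in the config triggers E961 (minimal illustration, not spaCy-specific):

```python
import json

def scrubber_func(text: str) -> str:
    return text

# The same shape of config that was passed to add_pipe above.
config = {"stopwords": {"test": ["NOUN"]}, "scrubber": scrubber_func}

try:
    json.dumps(config)
except TypeError as err:
    # Function objects are not JSON-serializable, hence spaCy's E961 error.
    print(f"not serializable: {err}")
```

Registering the function under a name (as in the working snippet below) lets spaCy store just the string reference in the config and look the function up when the pipeline is loaded.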

Adding the scrubber the following way worked for me.

import spacy
import pytextrank
spacy_model = "en_core_web_sm"
spacy_nlp = spacy.load(name=spacy_model)

# create a registered scrubber function
@spacy.registry.misc("stop_words_scrubber")
def stop_words_scrubber():
    def scrubber_func(text: str) -> str:
        articles = ["a", "the", "their"]
        return " ".join([w for w in text.split() if w not in articles])
    return scrubber_func

spacy_nlp.add_pipe("textrank", config={"stopwords": {"test": ["NOUN"]}, "scrubber": {"@misc": "stop_words_scrubber"}}, last=True)

@ceteri I intend to add the scrubber usage to the documentation after this section.