DerwenAI / pytextrank

Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
https://derwen.ai/docs/ptr/
MIT License

Why does the keyword phrase list include a PRON, like "it"? #271

Open chencjiajy opened 10 months ago

chencjiajy commented 10 months ago

I ran the following code snippet, and the output includes the word "it" even though pos_kept doesn't include PRON.

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank", config={'pos_kept': ["NOUN", "PROPN", "VERB"]})

text = '''The MCU SDK for WRG1 general firmware has been launched, and it can be automatically generated after creating the product.'''
doc = nlp(text)

for phrase in doc._.phrases[:10]:
    print(phrase.text, phrase.rank, phrase.count, phrase.chunks)

## the output is 
# the product 0.12286712485174818 1 [the product]
# WRG1 general firmware 0.10712303413227088 1 [WRG1 general firmware]
# The MCU SDK 0.0834726982382997 1 [The MCU SDK]
# it 0.0 1 [it]

ceteri commented 10 months ago

Hi @chencjiajy, great question.

The library considers noun chunks, and apparently spaCy parses the term "it" as one.

The coreference capabilities for spaCy are currently marked "experimental", which is a nice way to say "Good luck installing and running this part in production" :) I've evaluated multiple options for coreference (including the AllenNLP integration) and they each seem to have serious limitations. That said, if these capabilities were available, it would be relatively simple to resolve a pronoun reference within the graph. In that case, the term "it" would add more weight to "The MCU SDK" instead.

If you want, it might be good to add the term "it" to the stop words list for your application?

chencjiajy commented 10 months ago

Hi @ceteri, I found that adding "it" to the stop words list doesn't help, and the same goes for other single-word PRON phrases. Because pos_kept doesn't include PRON, there is no need to add single PRON words to the stop words. In the function _collect_phrases in base.py, pytextrank already skips any token whose POS is not in pos_kept when it sums the ranks for a span. So a phrase made of a single PRON word will always have a rank of 0.0, and all I need to do is filter out the phrases whose rank is equal to zero.

        phrases: typing.Dict[Span, float] = {
            span: sum(
                ranks[Lemma(token.lemma_, token.pos_)]
                for token in span
                if self._keep_token(token)
            )
            for span in spans
        }
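
The filtering step described above can be sketched roughly as follows. This is only a minimal illustration, assuming each phrase exposes a numeric rank attribute as in the printed output earlier in the thread. The helper name drop_zero_rank and the FakePhrase stand-in class are hypothetical and not part of pytextrank; the stand-in is there only so the snippet runs without loading a spaCy model.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass
class FakePhrase:
    """Hypothetical stand-in for a pytextrank phrase (text and rank only)."""
    text: str
    rank: float


def drop_zero_rank(phrases: Iterable[Any], min_rank: float = 0.0) -> List[Any]:
    """Keep only the phrases whose rank is strictly greater than min_rank."""
    return [p for p in phrases if p.rank > min_rank]


# Texts and ranks copied from the output reported at the top of this issue.
phrases = [
    FakePhrase("the product", 0.12286712485174818),
    FakePhrase("WRG1 general firmware", 0.10712303413227088),
    FakePhrase("The MCU SDK", 0.0834726982382997),
    FakePhrase("it", 0.0),
]

kept = drop_zero_rank(phrases)
for phrase in kept:
    print(phrase.text, phrase.rank)
```

With the real pipeline, the same helper would presumably be called as drop_zero_rank(doc._.phrases). Note that it drops any span in which no token passes _keep_token, not just single pronouns, since the sum over such a span is also zero.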