DerwenAI / pytextrank

Python implementation of TextRank algorithms ("textgraphs") for phrase extraction
https://derwen.ai/docs/ptr/
MIT License

AttributeError: [E046] while summarizing with PositionRank/Biased TextRank #173

Closed zykerli closed 1 year ago

zykerli commented 3 years ago

Hello, I'm trying to use the PositionRank and Biased TextRank algorithms you provide on German text, with the following code.

import spacy
spacy_model = "de_core_news_lg"

spacy_nlp = spacy.load(name=spacy_model,disable=["lemmatizer"])
spacy_nlp.add_pipe(factory_name="positionrank", name="positionrank", last=True)

text = "Das ist ein Test. Bitte fasse mich zusammen!"

import pytextrank
doc = spacy_nlp(text)

summary = list(doc._.positionrank.summary(limit_phrases=1, limit_sentences=1, preserve_order=False))

Unfortunately, this raises AttributeError: [E046]. It looks like ._.positionrank is not registered as an extension attribute. The same code works fine when "positionrank" is replaced with "textrank" (using doc._.textrank). I'm using pytextrank version 3.1.1.

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-117-5f8747633f63> in <module>
     11 
     12 
---> 13 summary = list(doc._.positionrank.summary(limit_phrases=1, limit_sentences=1, preserve_order=False))
     14 print(summary)

~/PycharmProjects/Test_project/venv/lib/python3.8/site-packages/spacy/tokens/underscore.py in __getattr__(self, name)
     30     def __getattr__(self, name):
     31         if name not in self._extensions:
---> 32             raise AttributeError(Errors.E046.format(name=name))
     33         default, method, getter, setter = self._extensions[name]
     34         if getter is not None:

AttributeError: [E046] Can't retrieve unregistered extension attribute 'positionrank'. Did you forget to call the `set_extension` method?

EDIT

As I can see from pytextrank/pytextrank/positionrank.py, lines 23-50 (see below), PositionRank is registered with set_extension, but the extension is still named "textrank" (not "positionrank").

    def __call__ (
        self,
        doc: Doc,
        ) -> Doc:

        """
Set the extension attributes on a `spaCy` [`Doc`](https://spacy.io/api/doc)
document to create a *pipeline component* for `PositionRank` as
a stateful component, invoked when the document gets processed.
See: <https://spacy.io/usage/processing-pipelines#pipelines>

    doc:
a document container, providing the annotations produced by earlier stages of the `spaCy` pipeline
        """

        Doc.set_extension("textrank", force=True, default=None)
        Doc.set_extension("phrases", force=True, default=[])

        doc._.textrank = PositionRank(
            doc,
            edge_weight = self.edge_weight,
            pos_kept = self.pos_kept,
            token_lookback = self.token_lookback,
            scrubber = self.scrubber,
            stopwords = self.stopwords,
            )

        doc._.phrases = doc._.textrank.calc_textrank()
        return doc

My code at the beginning runs when I change the last line from summary = list(doc._.positionrank.summary(limit_phrases=1, limit_sentences=1, preserve_order=False)) to summary = list(doc._.textrank.summary(limit_phrases=1, limit_sentences=1, preserve_order=False)). But is PositionRank really being used, or TextRank? Extending the tutorial to cover the algorithms besides TextRank might clarify things.

Ankush-Chander commented 3 years ago

Hey @dblaszcz, thank you for the detailed description of the issue. As you rightly figured out, applying any one of the variants "textrank", "positionrank", or "biasedtextrank" attaches the extension textrank to the doc.

That can be verified by checking the type of doc._.textrank:

print(type(doc._.textrank))
# prints
# in case of textrank
# <class 'pytextrank.base.BaseTextRank'>

# in case of positionrank
# <class 'pytextrank.positionrank.PositionRank'>

# in case of biasedtextrank
# <class 'pytextrank.biasedrank.BiasedTextRank'>
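
To illustrate why doc._.positionrank fails while doc._.textrank works, here is a minimal sketch in plain Python. It is only a simplified stand-in for the extension mechanism and is not spaCy or pytextrank code: Underscore, FakeDoc, set_extension, make_component, BaseTextRankSketch and PositionRankSketch are all hypothetical names made up for this example. The assumption it encodes is the one from the excerpt above: every variant registers its result under the single name "textrank".

```python
# Simplified stand-in for the `doc._` extension mechanism.
# Nothing here is spaCy or pytextrank code; all names are hypothetical.


class Underscore:
    """Mimics `doc._`: only registered attribute names can be read or set."""

    _extensions = {}  # shared registry: attribute name -> default value

    def __init__(self):
        # Bypass our own __setattr__ for the internal value store.
        object.__setattr__(self, "_values", {})

    def __getattr__(self, name):
        if name not in self._extensions:
            raise AttributeError(f"unregistered extension attribute '{name}'")
        return self._values.get(name, self._extensions[name])

    def __setattr__(self, name, value):
        if name not in self._extensions:
            raise AttributeError(f"unregistered extension attribute '{name}'")
        self._values[name] = value


def set_extension(name, default=None):
    """Register an attribute name so it becomes reachable through `doc._`."""
    Underscore._extensions[name] = default


class FakeDoc:
    """Stand-in for a processed document."""

    def __init__(self, text):
        self.text = text
        self._ = Underscore()


class BaseTextRankSketch:
    def __init__(self, doc):
        self.doc = doc


class PositionRankSketch(BaseTextRankSketch):
    pass


def make_component(algorithm_cls):
    """Build a pipeline component that stores its algorithm under 'textrank'."""

    def component(doc):
        # The attribute name is the same for every variant; only the class differs.
        set_extension("textrank", default=None)
        doc._.textrank = algorithm_cls(doc)
        return doc

    return component


# The factory name selects the algorithm, not the attribute name.
FACTORIES = {
    "textrank": make_component(BaseTextRankSketch),
    "positionrank": make_component(PositionRankSketch),
}

doc = FACTORIES["positionrank"](FakeDoc("Das ist ein Test."))
algorithm_name = type(doc._.textrank).__name__

try:
    doc._.positionrank
    lookup_error = None
except AttributeError as err:
    lookup_error = str(err)

print(algorithm_name)
print(lookup_error)
```

In this sketch the name passed to the factory lookup only decides which class ends up behind doc._.textrank, which is why checking the type is the way to tell the variants apart.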

Also, in the first snippet you shared, this statement

import pytextrank

should be placed before

spacy_nlp.add_pipe(factory_name="positionrank", name="positionrank", last=True)

I hope it helps.

ceteri commented 3 years ago

I've noticed that the pipeline extensions tend not to show up in the spaCy pipeline analysis, for example when running:

print("pipeline", nlp.pipe_names)
nlp.analyze_pipes(pretty=True)

I can raise a question on the spaCy forums to find out if there are ways to register pipeline extensions.

louisguitton commented 3 years ago

I see the extension in the pipeline analysis using this snippet.

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("positionrank", last=True)

assert "positionrank" in nlp.pipe_names
assert "positionrank" in nlp.analyze_pipes()['summary']

Output looks like this for me

>>> nlp.analyze_pipes(pretty=True)['summary']
============================= Pipeline Overview =============================

#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False

1   tagger            token.tag                        tag_acc            False

2   parser            token.dep                        dep_uas            False
                      token.head                       dep_las
                      token.is_sent_start              dep_las_per_type
                      doc.sents                        sents_p
                                                       sents_r
                                                       sents_f

3   ner               doc.ents                         ents_f             False
                      token.ent_iob                    ents_p
                      token.ent_type                   ents_r
                                                       ents_per_type

4   attribute_ruler                                                       False

5   lemmatizer        token.lemma                      lemma_acc          False

6   positionrank                                                          False

✔ No problems found.

Maybe it's a version issue, @ceteri? (I'm using spacy==3.0.6 and pytextrank==3.1.2 for this test.)

ceteri commented 3 years ago

Thank you @louisguitton. Looking at this again: since pytextrank assigns custom attributes, these don't show up in the pipeline analysis.